Toxic speech: Approaches to modeling and moderating toxic language in online text-based environments
Society for Textual Scholarship 2014 conference Seattle, WA
Ben Miller and Nicholas Subtirelu, Georgia State University
Structure of the talk
1. Brief description of the research topic that provided the impetus for this discussion of text analysis methods
2. Overview of the problems inherent in researching the topic through computational text analysis
3. Introduction to some possible (though imperfect) solutions
Toxic behavior and internet gaming
Internet gaming tends to involve conflict including toxic behavior due in part to aspects of the context:
Online disinhibition effect (Suler, 2004)
Tendency of many community leaders to ignore the presence of harmful ideologies, e.g., sexism (Salter and Blodgett, 2012)
Can be studied using ethnographic methods
e.g., Nakamura (2009) found World of Warcraft players were creating media that represented Asians using racist discourses similar to those found during the construction of railroads in the Western US
How do we study and moderate toxic behavior en masse?
Quantitative text analysis approaches assume:
The most pervasive, concrete manifestation of in-game toxic behavior can be found in players' production of text
As such, these texts can be studied using approaches developed in linguistics
The dilemma
Most tools for automatically processing written text are built to work with 'standard' forms of the language. However, computer-mediated communication (CMC) differs in ways that reflect the meaning-making potential of its platforms. Thus, we need solutions for preparing CMC texts for processing while preserving the features of CMC that differ from 'standard' written language.
A 'text' as envisioned by traditional text analysis
The text (from LA Times)
The magnitude 4.4 earthquake that struck near Westwood is the most significant shake in Southern California since a 5.5 earthquake hit Chino Hills in 2008, a U.S. Geological Survey seismologist told reporters at a news conference Monday morning.
Features
Edited, 'standard' orthography relying on consistent spelling choices
Punctuation that reliably indicates ends of sentences
Robert Graves said there have been at least six aftershocks since the 6:25 a.m. earthquake. The largest so far has been a magnitude 2.7 earthquake that struck five miles northwest of Westwood.
A single text producer
A single language variety
Source: http://www.latimes.com/local/lanow/la-meln-earthquake-foreshock20140317,0,2194994.story#axzz2wEmXx9Iz
Nonstandard orthography and neologisms
World of Warcraft transcript, Nardi et al. (2007, p. 4)
Drollnar: Hey any drinkers do you want this Volatile Rum?
Akiraa: me
Miggles: no don't drink …expensive for alchamy
Physikz: lol ME ME ME!
Miggles: sell in ah [#Auction House] for alot of money
Drollnar: is it a trade good
Miggles: yes need it to make goblin rocket fuel
Drollnar: well u learn some thing new every day
Miggles: good thing to for money…don't drink it lol
Drollnar: well sorry i am going 2 have to sell this but i will send both people that want ed it something nice soon
Multithreaded discourse, often structured by turns not sentences
Erie Island transcript, from Liang (2012, p.469)
Dora: guys, I want find a fire monster sword
Dora: and I need ur help
Marvel: Alright everyone get drunk!
no end punctuation
Marvel: and prepare to leave for the next destination!
Dora: the sword can make me stronger
Jess: i gotta feeling
Marvel: and great adventure!
Cindy: how to help u?
'nonstandard' end punctuation
Jess: tonight is gonna be a good night
Dora: come and u will see
Plurilingualism and codeswitching
MSN chat transcript, from Seargeant and Tagg (2011, p. 508)
Dream: I got a bar of white choc
Dream: Someone put it in the envelope [#second language English speaker's article usage] and wrote my name on it
Dream: And put it in my flat postbox
Big: 555 [#number 5 in Thai is pronounced "ha"] Don't eat much na
Dream: I ate it up laew [#Thai: already] lol
Dream: Aroi duey [#Thai: It was delicious]
Working with 'nonstandard' orthography
Possible causes of 'nonstandard' orthography:
Typos or other unintentional misspellings (e.g., <alchamy> for alchemy in the previous slide)
Attempts to model phonological features of spoken language varieties (e.g., <goin> for going)
Attempts to model prosodic features of spoken language varieties (e.g., <soooo> or <sooooooooooooo> for so)
The key here is to place strings into categories known as lemmas
<alchamy> and <alchemy> → lemma alchemy
<goin> and <going> → lemma going (or go)
<so>, <soooo>, and <sooooooooooooo> → lemma so
Lemmatization
Historical linguistics faces a similar problem, namely studying text prior to the standardization of writing systems in languages like Dutch (Kestemont et al., 2010), English (Pilz et al., 2008), and German (Pilz et al., 2006). Researchers propose that, because orthographic representation is phonologically constrained, simple replacement rules based on speech varieties can effectively eliminate some variation:
-in$_VERB → -ing$_VERB (e.g., <goin> → lemma go)
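The replacement-rule idea can be sketched as a small set of regular-expression rewrites applied before lemma lookup. This is a minimal illustration, not the cited researchers' rule sets; note the slide restricts rules like -in$ to verbs, while this sketch omits POS filtering.

```python
import re

# Illustrative rewrite rules: each one maps a 'nonstandard' spelling
# pattern to its standard form before lemma lookup.
RULES = [
    (re.compile(r"in$"), "ing"),        # <goin> -> <going>  (slide: verbs only)
    (re.compile(r"(.)\1{2,}"), r"\1"),  # <soooo> -> <so>    (collapse letter runs)
]

def normalize(token):
    """Apply each replacement rule in order and return the rewritten token."""
    for pattern, replacement in RULES:
        token = pattern.sub(replacement, token)
    return token
```

In a real pipeline the rule order matters, and each rule would be conditioned on part-of-speech to avoid rewriting words like <begin>.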
Lemmatization
Computational linguists have developed ways of matching uncategorized strings with lemmas based on distance measures, called approximate string matching (Navarro, 2001)
First, all strings in a document are assigned lemmas on the basis of routine rules (e.g., <bake> → lemma bake, <baked> → lemma bake)
Second, remaining strings are assigned to lemmas on the basis of a distance measure such as Levenshtein (or edit) distance:
e.g., Levenshtein distance between <freedumb> and <freedom> = 2
Step 1: substitute <u> for <o>: <freedum>
Step 2: insert <b>: <freedumb>
Distance would be much higher for other lemmas like history
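The distance measure and the lemma assignment step can be sketched as follows; the dynamic-programming implementation of Levenshtein distance is standard, and `nearest_lemma` is an illustrative helper, not a published algorithm.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: the minimum number of
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def nearest_lemma(token, lemmas):
    """Assign an uncategorized string to the closest known lemma."""
    return min(lemmas, key=lambda lemma: levenshtein(token, lemma))
```

For example, <freedumb> is distance 2 from freedom but much further from history, so it is assigned to the lemma freedom.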
Lemmatization
A complication: Neologisms
We have to remain sensitive to the possibility that certain strings should not be forced into existing lemmas but instead represent newly emerging lemmas. Researchers should keep track of frequent replacements resulting from lemmatization and examine them in context for evidence of neologisms.
Preserving orthographic variation:
<word lemma="go">goin</word>
Working with multithreaded discourse, structured by turns
We may want to do automatic analyses that look beyond the level of the individual turn (e.g., when doing corpus-based critical discourse analysis; see Baker et al., 2008). We need an approach for telling the computer to view a text as a set of turns that are grouped within an intermediary category: conversational threads.
Turns in one thread are not necessarily adjacent
Working with multithreaded discourse, structured by turns
Chat text
Thread A: turns #1, #5, #6
Thread B: turns #2, #3
Thread C: turns #4, #7, #8, #9, #10
Thread disentanglement
Uthus and Aha (2013) review a number of approaches to thread disentanglement, many of which are highly successful. Most approaches rely on clustering methods (e.g., Mayfield et al., 2012) that take into consideration some or all of the following features when attempting to place each turn into a conversational thread:
Temporal distance (amount of time elapsed between turns)
Spatial distance (number of interceding turns)
Semantic content of messages (e.g., using LSA or LDA)
Identity of the participants
Explicit naming of participants (e.g., "@Joe, hey man")
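A toy sketch of disentanglement using only two of these features (participant identity and explicit naming): assign each turn to the most recent thread its speaker already belongs to, or whose participants it names, opening a new thread otherwise. This greedy heuristic is illustrative only; the clustering models surveyed by Uthus and Aha (2013) also weigh temporal distance and semantic content.

```python
def disentangle(turns):
    """turns: list of (speaker, text) pairs; returns a thread id per turn."""
    threads = []      # threads[i] = set of participants seen in thread i
    assignments = []
    for speaker, text in turns:
        chosen = None
        # Scan threads from most to least recently opened.
        for tid in reversed(range(len(threads))):
            participants = threads[tid]
            addressed = any(name.lower() in text.lower() for name in participants)
            if speaker in participants or addressed:
                chosen = tid
                break
        if chosen is None:            # no match: open a new thread
            chosen = len(threads)
            threads.append(set())
        threads[chosen].add(speaker)
        assignments.append(chosen)
    return assignments
```

On a simplified version of the Liang (2012) excerpt, Dora's and Marvel's turns fall into separate threads; turns like Cindy's "how to help u?", which carry no participant cues, show why real systems also need temporal and semantic features.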
Example of 'disentangled' threads
Tagging the earlier example (Liang, 2012, p.469)
<turn speaker="Dora" thread="1">guys, I want find a fire monster sword</turn>
<turn speaker="Dora" thread="1">and I need ur help</turn>
<turn speaker="Marvel" thread="2">Alright everyone get drunk!</turn>
<turn speaker="Marvel" thread="2">and prepare to leave for the next destination!</turn>
<turn speaker="Dora" thread="1">the sword can make me stronger</turn>
<turn speaker="Jess" thread="3">i gotta feeling</turn>
<turn speaker="Marvel" thread="2">and great adventure!</turn>
<turn speaker="Cindy" thread="1">how to help u?</turn>
<turn speaker="Jess" thread="3">tonight is gonna be a good night</turn>
<turn speaker="Dora" thread="1">come and u will see</turn>
Working in multilingual contexts with plurilingual text creators
Many internet users use English as a lingua franca (Seargeant & Tagg, 2011) and codeswitch between languages or varieties of a single language (Siebenhaar, 2006). This is particularly important for research on internet gaming and toxic speech, as one of the major issues is toxic behavior stemming from nationalism and racism (Nakamura, 2009).
Working in multilingual contexts with plurilingual text creators
Language identification of a whole text is usually quite easy (assuming it's fairly long and one language predominates)
Your web browser can do it.
However, developing a system that looks at lower levels inside the text (the thread, the turn, or the word/string) is considerably more complex
Dealing with different languages at the level of the thread
Identify threads using distance and participant information, without paying attention to message content (i.e., do not use semantic similarity methods such as LSA)
If thread clustering is successful, it should be possible to assign each thread a language using widely available tools
Advantage: Useful in multilingual contexts where participants are having parallel conversations according to the languages they use, with minimal codeswitching and minimal plurilingualism at the individual level
Disadvantage: Would offer little or nothing to the issue of codeswitching
An idealized pair of parallel conversations
Klaus: hallo alex! wie geht's? [#German: hi alex! how's it going?]
Alex: gut! dir? [#German: good! you?]
Shirley: u comin monica?
Klaus: gar ned schlecht [#German: not bad at all]
Monica: just a sec
Klaus: kannst mir mit was helfen? [#German: can you help me with something?]
Monica: waitin for ryan
Ryan: alright let's go!
Alex: kommt darauf an… was genau? [#German: depends… what exactly?]
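Thread-level identification on the parallel conversations above can be sketched with tiny stopword lists: score every turn in a thread against each list and label the thread with the highest-scoring language. The word lists here are illustrative stand-ins; a real pipeline would run a widely available identifier over the concatenated thread text.

```python
# Illustrative stopword lists, NOT real lexical resources.
STOPWORDS = {
    "de": {"wie", "gut", "dir", "gar", "mir", "was", "kannst", "an"},
    "en": {"you", "u", "for", "just", "a", "sec", "go", "alright"},
}

def thread_language(turns):
    """turns: list of text strings belonging to one thread.
    Returns the language whose stopwords occur most often across all turns."""
    scores = {lang: 0 for lang in STOPWORDS}
    for text in turns:
        for token in text.lower().split():
            token = token.strip("!?.,")   # ignore end punctuation
            for lang, words in STOPWORDS.items():
                if token in words:
                    scores[lang] += 1
    return max(scores, key=scores.get)
```

Aggregating over the whole thread is what makes this robust: any single short turn ("gut! dir?") may contain too little evidence on its own.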
Dealing with different languages at the level of the turn
Graham et al. (in press) have found that several widely available tools are capable of classifying the language of tweets (i.e., Twitter messages, which are constrained to a maximum of 140 characters) with a high degree of accuracy (over 67% agreement with human raters). Hence, it may be possible to identify the language of a turn with reasonable accuracy, largely dependent on how long the turn is.
Could assign short turns a default dominant language according to the text creator
Dealing with different languages at the level of the turn
Advantage: Identifying language by turn would capture some of the plurilingualism participants demonstrate Disadvantages:
Much less accurate than approaches that identify language of entire text
Still doesn't capture within-turn codeswitching
Dealing with different languages at the level of the word
In theory, this is possible.
Have dictionaries for all languages to be covered
Check to see which dictionary a particular string is found in
Serious problems stem from the preponderance of cross-lingual homographs: words with identical or similar spellings across languages (e.g., German <Bank> vs. English <bank>)
This could potentially be resolved (albeit imperfectly) by assigning such homographs to the same language as surrounding words
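The dictionary-lookup idea with neighbor-based homograph resolution can be sketched as follows; the two dictionaries are tiny illustrative stand-ins for full wordlists, and the window size is an arbitrary choice.

```python
# Tiny illustrative dictionaries, NOT real lexical resources.
DICTIONARIES = {
    "de": {"die", "bank", "ist", "geschlossen"},
    "en": {"the", "bank", "is", "closed"},
}

def tag_languages(tokens):
    """Tag each token with a language; homographs (found in more than one
    dictionary) inherit the majority language of nearby unambiguous tokens."""
    tokens = [t.lower() for t in tokens]
    langs = []
    for t in tokens:
        matches = [lang for lang, words in DICTIONARIES.items() if t in words]
        langs.append(matches[0] if len(matches) == 1 else None)
    resolved = []
    for i, lang in enumerate(langs):
        if lang is None:  # homograph or out-of-vocabulary token
            neighbors = [l for l in langs[max(0, i - 2):i + 3] if l is not None]
            lang = max(set(neighbors), key=neighbors.count) if neighbors else "unknown"
        resolved.append(lang)
    return resolved
```

Here <bank> is ambiguous on its own but gets resolved by its context: surrounded by English words it is tagged English, surrounded by German words it is tagged German.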
Dealing with different languages at the level of the word
Advantage: If it could be made to work (if!), it would allow researchers to use quantitative methods to examine codeswitching within the turn
Disadvantage: Extreme difficulty in automatically separating words in a different language from nonstandard orthography, neologisms, and borrowings
Recommendations for automatic language identification
An approach that classifies the language of each turn will likely be the best, incorporating:
The prediction and prediction confidence of automatic language identification
Meta-information about the text producer (e.g., language of other turns)
Code-switching within the turn can be imperfectly attended to by examining the replacements made by automatic string matching
Frequent replacements that are actually code-switching can be added as new lemmas
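The recommended combination can be sketched as a simple back-off: trust the per-turn identifier when its confidence clears a threshold, otherwise fall back to the speaker's dominant language estimated from their other turns. Here `identify` is a hypothetical stand-in for any automatic language identifier that returns a (language, confidence) pair, and the 0.8 threshold is an arbitrary illustrative value.

```python
def turn_language(turn_text, speaker_history, identify, threshold=0.8):
    """Combine the identifier's prediction with meta-information
    about the text producer (languages of their other turns)."""
    lang, confidence = identify(turn_text)
    if confidence >= threshold or not speaker_history:
        return lang
    # Low confidence (e.g., a very short turn): back off to the
    # most frequent language among the speaker's other turns.
    return max(set(speaker_history), key=speaker_history.count)
```

So a long turn is labeled on its own evidence, while a short, ambiguous turn like "555 na" inherits the language its producer usually writes in.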
A final mark up of part of our idealized chat text
<turn speaker="Alex" thread="1" lang="de"><word lemma="gut">gut!</word> <word lemma="dir">dir?</word></turn>
<turn speaker="Shirley" thread="2" lang="en"><word lemma="you">u</word> <word lemma="come">comin</word> <word lemma="monica">monica?</word></turn>
<turn speaker="Klaus" thread="1" lang="de"><word lemma="gar">gar</word> <word lemma="nicht">ned</word> <word lemma="schlecht">schlecht</word></turn>
<turn speaker="Monica" thread="2" lang="en"><word lemma="just">just</word> <word lemma="a">a</word> <word lemma="second">sec</word></turn>
but omg y?
The chat texts that we are analyzing were produced by and for people embedded within complex social networks and operating with a great deal of cultural knowledge. Such tagging provides the computer some understanding, however incomplete or imperfect, of the structures and contexts that inform human interpretive processes.
For those interested in language change, it gives us a framework for attending to neologisms and creative orthographic manipulation
For those wanting to apply text analysis to the moderation of online behavior, it allows analysts to begin to describe toxic speech in more complex terms than the "list of bad words" approach
Thank you! Questions?
PowerPoint available from https://gsu.academia.edu/NicSubtirelu or by email: NSUBTIRELU1@GSU.EDU (Note number ONE after last name)
References
Baker, P., Gabrielatos, C., KhosraviNik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273-306.
Graham, M., Hale, S., & Gaffney, D. (in press). Where in the world are you? Geolocation and language identification in Twitter. Professional Geographer. Prepublication version available from: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2224233
Kestemont, M., Daelemans, W., & De Pauw, G. (2010). Weigh your words—memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3), 287-301.
Liang, M.-Y. (2012). Foreign ludicity in online role-playing games. Computer Assisted Language Learning, 25(5), 455-473.
References
Mayfield, E., Adamson, D., & Rosé, C. P. (2012). Hierarchical conversation structure prediction in multi-party chat. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Seoul, South Korea.
Nakamura, L. (2009). Don't hate the player, hate the game: The racialization of labor in World of Warcraft. Critical Studies in Media Communication, 26(2), 128-144.
Nardi, B., Ly, S., & Harris, J. (2007). Learning conversations in World of Warcraft. In R. H. Sprague (Ed.), Proceedings of the 40th Annual Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE.
Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1), 31-88.
Pilz, T., Ernst-Gerlach, A., Kempken, S., Rayson, P., & Archer, D. (2008). The identification of spelling variants in English and German historical texts: Manual or automatic? Literary and Linguistic Computing, 23(1), 65-72.
References
Pilz, T., Luther, W., Fuhr, N., & Ammon, U. (2006). Rule-based search in text databases with nonstandard orthography. Literary and Linguistic Computing, 21(2), 179-186.
Salter, A., & Blodgett, B. (2012). Hypermasculinity and dickwolves: The contentious role of women in the new gaming public. Journal of Broadcasting & Electronic Media, 56(3), 401-416.
Seargeant, P., & Tagg, C. (2011). English on the internet and a 'post-varieties' approach to language. World Englishes, 30(4), 496-514.
Siebenhaar, B. (2006). Code choice and code-switching in Swiss-German Internet Relay Chatrooms. Journal of Sociolinguistics, 10(4), 481-506.
Suler, J. (2004). The online disinhibition effect. Cyberpsychology & Behavior, 7(3), 321-326.
Uthus, D. C., & Aha, D. W. (2013). Multiparticipant chat analysis: A survey. Artificial Intelligence, 199-200, 106-121.
Working with multithreaded discourse, structured by turns
Translating turns for the computer
Humans visually interpret a chat transcript as unfolding over time in a spatially downward fashion:
Nic: hey
Ben: hey
Nic: good talk
A computer program may quickly recognize this structure too, but will see it like this: Nic: hey\nBen: hey\nNic: good talk
Formatting things for the computer:
<turn speaker="Nic">hey</turn>\n<turn speaker="Ben">hey</turn>\n<turn speaker="Nic">good talk</turn>
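This formatting step can be sketched as a short parser that splits the raw newline-delimited log into speaker/message pairs and emits the turn markup used throughout this talk; a minimal sketch, assuming every line has the form "Speaker: message".

```python
from xml.sax.saxutils import escape, quoteattr

def to_turn_markup(raw_chat):
    """Convert 'Speaker: message' lines into <turn> markup,
    escaping any characters that are special in XML."""
    turns = []
    for line in raw_chat.split("\n"):
        speaker, _, message = line.partition(": ")
        turns.append("<turn speaker=%s>%s</turn>"
                     % (quoteattr(speaker), escape(message)))
    return "\n".join(turns)
```

Escaping matters because chat text freely contains characters like < and &, which would otherwise break the markup.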
Thread disentanglement
Problem: Do you need to be able to cluster chat texts in a dynamic, online fashion?
For the purpose of researching online behavior, an offline static text approach is possible
For the purpose of moderating online behavior, a more dynamic model is necessary
Problem: What assumptions are prudent and necessary for your context?
Cluster models assume texts can be broken neatly into distinct conversational threads
Some models that use participants to cluster turns assume participants will only participate in one conversational thread