Automatic verbal aggression detection for Russian and American imageboards
📝 Abstract
Aggression is rampant in Internet communities. Anonymous forums, usually called imageboards, are notorious for aggressive and deviant behaviour even in comparison with other Internet communities. This study explores ways of automatically detecting verbal expressions of aggression on the most popular American (4chan.org) and Russian (2ch.hk) imageboards. A set of 1,802,789 messages was used. The machine learning algorithm word2vec was applied to detect the state of aggression. A decent result was obtained for English (88%); the results for Russian are yet to be improved.
📄 Content
Denis Gordeev (a, b, 1)
(a) National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Kashirskoe highway, 31, Moscow, 115409, Russia
(b) Moscow State Linguistic University, Ostozhenka, 38, Moscow, 119034, Russia
Keywords: aggression; word2vec; imageboard; 4chan; 2ch; cyberbullying; random forest
- Introduction
The Internet is sometimes considered a rather violent and rude place. Many people, especially active users, face cyberbullying and other expressions of aggression on a daily basis. For example, the U.S. Department of Health & Human Services has launched an initiative to stop bullying, including Internet bullying [1]. Imageboards, which have been a buzzword for a while, are considered a true epicentre of all kinds of unruly behaviour found on the Net. No wonder they are called ‘the Internet hate machine’ [2].
Imageboards are ordinary Internet forums without registration. A post carries no personal details, only the message itself, the date and an email. Because no registration mechanism is implemented, nothing prevents another person from reusing the information you provided with your message. Personal tripcodes are the only means of asserting your identity, but they are used in only about 4% of cases [2]. It seems quite natural that aggression will flourish in an environment where nobody can track you and where there are no social limits. Nevertheless, Potapova and Gordeev [3] have shown that this may not be true for Russian Internet communities, although the results are still disputed.
In this research, we study aggression in an environment where it is vividly present and not constrained by social boundaries. This research is important because it is one of the first works on automatic detection of verbal aggression. We also release our trained neural model, which can be used in other research, and our methods may be applied to other languages with some tweaking.
(1) Corresponding author. Tel.: +7-495-788-5699; fax: +7-499-324-2111. E-mail address: DIGordeyev@mephi.ru
Denis Gordeev / Procedia - Social and Behavioral Sciences 00 (2015) 000–000
- Related works
Many researchers deal with aggression and its representation on the Internet. Potapova has investigated aggression [4] and compiled a Russian dictionary containing words describing this emotional state [5]. Bernstein has conducted research on 4chan and imageboard culture [2]. The task of sentiment analysis is rather close to aggression analysis because both tasks deal with the detection of human emotions. Sentiment analysis of Twitter and other social networks is especially close to our research field, because the majority of messages on anonymous forums are short: a 4chan message contains 15 words on average [3], and a Twitter post is limited to 140 characters. A huge number of papers have been published on this and adjacent topics in recent years.
Correa et al. studied the influence of complete anonymity on users' behaviour [6] in comparison with the partial anonymity of Twitter. They found that users tend to be more open and more ready to express negative emotions (not only aggression) in an anonymous environment. However, they studied Whisper, a site designed for sharing secrets and confessions, which may have influenced their results. Martínez-Cámara has surveyed different methods for Twitter sentiment analysis [7]. Dos Santos detected the sentiment of Twitter messages [8] with a success rate from 76% to 88%, depending on the evaluation set, without using any handcrafted features; unlike us, however, he had labeled data. Tang and Wei analyzed Twitter sentiment using emoticons, smileys and neural networks [9].
As we see, much modern research uses machine learning and neural network methods for sentiment detection. However, Paltoglou [10] asserts that ‘unsupervised’ dictionary-based methods outperform ‘state-of-the-art’ machine learning. Nevertheless, he does not mention any deep learning or neural network-based methods, and his results are difficult to apply to languages other than English.
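To make the contrast with machine learning concrete, a dictionary-based detector of the kind discussed above can be sketched in a few lines. This is only an illustration and not the method used in this study, which relies on word2vec. The lexicon entries, their weights, the threshold value and the function names are all hypothetical and are not taken from any published dictionary.

```python
import re

# Hypothetical toy lexicon: word -> aggression weight.
# Not taken from the paper or from any published dictionary.
AGGRESSION_LEXICON = {
    "hate": 1.0,
    "stupid": 0.8,
    "idiot": 0.9,
    "kill": 1.0,
}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(message):
    """Lowercase the message and split it into alphabetic tokens."""
    return TOKEN_RE.findall(message.lower())


def aggression_score(message, lexicon=AGGRESSION_LEXICON):
    """Return the mean lexicon weight per token (0.0 for an empty message)."""
    tokens = tokenize(message)
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)


def is_aggressive(message, threshold=0.2, lexicon=AGGRESSION_LEXICON):
    """Flag a message whose score reaches the (arbitrary) threshold."""
    return aggression_score(message, lexicon) >= threshold
```

Such a scorer needs no labeled data, but it depends entirely on the coverage of its word list, which is one reason why results obtained with an English dictionary are hard to carry over to other languages.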
- Methods and materials
Our study focuses on the automatic identification of aggression on Russian and American imageboards. W