Building a Language Corpus of Hate Speech in the Croatian Media Space of Social Networks

Authors: Tanja Grmuša, Slobodan Hadžić, Artur Šilić

Abstract

Hate speech is an unacceptable form of socially harmful communication that has recently become more prevalent. The development of digital media, and in particular the growing role of social networks in private and public communication, has opened up numerous public virtual forums that encourage formerly passive audiences to participate and communicate more actively. The freedoms of opinion and expression, as fundamental democratic principles, stand in tension with the space this opens for toxic communication and with the potential for radicalization of like-minded individuals who congregate in virtual communities, where they often select contributors, journalists, editors, media outlets and other users as targets for future attacks. Given the boundlessness of the Internet, whose content is difficult to control, the question is how to detect socially harmful forms of communication in the public sphere, whether they can be prevented, and how to retain existing audiences. Potential applications include the moderation of inappropriate user comments with software that gives media organizations instant responses but that also needs constant improvement. The authors' interest in this work lies in detecting hate speech against ethnic groups on social networks and exploring the possibility of using speech technologies to detect and prevent its spread. Combining quantitative and qualitative content analysis with a set of software solutions built on speech technologies, we enable efficient automatic and semi-automatic analysis of user-generated content. We use the WordFinder program for fast word retrieval in large corpora, the Crontiment tool for automatic sentiment analysis of texts in Croatian, and the Text Marker application for efficient manual labeling and building of corpora.
The authors use a longitudinal study to identify changes in the types and frequency of hate speech on this topic and to identify the main advantages and disadvantages of language technologies, suggesting possible directions for development.

Key words: audience, digital media, hate speech, natural language processing, socially harmful communication, user comments
