A Telegram Corpus for Hate Speech, Offensive Language, and Online Harm

Veronika Solopova; Tatjana Scheffler; Mihaela Popa-Wyatt

doi:10.5334/johd.32

Abstract

We provide a new text corpus from the social medium Telegram, which is rich in indirect forms of divisive speech. We scraped all messages from one channel of Donald Trump supporters, covering a large part of his presidency, from late 2016 until January 2021, including the January 6 Capitol riot. The discussion among the group members over this long time period includes the spread of disinformation, the disparagement of out-group members, and other forms of harmful speech. To enable research into the role of harmful speech in political discourse, we added two types of annotations to the corpus: (i) automatic annotations of offensive language for all messages, and (ii) our own manual annotations of harmful language for a portion of the posts leading up to the January 2021 Capitol riot and its aftermath.

