Advancements in Offensive Language Detection: A Comprehensive Review and Experimental Analysis

C. Nalini; R. Shanthakumari; Y. Agashia Maria; T. Janarthanan; M. Manibharathi

doi:10.2478/ias-2024-0012

Journal of Information Assurance and Security

Volume 19 (2024): Issue 4 (December 2024)

Open Access | Feb 2025

Abstract

The proliferation of offensive language in digital communication has become a significant challenge, creating an urgent need for Natural Language Processing (NLP) techniques that can identify and mitigate it. This study reviews work on offensive language identification published from 2020 through 2023, focusing on NLP techniques, machine learning, deep learning, and transformer models. It examines the datasets used in prior research, specifically those for Dravidian languages such as Tamil and Malayalam. It also covers preprocessing, which ranges from data cleansing to text representation methods such as TF-IDF and Word2Vec that are used to train and optimize models. Finally, it compares standard supervised learning models such as Support Vector Machine and Naive Bayes with more recent transformer models such as BERT, to identify which approach improves a model’s accuracy and effectiveness.
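As a rough illustration of the classical pipeline named in the abstract (TF-IDF term weighting and a Naive Bayes classifier), the following Python sketch may help. It is not the authors' implementation: the toy sentences, the whitespace tokenizer, and the names `tokenize`, `tfidf_vectors`, and `NaiveBayes` are assumptions made for this example. The classifier works on raw term counts, and the TF-IDF weighting is shown separately.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing on raw counts."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(tokenize(doc))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for t in tokenize(doc):
                if t in self.vocab:  # ignore terms never seen in training
                    score += math.log((self.term_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented only for this example.
train_docs = [
    "you are stupid and useless",
    "what a lovely video",
    "stupid useless channel",
    "lovely song thank you",
]
train_labels = ["offensive", "not_offensive", "offensive", "not_offensive"]

vectors = tfidf_vectors(train_docs)
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("useless stupid video")
```

In the first toy sentence, a term that occurs in only one document ("and") gets a higher TF-IDF weight than one shared by two documents ("you"), which is the intuition behind using TF-IDF to emphasize distinctive words.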

DOI: https://doi.org/10.2478/ias-2024-0012 | Journal eISSN: 1554-1029 | Journal ISSN: 1554-1010

Language: English

Page range: 162 - 179

Published on: Feb 20, 2025

Published by: Cerebration Science Publishing Co., Limited

In partnership with: Paradigm Publishing Services

Publication frequency: 6 issues per year

Related subjects: Fundamentals of computer sciences, Theoretical computer sciences, IT-security and cryptology

© 2025 C. Nalini, R. Shanthakumari, Y. Agashia Maria, T. Janarthanan, M. Manibharathi, published by Cerebration Science Publishing Co., Limited
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
