Integrating Handcrafted Features with Machine Learning for Hate Speech Detection in Albanian Social Media

Open Access | Dec 2024

Abstract

Social media use has grown significantly over the last decade, making it easier for people to communicate. The vast amount of data generated by these platforms is mostly uncontrolled and unmanageable, which has given individuals the opportunity to post hate speech and offensive language. To address this issue, this research conducts extensive experiments with machine learning models and handcrafted feature extraction for Albanian, a low-resource language. We used several machine learning algorithms, including Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), and Logistic Regression (LR), and extracted a considerable number of handcrafted features. To improve accuracy, we performed feature selection to identify the features most relevant to detecting hate speech in Albanian. The results show that LR performed best, with an F1 score of 76.77. Random Forest feature ranking and SHAP analysis revealed that many comments on Albanian social media exhibit unique characteristics, resulting in a large feature set. This suggests that there is no clear pattern that lets the machine learning models flag comments accurately, indicating that Albanian is linguistically challenging to analyze.
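
As a rough illustration of the two ideas the abstract relies on, handcrafted features and F1 scoring, the following is a minimal Python sketch. The function names and the specific surface features (token counts, uppercase ratio, punctuation counts) are assumptions chosen for illustration; they are not the feature set used in the paper, and the abstract does not state which F1 averaging scheme the reported score uses.

```python
import string


def extract_features(comment: str) -> dict:
    """Compute a few illustrative surface features for one comment.

    These features are hypothetical examples, not the paper's feature set.
    """
    tokens = comment.split()
    letters = [ch for ch in comment if ch.isalpha()]
    n_upper = sum(1 for ch in letters if ch.isupper())
    return {
        "n_chars": len(comment),
        "n_tokens": len(tokens),
        # Guard against empty comments to avoid division by zero.
        "avg_token_len": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        "upper_ratio": n_upper / len(letters) if letters else 0.0,
        "n_exclaim": comment.count("!"),
        "n_punct": sum(1 for ch in comment if ch in string.punctuation),
    }


def f1_score(y_true, y_pred, positive=1) -> float:
    """Binary F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(extract_features("STOP it now!!"))
    print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a full pipeline, feature dictionaries like these would be turned into vectors and passed to a classifier such as LR or RF; that step is omitted here to keep the sketch dependency-free.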

Language: English
Page range: 80 - 92
Published on: Dec 24, 2024
In partnership with: Paradigm Publishing Services
Publication frequency: 2 issues per year

© 2024 Endrit Fetahi, Mentor Hamiti, Arsim Susuri, Xhemal Zenuni, Jaumin Ajdari, published by South East European University
This work is licensed under the Creative Commons Attribution 4.0 License.