Domain Sensitivity in Arabic Morphological Analysis: A Multi-Corpus Evaluation of Farasa, CAMeL, and ALP Across Modern, Classical Religious, and Classical Jurisprudential Domains

Behrouz Minaei-Bidgoli; Huda AlShuhayeb; Sayyed-Ali Hossayni

doi:10.5334/johd.418

Abstract

Arabic’s rich and heterogeneous morphology continues to challenge computational analysis, particularly when models trained on Modern Standard Arabic (MSA) are applied to classical and scriptural domains. This discussion paper presents a tri-domain evaluation framework for assessing the domain sensitivity of three widely used morphological analyzers (Farasa, CAMeL, and ALP) across the NAFIS (MSA), Quranic, and Noor–Ghateh (Hadith/Jurisprudential) corpora. Using a unified normalization and segmentation-alignment pipeline, together with bootstrap confidence intervals and paired non-parametric significance tests, the study provides a statistically robust characterization of system performance across domains with markedly different sizes and linguistic profiles. The results show that, while overall accuracy can be higher on classical and scriptural text, all analyzers exhibit systematic weaknesses when confronted with classical lexical forms, dense clitic constructions, and archaic morphological patterns, especially at the stem and suffix levels. By outlining the methodological, linguistic, and practical implications of these findings, the paper demonstrates how transparent, multi-domain benchmarking can expose structural limitations in Arabic morphological modeling and guide the development of more adaptable language technologies.
