NoorGhateh: A Benchmark Dataset for Training and Evaluating Arabic Morphological Analysis Systems

Open Access | Feb 2026

References

  1. Abdelali, A., Darwish, K., Durrani, N., & Mubarak, H. (2016). Farasa: A fast and furious segmenter for arabic. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt) (pp. 11–16). Retrieved from https://aclanthology.org/N16-3003.pdf.
  2. Abdelali, A., Hassan, S., Mubarak, H., Darwish, K., & Samih, Y. (2021). Pre-training bert on arabic tweets: Practical considerations. In Proceedings of the sixth arabic natural language processing workshop (wanlp 2021) (pp. 252–260). Association for Computational Linguistics. 10.48550/arXiv.2102.10684
  3. Abdul-Mageed, M., Elmadany, A., & Nagoudi, E. M. B. (2021). ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 7088–7105). Online: Association for Computational Linguistics. 10.18653/v1/2021.acl-long.551
  4. Alyafeai, Z., Masoud, M., Ghaleb, M., & Al-Shaibani, M. S. (2022). Masader: Metadata sourcing for arabic text and speech data resources. In Proceedings of the thirteenth language resources and evaluation conference (lrec 2022) (pp. 6340–6351). European Language Resources Association (ELRA). Retrieved 2025-01-03, from https://aclanthology.org/2022.lrec-1.681.
  5. Antoun, W., Baly, F., & Hajj, H. (2021a). Arabert: Transformer-based model for arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4). Retrieved 2025-01-03, from https://aclanthology.org/2020.osact-1.2/ (Association for Computational Linguistics, pp. 9–15)
  6. Antoun, W., Baly, F., & Hajj, H. (2021b). Aragpt2: Pre-trained transformer for arabic language generation. In Proceedings of the sixth arabic natural language processing workshop (wanlp 2021) (pp. 196–207). Association for Computational Linguistics. Retrieved 2025-01-03, from https://aclanthology.org/2021.wanlp-1.21.
  7. Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., … Oflazer, K. (2018). The MADAR Arabic dialect corpus and lexicon. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA). Retrieved 2025-01-03, from https://aclanthology.org/L18-1535.
  8. Buckwalter, T. (2002). Buckwalter arabic morphological analyzer version 1.0. Philadelphia: Linguistic Data Consortium, University of Pennsylvania. (LDC Catalog No. LDC2002L49). 10.35111/7vzm-mb15
  9. Dukes, K., & Habash, N. (2010). Morphological annotation of quranic arabic. In Proceedings of the seventh international conference on language resources and evaluation (lrec 2010) (pp. 2530–2536). Retrieved 2025-01-03, from http://www.lrec-conf.org/proceedings/lrec2010/pdf/276_Paper.pdf.
  10. Freihat, A., Bella, G., Mubarak, H., & Giunchiglia, F. (2018). A single-model approach for arabic segmentation, pos-tagging and named entity recognition. 10.1109/ICNLSP.2018.8374393
  11. Habash, N., Mohit, B., Obeid, O., Oflazer, K., Tomeh, N., & Zaghouani, W. (2013). Qalb: Qatar arabic language bank. In Proceedings of the qatar annual research conference (qarc 2013) (pp. ICTP–032). Retrieved 2025-01-03, from https://web.archive.org/web/20170517143806id_/http://nlp.qatar.cmu.edu:80/qalb/QALB_QFARCPoster.pdf.
  12. Habash, N., & Sadat, F. (2006). Arabic preprocessing schemes for statistical machine translation. In Proceedings of the human language technology conference of the naacl (pp. 49–52). Retrieved 2025-01-03, from https://aclanthology.org/N06-2013/.
  13. Hajič, J., Smrž, O., Zemánek, P., Šnaidauf, J., & Beška, E. (2004). Prague arabic dependency treebank: Development in data and tools. In Proceedings of the nemlar international conference on arabic language resources and tools (pp. 110–117). Retrieved 2025-01-03, from https://catalog.ldc.upenn.edu/docs/LDC2004T23/papers/2004-nemlar-padt.pdf.
  14. Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004). The penn arabic treebank: Building a large-scale annotated arabic corpus. In Nemlar conference on arabic language resources and tools (Vol. 27, pp. 466–467).
  15. Maamouri, M., Graff, D., Bouziri, B., Krouna, S., Kulick, S., & Buckwalter, T. (2010). Standard Arabic morphological analyzer (sama) version 3.1. Philadelphia: Linguistic Data Consortium, University of Pennsylvania. (LDC Catalog No. LDC2010L01). 10.35111/wgjk-zy44
  16. Obeid, O., Zalmout, N., Khalifa, S., Taji, D., Oudah, M., Alhafni, B., … Habash, N. (2020). Camel tools: An open source python toolkit for arabic natural language processing. In Proceedings of the twelfth language resources and evaluation conference (lrec) (pp. 7022–7032). Retrieved 2025-01-03, from https://aclanthology.org/2020.lrec-1.868/.
  17. Parker, R., Graff, D., Chen, K., Kong, J., & Maeda, K. (2011). Arabic gigaword fifth edition (Tech. Rep. No. LDC2011T11). Philadelphia: Linguistic Data Consortium. 10.35111/p02g-rw14
  18. Pasha, A., Al-Badrashiny, M., Diab, M. T., El Kholy, A., Eskander, R., Habash, N., … Roth, R. (2014). Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic. In Proceedings of the ninth international conference on language resources and evaluation (lrec’14) (pp. 1094–1101). Retrieved 2025-01-03, from http://www.lrec-conf.org/proceedings/lrec2014/pdf/593_Paper.pdf.
  19. Zeroual, I., & Lakhouaja, A. (2016). A new quranic corpus rich in morphosyntactical information. International Journal of Speech Technology, 19(2), 339–346. 10.1007/s10772-016-9335-7
  20. Zerrouki, T., & Balla, A. (2017). Tashkeela: Novel corpus of arabic vocalized texts, data for auto-diacritization systems. Data in brief, 11(6), 147–151. 10.1016/j.dib.2017.01.011
DOI: https://doi.org/10.5334/johd.409 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 12, 2025 | Accepted on: Dec 11, 2025 | Published on: Feb 10, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Behrouz Minaei-Bidgoli, Huda AlShuhayeb, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.