NoorGhateh: A Benchmark Dataset for Training and Evaluating Arabic Morphological Analysis Systems

Open Access | Feb 2026

Abstract

This dataset provides a linguistically and morphologically annotated sample of 313 Arabic words drawn from a larger corpus of 223,690 words compiled from Sharaye al-Islam, a classical Arabic jurisprudential text. Each token includes segmentation, lemma, part-of-speech, and affix-level annotations that have been manually verified for accuracy. The data are stored in UTF-8 CSV format and openly shared via Zenodo. This resource supports training and benchmarking of Arabic morphological analysis systems and can be used for developing and evaluating AI-based models in Arabic natural language processing.
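
As a rough illustration of how the UTF-8 CSV release might be consumed, the following Python sketch reads the annotated tokens and tallies their part-of-speech tags. The column name `pos` and the file name mentioned in the note below are assumptions made for illustration only; the actual header row of the file deposited on Zenodo should be checked and the names adjusted accordingly.

```python
import csv
from collections import Counter
from typing import Iterable

# ASSUMPTION: "pos" is a hypothetical column name, not taken from the dataset.
# Replace it with the part-of-speech column found in the actual CSV header.
POS_COLUMN = "pos"


def read_annotations(lines: Iterable[str]) -> list[dict[str, str]]:
    """Parse CSV lines (header row first) into one dict per annotated token."""
    return list(csv.DictReader(lines))


def pos_distribution(rows: list[dict[str, str]],
                     pos_column: str = POS_COLUMN) -> Counter:
    """Count how often each part-of-speech tag occurs in the parsed rows."""
    return Counter(row[pos_column] for row in rows)
```

With a local copy of the file, the reader could be fed an open file handle, for example `open("noorghateh_sample.csv", encoding="utf-8", newline="")`, where the file name is again hypothetical. Passing `encoding="utf-8"` explicitly avoids depending on the platform default when reading Arabic text.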

DOI: https://doi.org/10.5334/johd.409 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 12, 2025 | Accepted on: Dec 11, 2025 | Published on: Feb 10, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Behrouz Minaei-Bidgoli, Huda AlShuhayeb, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.