NoorGhateh: A Benchmark Dataset for Training and Evaluating Arabic Morphological Analysis Systems

Open Access | Feb 2026

Figures & Tables

Table 1

Comparison of major Arabic morphological datasets by size, variety, annotation scope, and domain.

| DATASET | WORDS | VARIETY | ANNOTATION | DOMAIN |
|---|---|---|---|---|
| Noor–Ghateh Dataset (this work) | 223,690 | Classical Arabic (CA) | Morphological segmentation; part-of-speech (POS) tags; lemmas; roots; derivational patterns; clitic segmentation | Hadith/Jurisprudence |
| Penn Arabic Treebank (PATB) (Maamouri et al., 2004) | 37M | Modern Standard Arabic (MSA) | Tokenization; segmentation; POS tags; lemmas; diacritics; syntactic trees | News |
| BAMA/SAMA (Buckwalter, 2002; Maamouri et al., 2010) | N/A | MSA | Lexicon-based morphological feature bundles; stems; roots | General/Lexicon |
| Quranic Arabic Corpus (Leeds) (Dukes & Habash, 2010) | 77K | CA | Morphological segmentation; POS tags; dependency grammar; semantic ontology | Quran |
| Zeroual Quranic Corpus (Zeroual & Lakhouaja, 2016) | 1.3M | CA | Stems; patterns; lemmas; roots | Quran |
| Prague Arabic Dependency Treebank (Hajič et al., 2004) | 114K | MSA | Morphology; syntax; dependency relations | News |
| QALB Corpus (Habash et al., 2013) | 2M | MSA + Dialectal Arabic (DA) | Tokenization; POS tags; lemmatization; diacritization; error annotation | Essays/Web |
| Tashkeela (Zerrouki & Balla, 2017) | 75M | MSA+CA | Diacritization; morphological features; syntactic context | Mixed |
| Arabic Gigaword Fifth Edition (Parker et al., 2011) | 1.1B | MSA | Morphology; lexical features; syntactic parsing; named entities | News |
| Masader (Alyafeai et al., 2022) | N/A | MSA+DA | Dialect identification; named entity recognition (NER); sentiment; morphology | Multi-domain |
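
The WORDS column of Table 1 mixes exact counts (223,690), abbreviated magnitudes (77K, 1.3M, 1.1B), and N/A, so the sizes cannot be compared directly as strings. The following is a minimal sketch of one way to normalize them to integers; the helper name `parse_word_count` and the short dataset labels are illustrative assumptions, not part of the published dataset or its tooling.

```python
import re

# Multipliers for the size suffixes used in the WORDS column of Table 1.
_SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_word_count(text):
    """Convert a size string such as '223,690', '77K', or '1.3M' to an int.

    Returns None for 'N/A' (resources listed without a word count).
    Hypothetical helper, written only for this illustration.
    """
    text = text.strip().upper()
    if text == "N/A":
        return None
    match = re.fullmatch(r"([\d.,]+)([KMB]?)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number = float(match.group(1).replace(",", ""))
    return round(number * _SUFFIX.get(match.group(2), 1))


# The three Classical Arabic resources from Table 1, ordered by size.
ca_sizes = {
    "Noor-Ghateh": "223,690",
    "Quranic Arabic Corpus": "77K",
    "Zeroual Quranic Corpus": "1.3M",
}
ca_by_size = sorted(ca_sizes, key=lambda name: parse_word_count(ca_sizes[name]))
```

Because the abbreviated values are rounded in the table, the parsed integers are only approximate and are suitable for ordering or rough ratios, not exact statistics.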
Figure 1

Overview of the Noor-Ghateh dataset preparation workflow, illustrating the sequential stages from text selection and normalization to manual segmentation, verification, and final export.

Figure 2

Screenshot of the annotation environment used in the creation of the Noor-Ghateh dataset. Annotators segmented tokens, assigned lemmas, and specified morphological features such as part of speech, case, number, and gender. The interface displays the tokenized Arabic text (lower panel), the feature categories (left panel), and the annotation fields (top panel).

Table 2

Sample entries from the Noor-Ghateh dataset showing Arabic tokens, segmentations, lemmas, POS tags, and English glosses.

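| TOKEN | SEGMENTATION | LEMMA | POS | ENGLISH GLOSS |
|---|---|---|---|---|
| الطهارة | ال+طهارة | طهارة | NOUN | Purification |
| ويعتمد | و+یعتمد | اعتماد | VERB | relies on |
| اربعة | أربعة | أربعة | NOUN | four |
| للوضوء | ل+ال+وضوء | وضوء | NOUN | for ablution |
| فالواجب | ف+ال+واجب | واجب | PART | thus, the obligation is |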
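
The SEGMENTATION column in Table 2 joins morphemes with `+`. As a rough sketch of how such strings might be consumed, the snippet below splits a segmentation into its parts and checks whether concatenating them reproduces the surface token. The function names are hypothetical and the `+` delimiter is taken from the table as displayed; the dataset's actual file format may differ.

```python
def split_segmentation(segmentation):
    """Split a '+'-delimited segmentation string into its morphemes."""
    return [part for part in segmentation.split("+") if part]


def surface_matches(token, segmentation):
    """Return True if the concatenated segments equal the surface token.

    Hypothetical check: a False result does not mean the annotation is
    wrong, because segments may be restored to their underlying form.
    """
    return "".join(split_segmentation(segmentation)) == token


# Rows taken from Table 2.
parts = split_segmentation("ل+ال+وضوء")           # preposition + article + stem
plain = surface_matches("الطهارة", "ال+طهارة")     # segments concatenate to the token
elided = surface_matches("للوضوء", "ل+ال+وضوء")    # alif of the article is not written
```

The last row illustrates why a naive concatenation check is not a validity test: in للوضوء the alif of the definite article is dropped in writing after the preposition ل, while the segmentation restores it as ال.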
Table 3

Formal schema definitions and valid values for the 15 morphological attributes used in the Noor-Ghateh dataset.

| TAG | DEFINITION | VALUES/RANGE | EXAMPLE FROM DATASET |
|---|---|---|---|
| Seq | Sequential index of the word or morpheme | Natural numbers (1–n) | 1, 2, 3 |
| Slice | Surface form of the morpheme | Arabic string | وضوء, ال كتاب |
| Entry | Canonical or normalized form | Arabic string | وضوء, ال طهارة |
| Affix | Morphological category (prefix, stem, suffix) | پسوند, هسته, پيشوند | پيشوند |
| Pos | Part-of-speech tag | حرف, فعل, اسم | فعل |
| Lemma | Base lemma of the stem | Arabic lemma string | وضوء, طهارة, اعتماد |
| Case | Grammatical case | مبني بر كسر, مجرور, منصوب, مرفوع | مبني بر كسر |
| Categ | Abstract derivational pattern | فَعَلَ يَفْعُلُ اِسْتِفْعَال اِفْعِلَّال اِفْتِعَال تهي إِفعَال تَفَعُّل فَعْلَلَة ثلاثي مجرد | ثلاثي مجرد |
| DervT | Derivational type | جامد غير مصدري, اسم مفعول, اسم فاعل, مصدر | جامد غير مصدري |
| Num | Grammatical number | جمع, مثنى, مفرد | مفرد |
| Root | Underlying triliteral or quadriliteral root | Arabic root sequence | وجب, وضء, كتب |
| TOV | Verb type (aspect/transitivity) | 1 = active; 2 = passive; 3 = reflexive; 4 = intensive | 1 |
| Time | Tense/temporal category | أمر, مضارع, ماض | ماض |
| Voic | Verb voice | مجهول, معلوم | معلوم |
| Kol | Functional classification | شرطيه, استينافيه, عاطفه, تعريف, جاره | جاره |
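
Several of the tags in Table 3 take values from small closed sets, which makes simple record validation possible. The sketch below assumes, purely for illustration, that each annotated morpheme is available as a Python dict keyed by tag name and that TOV is stored as a string; the dataset's real serialization is not described here, and `ALLOWED` and `validate` are hypothetical names. Only tags whose value sets are unambiguous in Table 3 are checked.

```python
# Closed value sets copied from Table 3. Open-valued tags (Slice, Entry,
# Lemma, Root) and tags with longer value lists are not constrained here.
ALLOWED = {
    "Pos": {"اسم", "فعل", "حرف"},
    "Num": {"مفرد", "مثنى", "جمع"},
    "Time": {"ماض", "مضارع", "أمر"},
    "Voic": {"معلوم", "مجهول"},
    "TOV": {"1", "2", "3", "4"},
}


def validate(record):
    """Return a list of human-readable problems found in one record.

    Assumes `record` is a dict keyed by the tag names of Table 3;
    missing tags are treated as not applicable rather than as errors.
    """
    problems = []
    seq = record.get("Seq")
    if not (isinstance(seq, int) and seq >= 1):
        problems.append(f"Seq: expected a natural number, got {seq!r}")
    for tag, allowed in ALLOWED.items():
        value = record.get(tag)
        if value is not None and value not in allowed:
            problems.append(f"{tag}: unexpected value {value!r}")
    return problems


# A verb record using the example values from Table 3.
ok = validate({"Seq": 1, "Pos": "فعل", "Time": "ماض", "Voic": "معلوم", "TOV": "1"})
bad = validate({"Seq": 0, "Pos": "NOUN"})
```

Returning a list of problems rather than raising on the first one is a design choice that suits corpus-wide checks, where an annotator would typically want every inconsistency in a record reported at once.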
DOI: https://doi.org/10.5334/johd.409 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 12, 2025 | Accepted on: Dec 11, 2025 | Published on: Feb 10, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Behrouz Minaei-Bidgoli, Huda AlShuhayeb, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.