Table 1
Comparison of major Arabic morphological datasets by size, variety, annotation scope, and domain.
| DATASET | WORDS | VARIETY | ANNOTATION | DOMAIN |
|---|---|---|---|---|
| Noor-Ghateh Dataset (this work) | 223,690 | Classical Arabic (CA) | Morphological segmentation; part-of-speech (POS) tags; lemmas; roots; derivational patterns; clitic segmentation | Hadith/Jurisprudence |
| Penn Arabic Treebank (PATB) (Maamouri et al., 2004) | 37M | Modern Standard Arabic (MSA) | Tokenization; segmentation; POS tags; lemmas; diacritics; syntactic trees | News |
| BAMA/SAMA (Buckwalter, 2002; Maamouri et al., 2010) | N/A | MSA | Lexicon-based morphological feature bundles; stems; roots | General/Lexicon |
| Quranic Arabic Corpus (Leeds) (Dukes & Habash, 2010) | 77K | CA | Morphological segmentation; POS tags; dependency grammar; semantic ontology | Quran |
| Zeroual Quranic Corpus (Zeroual & Lakhouaja, 2016) | 1.3M | CA | Stems; patterns; lemmas; roots | Quran |
| Prague Arabic Dependency Treebank (Hajič et al., 2004) | 114K | MSA | Morphology; syntax; dependency relations | News |
| QALB Corpus (Habash et al., 2013) | 2M | MSA + Dialectal Arabic (DA) | Tokenization; POS tags; lemmatization; diacritization; error annotation | Essays/Web |
| Tashkeela (Zerrouki & Balla, 2017) | 75M | MSA + CA | Diacritization; morphological features; syntactic context | Mixed |
| Arabic Gigaword Fifth Edition (Parker et al., 2011) | 1.1B | MSA | Morphology; lexical features; syntactic parsing; named entities | News |
| Masader (Alyafeai et al., 2022) | N/A | MSA + DA | Dialect identification; named entity recognition (NER); sentiment; morphology | Multi-domain |

Figure 1
Overview of the Noor-Ghateh dataset preparation workflow, illustrating the sequential stages from text selection and normalization to manual segmentation, verification, and final export.

Figure 2
Screenshot of the annotation environment used in the creation of the Noor-Ghateh dataset. Annotators segmented tokens, assigned lemmas, and specified morphological features such as part of speech, case, number, and gender. The interface displays the tokenized Arabic text (lower panel), the feature categories (left panel), and the annotation fields (top panel).

Table 2
Sample entries from the Noor-Ghateh dataset showing Arabic tokens, segmentations, lemmas, POS tags, and English glosses.
| TOKEN | SEGMENTATION | LEMMA | POS | ENGLISH GLOSS |
|---|---|---|---|---|
| الطهارة | ال+طهارة | طهارة | NOUN | Purification |
| ويعتمد | و+يعتمد | اعتماد | VERB | relies on |
| اربعة | أربعة | أربعة | NOUN | four |
| للوضوء | ل+ال+وضوء | وضوء | NOUN | for ablution |
| فالواجب | ف+ال+واجب | واجب | PART | thus, the obligation is |
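
The relationship between the TOKEN and SEGMENTATION columns above can be illustrated with a small consistency check. The following is a minimal sketch, not the dataset's own tooling: it assumes the `+` delimiter shown in Table 2, and the function names and the light normalization rules (folding hamza-carrying alifs and the Persian ye) are hypothetical choices made for illustration.

```python
# Hypothetical light normalization (an assumption, not the dataset's rules):
# fold hamza-carrying alifs to bare alif and Persian ye to Arabic ya.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # أ -> ا
    "\u0625": "\u0627",  # إ -> ا
    "\u0622": "\u0627",  # آ -> ا
    "\u06CC": "\u064A",  # ی -> ي
})


def split_segments(segmentation: str) -> list[str]:
    """Split a '+'-delimited segmentation string into its morphemes."""
    return [part for part in segmentation.split("+") if part]


def fold(text: str) -> str:
    """Apply the illustrative character folding defined above."""
    return text.translate(_FOLD)


def reconstructs_token(token: str, segmentation: str) -> bool:
    """Return True if the joined morphemes equal the token after folding."""
    return fold("".join(split_segments(segmentation))) == fold(token)


# (token, segmentation) pairs taken from Table 2.
SAMPLE_ROWS = [
    ("الطهارة", "ال+طهارة"),
    ("ويعتمد", "و+يعتمد"),
    ("اربعة", "أربعة"),
    ("للوضوء", "ل+ال+وضوء"),
    ("فالواجب", "ف+ال+واجب"),
]

if __name__ == "__main__":
    for token, segmentation in SAMPLE_ROWS:
        print(token, split_segments(segmentation),
              reconstructs_token(token, segmentation))
```

Under these assumptions, the entry للوضوء is the only sample row that does not reconstruct by plain concatenation, because the alif of the definite article is dropped in writing after the preposition ل. A check of this kind would therefore flag orthographic contractions for review rather than treat them as annotation errors.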

Table 3
Formal schema definitions and valid values for the 15 morphological attributes used in the Noor-Ghateh dataset.
| TAG | DEFINITION | VALUES/RANGE | EXAMPLE FROM DATASET |
|---|---|---|---|
| Seq | Sequential index of the word or morpheme | Natural numbers (1–n) | 1, 2, 3 |
| Slice | Surface form of the morpheme | Arabic string | وضوء, ال كتاب |
| Entry | Canonical or normalized form | Arabic string | وضوء, ال طهارة |
| Affix | Morphological category (prefix, stem, suffix) | پيشوند, هسته, پسوند | پيشوند |
| Pos | Part-of-speech tag | حرف, فعل, اسم | فعل |
| Lemma | Base lemma of the stem | Arabic lemma string | وضوء, طهارة, اعتماد |
| Case | Grammatical case | مبني بر كسر, مجرور, منصوب, مرفوع | مبني بر كسر |
| Categ | Abstract derivational pattern | فَعَلَ يَفْعُلُ, اِسْتِفْعَال, اِفْعِلَّال, اِفْتِعَال, تهي, إِفعَال, تَفَعُّل, فَعْلَلَة, ثلاثي مجرد | ثلاثي مجرد |
| DervT | Derivational type | جامد غير مصدري, اسم مفعول, اسم فاعل, مصدر | جامد غير مصدري |
| Num | Grammatical number | جمع, مثنى, مفرد | مفرد |
| Root | Underlying triliteral or quadriliteral root | Arabic root sequence | وجب, وضء, كتب |
| TOV | Verb type (aspect/transitivity) | 1 = active; 2 = passive; 3 = reflexive; 4 = intensive | 1 |
| Time | Tense/temporal category | أمر, مضارع, ماض | ماض |
| Voic | Verb voice | مجهول, معلوم | معلوم |
| Kol | Functional classification | شرطيه, استينافيه, عاطفه, تعريف, جاره | جاره |
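
The closed value sets in Table 3 lend themselves to a simple record validator. The sketch below is one possible way to check a record against a subset of the schema; it assumes that each record is available as a plain dictionary keyed by the tag names in Table 3 with string values, and the names `ALLOWED` and `validate_record` are hypothetical. Only the attributes whose value lists are unambiguous in the table are covered.

```python
# Closed value sets copied from Table 3 (subset only; assumed exhaustive
# for these tags purely for the purpose of this illustration).
ALLOWED = {
    "Affix": {"پيشوند", "هسته", "پسوند"},  # prefix, stem, suffix
    "Pos": {"اسم", "فعل", "حرف"},          # noun, verb, particle
    "Num": {"مفرد", "مثنى", "جمع"},         # singular, dual, plural
    "Time": {"ماض", "مضارع", "أمر"},        # past, imperfect, imperative
    "Voic": {"معلوم", "مجهول"},             # active, passive
    "TOV": {"1", "2", "3", "4"},
}


def validate_record(record: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []

    # Seq must be a natural number starting at 1.
    seq = record.get("Seq", "")
    if not (seq.isdecimal() and int(seq) >= 1):
        problems.append(f"Seq must be a positive integer, got {seq!r}")

    # Closed-set attributes are optional, but if present must be in the set.
    for tag, allowed in ALLOWED.items():
        value = record.get(tag)
        if value and value not in allowed:
            problems.append(f"{tag} has unexpected value {value!r}")

    return problems


if __name__ == "__main__":
    good = {"Seq": "1", "Slice": "وضوء", "Affix": "هسته",
            "Pos": "اسم", "Num": "مفرد"}
    bad = {"Seq": "0", "Pos": "NOUN"}
    print(validate_record(good))
    print(validate_record(bad))
```

Treating the closed-set attributes as optional reflects the fact that verb-specific tags such as Time and Voic would not be expected on nouns or particles; whether the dataset leaves such fields empty or omits them is an assumption here.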
