Table 1
Overall statistics of the Noor–Ghateh dataset.
| PROPERTY | COUNT |
|---|---|
| Sentences | 10,160 |
| Word tokens (annotated) | 223,690 |
| Unique lemmas | 16,420 |
| Unique roots | 8,150 |
Table 2
Morpheme-level composition in the Noor–Ghateh dataset.
| MORPHEME TYPE | OCCURRENCES |
|---|---|
| Prefix morphemes | 74,242 |
| Suffix morphemes | 18,617 |
| Average morphemes per word | 1.42 |
| Estimated clitic density (per 100 tokens) | 33.1 |

Figure 1
Sample XML annotation from the Noor–Ghateh dataset.
Table 3
Annotation schema in the Noor–Ghateh dataset.
| FIELD | DESCRIPTION |
|---|---|
| Seq | Sequential index of the word or morpheme within its sentence |
| Slice | Surface form of the token as it appears in the original text |
| Entry | Canonical or normalized form of the morpheme |
| Affix | Morphological category (prefix, stem, or suffix) |
| POS | Part-of-speech tag (e.g., noun, verb, particle) |
| Case | Grammatical case or syntactic role |
| Kol | Functional classification (e.g., conjunction, preposition) |
| Lemma | Base lemma associated with the stem |
| Categ | Morphological pattern type (e.g., triliteral root pattern) |
| DervT | Derivational type (e.g., adjective, participle) |
| Num | Grammatical number (singular, dual, plural) |
| Root | Underlying triliteral or quadriliteral Arabic root |
Table 4
Dataset overview and linguistic register.
| DATASET | REGISTER | KEY CHARACTERISTICS |
|---|---|---|
| NAFIS | Modern Standard Arabic | High morphological ambiguity; MSA gold standard |
| Qur’anic Corpus | Classical Scriptural Arabic | Conservative segmentation; historical orthography |
| Noor–Ghateh | Classical Jurisprudential Arabic | Dense clitic stacking; templatic morphology |
Table 5
Compressed comparison of clitic representation across datasets.
| CLITIC | TYPE | FUNCTION | NAFIS | QUR’ANIC CORPUS | NOOR–GHATEH |
|---|---|---|---|---|---|
| و | Conjunction | “and” | Proclitic: و+قال | Split or attached morphologically | Always separated: و+يفعلون |
| ف | Conj./resultative | “then/so” | ف+قال | Attached in token; split in morphology | Explicit split: ف+إنهم |
| ب/ك/ل | Prepositional proclitics | “with/by”, “as”, “for” | ب+ال+بيت | Separated morphologically | ب+كلمة |
| ال | Definite article | Definiteness marking | Separated after preps | Morphological morpheme | Always explicit |
| Pronominal suffixes | Enclitics | Possession/object | كتاب+هم | Morphologically split | قول+كم |
| Multi-clitic stacks | Agglutinative forms | Complex tokens | Rare | Frequent in classical text | Dense stacks in legal style |
Table 6
Representative orthographic normalization examples.
| ORIGINAL | NORMALIZED | NOTE |
|---|---|---|
| إيمانهم | ايمانهم | Hamzated alif unified |
| مسئولية | مسؤولية | Hamza normalization |
| هديٰهم | هداهم | Dagger alif removed |
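
The first and third rows of Table 6 can be approximated with a few character-level rules. The sketch below is an illustration only: the rule set is inferred from those two examples, the names `normalize`, `HAMZATED_ALIF`, and `YA_FORMS` are hypothetical, and the hamza-seat rewrite in the second row is lexical and is not covered.

```python
import re

# Code points are spelled out so the mapping is unambiguous.
HAMZATED_ALIF = "\u0623\u0625\u0622"  # alif with hamza above, hamza below, madda
BARE_ALIF = "\u0627"                  # plain alif
DAGGER_ALIF = "\u0670"                # superscript (dagger) alif
YA_FORMS = "\u064A\u0649"             # ya and alif maqsura


def normalize(token: str) -> str:
    """Apply two assumed normalization rules to a single token."""
    # Rule 1: unify hamzated alif variants to the bare alif.
    token = re.sub(f"[{HAMZATED_ALIF}]", BARE_ALIF, token)
    # Rule 2: a ya carrying a dagger alif is rewritten as a plain alif.
    token = re.sub(f"[{YA_FORMS}]{DAGGER_ALIF}", BARE_ALIF, token)
    # Any dagger alif left over is dropped.
    return token.replace(DAGGER_ALIF, "")
```

Applying the rules in this order means the alif introduced by the second rule is never revisited by the first.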
Table 7
Representative clitic-boundary harmonization.
| TOKEN | ALIGNED | NOTE |
|---|---|---|
| وبالكتاب | و+ب+ال+كتاب | Standardized proclitics |
| فزادهم | ف+زاد+هم | Qur’anic ف split |
| تجارتهم | تجارت+هم | Unified enclitic rule |
Table 8
Canonical segmentation examples across domains.
| SURFACE | CANONICAL | DOMAIN |
|---|---|---|
| ليستغفروا | ل+يستغفر+وا | Qur’anic |
| فانهم | ف+ان+هم | Noor–Ghateh |
| اوبالحق | او+ب+ال+حق | NAFIS |
Table 9
Representative gold–prediction alignment outcomes.
| TOKEN | GOLD | PREDICTION | OUTCOME |
|---|---|---|---|
| فاستغفروا | ف+استغفر+وا | CAMeL correct | Perfect match |
| وكتابهم | و+كتاب+هم | Farasa: و+كتابهم | Under-segmentation (1 error) |
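
One way to arrive at outcomes like those in Table 9 is to compare the boundary positions implied by the `+` separators. The sketch below is a minimal illustration, not the evaluation code behind the table; the function names `boundaries` and `compare` are hypothetical, and it assumes gold and prediction spell out the same surface string.

```python
def boundaries(segmentation: str) -> set[int]:
    """Character offsets in the unsegmented token where a '+' boundary falls."""
    offsets, position = set(), 0
    for morpheme in segmentation.split("+")[:-1]:
        position += len(morpheme)
        offsets.add(position)
    return offsets


def compare(gold: str, predicted: str) -> dict:
    """Classify a prediction against the gold segmentation."""
    g, p = boundaries(gold), boundaries(predicted)
    return {
        "exact_match": gold == predicted,
        "missed": len(g - p),    # gold boundaries not predicted (under-segmentation)
        "spurious": len(p - g),  # predicted boundaries absent from gold (over-segmentation)
    }


result = compare("و+كتاب+هم", "و+كتابهم")
```

For the second row of the table, `result` reports no exact match, one missed boundary, and no spurious boundaries, which corresponds to the single under-segmentation error.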
Table 10
Token-level segmentation accuracy (exact match).
| DATASET | FARASA | CAMEL | ALP |
|---|---|---|---|
| NAFIS (MSA) | 0.59 | 0.68 | 0.65 |
| Qur’anic Corpus | 0.76 | 0.80 | 0.81 |
| Noor–Ghateh (Hadith) | 0.81 | 0.81 | 0.79 |
Table 11
Domain sensitivity (Δdomain) and relative improvements.
| ANALYZER | ΔQuran | ΔHadith | IMPROVEMENT (%) |
|---|---|---|---|
| Farasa | –0.17 | –0.22 | 37.3 |
| CAMeL | –0.12 | –0.13 | 19.1 |
| ALP | –0.16 | –0.14 | 21.5 |
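
The values in Table 11 are consistent with simple arithmetic on the token-level accuracies in Table 10: each Δ matches the NAFIS accuracy minus the accuracy on the other domain, and the improvement column matches the relative gain from NAFIS to Noor–Ghateh. These definitions are inferred from the numbers rather than stated in the tables, and the names below are hypothetical.

```python
# Token-level exact-match accuracies copied from Table 10.
ACCURACY = {
    "Farasa": {"NAFIS": 0.59, "Quran": 0.76, "Hadith": 0.81},
    "CAMeL": {"NAFIS": 0.68, "Quran": 0.80, "Hadith": 0.81},
    "ALP": {"NAFIS": 0.65, "Quran": 0.81, "Hadith": 0.79},
}


def domain_shift(scores: dict, domain: str) -> float:
    """Assumed definition: accuracy on NAFIS minus accuracy on `domain`."""
    return round(scores["NAFIS"] - scores[domain], 2)


def relative_improvement(scores: dict) -> float:
    """Assumed definition: percentage gain from NAFIS to the Hadith data."""
    return round(100 * (scores["Hadith"] - scores["NAFIS"]) / scores["NAFIS"], 1)


table_11 = {
    name: (domain_shift(s, "Quran"), domain_shift(s, "Hadith"), relative_improvement(s))
    for name, s in ACCURACY.items()
}
```

Under these assumptions `table_11` reproduces all nine entries of Table 11, so a negative Δ indicates higher accuracy on the classical data than on NAFIS.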
Table 12
Per-component segmentation accuracy across domains.
| COMPONENT/DATASET | FARASA | CAMEL | ALP |
|---|---|---|---|
| Prefix | | | |
| NAFIS | 0.689 | 0.892 | 0.865 |
| Qur’anic | 0.894 | 0.939 | 0.939 |
| Noor–Ghateh | 0.776 | 0.835 | 0.812 |
| Stem | | | |
| NAFIS | 0.189 | 0.730 | 0.811 |
| Qur’anic | 0.879 | 0.970 | 0.936 |
| Noor–Ghateh | 0.552 | 0.601 | 0.583 |
| Suffix | | | |
| NAFIS | 0.270 | 0.838 | 0.973 |
| Qur’anic | 0.813 | 0.939 | 0.939 |
| Noor–Ghateh | 0.628 | 0.742 | 0.701 |
Table 13
Morpheme-level segmentation accuracy and 95% confidence intervals on the Noor–Ghateh dataset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.563 | [0.508, 0.617] | 0.028 | 311 |
| CAMeL | 0.634 | [0.579, 0.688] | 0.027 | 311 |
| ALP | 0.374 | [0.321, 0.432] | 0.028 | 308 |
Table 14
Morpheme-level segmentation accuracy and 95% confidence intervals on the Qur’anic subset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.785 | [0.776, 0.795] | 0.005 | 6,839 |
| CAMeL | 0.826 | [0.817, 0.835] | 0.005 | 6,837 |
| ALP | 0.840 | [0.832, 0.849] | 0.004 | 6,828 |
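
The tables do not state how the intervals in Tables 13 and 14 were obtained. As a rough cross-check, the sketch below uses the normal approximation to a binomial proportion, which is an assumption here; it reproduces the reported standard deviations to three decimals and the interval bounds to within about 0.005. The function name `normal_ci` is hypothetical.

```python
import math


def normal_ci(accuracy: float, n: int, z: float = 1.96):
    """Standard error and normal-approximation interval for a proportion."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return se, (accuracy - z * se, accuracy + z * se)


# Farasa on Noor–Ghateh (Table 13): accuracy 0.563 over 311 tokens.
se_small, ci_small = normal_ci(0.563, 311)

# ALP on the Qur’anic subset (Table 14): accuracy 0.840 over 6,828 tokens.
se_large, ci_large = normal_ci(0.840, 6828)
```

The contrast between the two calls shows why the Noor–Ghateh intervals are roughly six times wider: the standard error shrinks with the square root of the sample size, and that sample is about 22 times smaller.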
