1 Context and Motivation
Arabic presents a uniquely challenging setting for computational morphology due to its non-concatenative root–and–pattern system, high clitic density, and historically variable orthography. These properties make accurate segmentation a prerequisite for most downstream NLP tasks, including parsing, IR, and machine translation. Yet the interaction of orthographic variation, sparse annotated resources, and domain-specific morphological patterns often leads to substantial word-boundary ambiguity and system instability.
Current morphological analyzers such as Farasa (Abdelali et al., 2016), CAMeL Tools (Obeid et al., 2020), and ALP (Freihat et al., 2018) perform competitively on standard Modern Standard Arabic (MSA) benchmarks. However, widely used resources—including PATB (Maamouri et al., 2004), SAMA (Maamouri et al., 2010), and NAFIS (Namly et al., 2016)—are drawn primarily from contemporary newswire and expository prose. As a result, they underrepresent the templatic regularity, archaic lexical items, multi-clitic constructions, and stylistic conservatism characteristic of Classical Arabic domains such as Quranic exegesis and jurisprudential literature. This domain mismatch frequently results in systematic degradation when analyzers trained on MSA are applied to historically or religiously significant corpora.
This phenomenon reflects the broader problem of domain sensitivity, whereby models internalize domain-specific statistical regularities that fail to generalize to new textual environments. Extensive research documents this behavior in both classical and neural NLP models (Ramponi & Plank, 2020), with foundational work demonstrating limited cross-domain transferability (Blitzer et al., 2006; Daumé III, 2007; Pan & Yang, 2010). Even large language models benefit noticeably from domain-adaptive pretraining (Gururangan et al., 2020). For Arabic, diglossia, genre diversity, and diachronic linguistic evolution intensify these effects. Studies on dialect detection (Zaidan & Callison-Burch, 2012), social media processing (Darwish, 2014), and genre comparisons (Alharbi & Lee, 2022; Obeid et al., 2020) consistently show substantial performance drops outside MSA-trained domains. Furthermore, Classical and Quranic Arabic preserve morphological and lexical features—such as archaic templates, heavy clitic stacking, and rare stem patterns—not present in modern corpora (Aljumaily, 2022; Dukes & Habash, 2010a).
Despite these findings, prior work has not offered a systematic, multi-domain evaluation spanning Modern, Classical Scriptural, and Classical Jurisprudential Arabic. Such a comparison is methodologically valuable because these domains differ not only in vocabulary and orthography but also in morphological structure: Quranic Arabic tends toward templatic regularity, Hadith and legal prose exhibit dense clitic concatenation, and MSA displays greater lexical diversity and syntactic flexibility. Evaluating analyzers across these distinct registers therefore provides a more complete picture of robustness and error sources.
This study fills this gap by introducing a unified tri-domain evaluation framework that assesses Farasa (Abdelali et al., 2016), CAMeL Tools, and ALP on three representative corpora: NAFIS for MSA, the Qur’anic Corpus for Scriptural Arabic, and Noor–Ghateh for Classical Hadith and jurisprudential text. The framework incorporates a consistent normalization and segmentation-alignment pipeline, and later sections introduce bootstrap confidence intervals and paired non-parametric tests to quantify uncertainty and accommodate the substantial size disparities among datasets.
The contributions of this discussion paper are as follows:
A tri-domain evaluation framework that integrates unified normalization, segmentation alignment, and statistical uncertainty estimation.
A systematic comparison of three widely used morphological analyzers across Modern, Classical Scriptural, and Classical Jurisprudential Arabic.
Empirical identification of domain-specific weaknesses in clitic boundary resolution, stem analysis, and handling of archaic morphological patterns.
Methodological insights motivating the development of domain-balanced resources and standardized benchmarking protocols for Arabic NLP.
This work is presented as a JOHD Discussion Paper and is designed to complement a separate Noor–Ghateh data paper, which contains full documentation of the dataset’s XML schema and annotation process. Section 2 outlines the dataset characteristics, Section 3 describes the evaluation methodology, Section 4 presents the findings, and Section 5 discusses broader implications for domain-aware Arabic NLP.
2 Dataset Description
This section provides the complete metadata specification for the primary contribution of this paper: the Noor-Ghateh dataset, a novel benchmark for Arabic morphological segmentation in the Classical Hadith domain. The description follows the recommended schema for data papers. Details regarding the comparison datasets (NAFIS (Namly et al., 2016) and Qur’anic Corpus v0.4 (Dukes & Habash, 2010a)) used in our tri-domain evaluation framework are provided in Section 4 (Method).
Repository Location: The dataset is openly available on Zenodo: https://zenodo.org/records/18138582
Repository Name: Zenodo
Object Name: NoorGhateh_v3.csv
Format Names and Versions: The dataset is distributed as a UTF-8 encoded, comma-delimited CSV file converted from the original XML annotation source. Each row corresponds to a single annotated morpheme and preserves the full set of XML attributes. The Arabic right-to-left text is preserved as-is, making the data directly compatible with standard NLP preprocessing tools.
Creation Dates: 2023-07-01 to 2024-05-01
Dataset Creators:
Computer Research Center for Islamic Sciences (Noor) – data preparation and annotation
Behrouz Minaei-Bidgoli – academic supervision
Sayyed-Ali Hossayni – institutional supervision and validation
Huda AlShuhayeb – dataset validation, analysis, and paper preparation
Language: Arabic (Classical Hadith and Jurisprudential style)
License: Creative Commons Attribution 4.0 International (CC-BY 4.0)
Publication Date: 2025-10-17
Dataset Composition and Annotation Schema. The Noor-Ghateh dataset comprises 223,690 manually annotated words extracted from the classical jurisprudential text Shariat al-Islam, organized across 52 thematic chapters. This represents, to our knowledge, the largest morphologically annotated resource specifically dedicated to the Hadith domain. A curated subset of 313 tokens is publicly released to facilitate immediate benchmarking and tool evaluation.
The statistical composition of the corpus is summarized in Table 1. At a finer granularity, the distribution of prefix and suffix morphemes, along with average morpheme density per word, is reported in Table 2.
Table 1
Overall statistics of the Noor-Ghateh dataset.
| PROPERTY | COUNT |
|---|---|
| Sentences | 10,160 |
| Word tokens (annotated) | 223,690 |
| Unique lemmas | 16,420 |
| Unique roots | 8,150 |
Table 2
Morpheme-level composition in the Noor-Ghateh dataset.
| MORPHEME TYPE | OCCURRENCES |
|---|---|
| Prefix morphemes | 74,242 |
| Suffix morphemes | 18,617 |
| Average morphemes per word | 1.42 |
| Estimated clitic density (per 100 tokens) | 33.1 |
2.1 Corpus Format
The Noor–Ghateh corpus is authored natively in XML and distributed in UTF-8. Each <word> element encodes a single segment (prefix, stem, or suffix) with a consistent set of attributes capturing orthographic, grammatical, and morphological information. A representative example of the XML-based annotation structure is shown in Figure 1.

Figure 1
Sample XML annotation from the Noor–Ghateh dataset.
Attributes used in the XML are summarized in Table 3. The segmentation follows a linear convention: words are decomposed into sequences of <word> elements corresponding to clitics and stems in order (prefix* → stem → suffix*). For example, the Arabic word “الفجر” is segmented into the prefix “ال” and the stem “فجر”.
Table 3
Annotation schema in the Noor-Ghateh dataset.
| FIELD | DESCRIPTION |
|---|---|
| Seq | Sequential index of the word or morpheme within its sentence |
| Slice | Surface form of the token as it appears in the original text |
| Entry | Canonical or normalized form of the morpheme |
| Affix | Morphological category (prefix, stem, or suffix) |
| POS | Part-of-speech tag (e.g., noun, verb, particle) |
| Case | Grammatical case or syntactic role |
| Kol | Functional classification (e.g., conjunction, preposition) |
| Lemma | Base lemma associated with the stem |
| Categ | Morphological pattern type (e.g., triliteral root pattern) |
| DervT | Derivational type (e.g., adjective, participle) |
| Num | Grammatical number (singular, dual, plural) |
| Root | Underlying triliteral or quadriliteral Arabic root |
CSV export. For ease of benchmarking, the XML is provided as a flat CSV (NoorGhateh_v3.csv; UTF-8). Each row corresponds to one XML <word> element and preserves all attributes as columns. Right-to-left text is retained; fields are comma-delimited and quoted for safe parsing in Python, R, or other data-analysis environments.
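As a quick illustration of the distribution format, the following sketch loads the CSV with pandas and inspects the morpheme-level columns. The file and column names follow the metadata and Table 3 above, but readers should verify them against the released file; the category label used for filtering is an assumption.

```python
import pandas as pd

# Minimal loading sketch (file and column names per the metadata and Table 3;
# verify against the Zenodo release before use).
df = pd.read_csv("NoorGhateh_v3.csv", encoding="utf-8")

# Each row is one morpheme (prefix, stem, or suffix) of an orthographic word.
print(df.columns.tolist())          # Seq, Slice, Entry, Affix, POS, Case, ...
print(df["Affix"].value_counts())   # distribution of prefix / stem / suffix rows

# Example: inspect suffix morphemes together with their POS tags and lemmas
# ("suffix" is an assumed value of the Affix column; adjust to the actual labels).
suffixes = df[df["Affix"] == "suffix"][["Slice", "Entry", "POS", "Lemma"]]
print(suffixes.head())
```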
The dataset’s creation involved a custom-developed annotation tool and rigorous validation protocol, with technical implementation details provided in Section 4. This resource fills a critical gap in Arabic NLP by providing a high-quality, domain-specific benchmark for evaluating morphological analysis systems on classical religious texts.
2.2 Segmentation Schemes and Tokenization Conventions
The three datasets used in this study follow distinct segmentation conventions shaped by their linguistic registers and underlying orthographic traditions. To enable a valid cross-domain comparison, we identify the specific morphological boundaries recognized by each segmentation scheme.
NAFIS (MSA-Oriented). NAFIS (Namly et al., 2016) utilizes a standard Modern Standard Arabic (MSA) morphological decomposition. It follows a prefix–stem–suffix structure, where common proclitics (e.g., w+, f+, b+, l+, al+) and enclitic pronouns (e.g., +h, +hm, +na) are explicitly separated from the stem.
Quranic Corpus (Classical Scriptural; Dukes & Habash, 2010a). Reflecting its historical orthography, the Quranic segmentation scheme is comparatively conservative. While it prioritizes morphological patterns specific to Classical Arabic, clitic separation is less aggressive than in MSA-oriented schemes. Some tokens preserve complex internal morphological structures that are specific to the Quranic register and differ from contemporary orthographic norms.
Noor–Ghateh (Classical Jurisprudence). Noor–Ghateh applies a templatic segmentation approach that emphasizes full clitic decomposition. Given the agglutinative nature of legal and Hadith texts, this scheme is designed to handle stacked morphemes, where multiple prefixes or suffixes are attached to a single root–pattern structure.
Relation to the Noor–Ghateh Data Paper
This discussion paper is complemented by a separate data paper dedicated to the Noor–Ghateh dataset, currently submitted to JOHD. That companion paper provides the full technical documentation of the annotation schema, including the XML-based <word> structure, field definitions, segmentation rules, and examples. To avoid redundancy, the present discussion paper provides only a high-level description of the dataset and focuses instead on the comparative evaluation of morphological analyzers across domains. Readers interested in the complete schema, attribute definitions, and annotation methodology are referred to the Noor–Ghateh data paper.
3 Method
This study investigates the cross-domain robustness of Arabic morphological analyzers by evaluating three state-of-the-art systems—Farasa, CAMeL, and ALP—across three distinct textual domains: Modern Standard Arabic (MSA), Classical Hadith Arabic, and Quranic Arabic. The methodology combines a controlled tri-domain benchmark, a harmonized annotation scheme, and consistent evaluation and statistical procedures to enable a fair and reproducible comparison.
3.1 Experimental Design: The Tri-Domain Benchmark
To examine the effect of domain variation on Arabic morphological analysis, we employ a tri-domain evaluation framework. The benchmark consists of three corpora, each representing a distinct register of written Arabic: Modern Standard Arabic, Classical Scriptural Arabic, and Classical Jurisprudential Arabic. An overview of the datasets is provided in Table 4.
Table 4
Dataset Overview and Linguistic Register.
| DATASET | REGISTER | KEY CHARACTERISTICS |
|---|---|---|
| NAFIS | Modern Standard Arabic | High morphological ambiguity; MSA gold standard |
| Quranic Corpus | Classical Scriptural Arabic | Conservative segmentation; historical orthography |
| Noor–Ghateh | Classical Jurisprudential Arabic | Dense clitic stacking; templatic morphology |
NAFIS (Namly et al., 2016). NAFIS is a gold-standard corpus developed for the evaluation of Arabic stemmers and segmentation systems. In this study, we use a released subset consisting of 172 manually annotated word forms. The selected tokens exhibit high levels of morphological ambiguity and clitic attachment, providing a compact test set for evaluating segmentation behavior in Modern Standard Arabic.
Quranic Corpus (Dukes & Habash, 2010b). The Quranic Corpus is a linguistically annotated resource representing Classical Scriptural Arabic. Our evaluation uses a subset covering the chapters Al-Fātiḥa and Al-Baqarah. This subset has been widely used in prior work and reflects conservative segmentation practices grounded in historical orthography.
Noor–Ghateh. Noor–Ghateh is derived from the jurisprudential text Sharāyeʿ al-Islām and represents Classical Hadith and legal Arabic. The corpus is characterized by extensive clitic agglutination and templatic morphological structures. Detailed documentation of the annotation scheme, XML schema, and attribute inventory is provided in a companion JOHD data paper. The present study limits its description to Noor–Ghateh’s role within the cross-domain evaluation.
Data Harmonization. All corpora were converted into a unified representation containing two fields: Token, corresponding to the surface form, and Segmentation, representing the gold-standard morpheme sequence. This harmonization enables consistent evaluation across domains while preserving the original segmentation conventions of each dataset.
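As a small illustration of this unified representation, the sketch below builds the Token and Segmentation fields from a word's ordered morphemes, using the الفجر example from Section 2.1; the function name is illustrative and not part of the released pipeline.

```python
def harmonize(morphemes: list[str]) -> dict[str, str]:
    """Build the unified two-field representation from the ordered morphemes
    (prefix* -> stem -> suffix*) of one orthographic word."""
    return {
        "Token": "".join(morphemes),         # surface form
        "Segmentation": "+".join(morphemes)  # gold-standard morpheme sequence
    }

print(harmonize(["ال", "فجر"]))
# {'Token': 'الفجر', 'Segmentation': 'ال+فجر'}
```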
Relation to the Companion Data Paper. The Noor–Ghateh XML schema, attribute inventory, and detailed annotation methodology are fully documented in a separate JOHD data paper. The present discussion paper therefore restricts itself to a high-level description of Noor–Ghateh and focuses on the cross-domain evaluation framework and findings.
3.2 Systems Under Evaluation
Three widely used Arabic morphological analyzers were selected based on accessibility, popularity, and linguistic coverage:
Farasa. Farasa (Abdelali et al., 2016) is a high-speed segmenter based on a rank-SVM model trained on the Penn Arabic Treebank. It is optimized for MSA newswire and has been deployed in information retrieval and machine translation pipelines.
CAMeL Tools. CAMeL Tools (Obeid et al., 2020) is an open-source Python toolkit that integrates tokenization, morphological analysis, disambiguation, POS tagging, sentiment analysis, and named entity recognition. In this study, we use its deep morphological analyzer with the CALIMA-MSA database and default clitic-segmentation settings.
ALP. ALP (Freihat et al., 2018) is a single-model pipeline for segmentation, POS tagging, and named-entity recognition built on top of OpenNLP. It employs a maximum entropy POS tagger and draws on training data from the Aljazeera and Altibbi corpora, making it competitive in both formal and domain-specific settings.
All systems were run using their publicly available implementations and default configurations, to reflect realistic research usage. Where necessary, only input normalization (Section 3.5) was applied, without altering the internal behavior of the analyzers.
3.3 Evaluation Metric
Each analyzer was evaluated on each dataset using a unified segmentation alignment script. Because the analyzers differ in tokenization conventions and clitic attachment, the script ensures that each predicted morpheme sequence is aligned and compared directly with the corresponding gold-standard segmentation.
The primary evaluation metric is:
Morpheme-level Segmentation Accuracy: the proportion of correctly predicted morphemes relative to the total number of gold-standard morphemes. Accuracy is computed separately for prefixes, stems, and suffixes, and then aggregated across all tokens. This component-based metric captures how precisely each system reproduces gold morphological boundaries within tokens.
In addition, we report token-level exact-match accuracy, where a token is counted as correct only if its entire morpheme sequence exactly matches the gold segmentation. This complementary metric provides a stricter view of segmentation performance across domains, and the corresponding results are reported in Table 10.
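To make the two metrics concrete, the sketch below computes morpheme-level accuracy and token-level exact match from gold and predicted segmentations in the "+"-delimited form used throughout this paper. It is an illustrative re-implementation under stated assumptions (positional NULL padding as described in Section 3.5), not the released evaluation script.

```python
NULL = "<NULL>"

def pad(gold: list[str], pred: list[str]) -> tuple[list[str], list[str]]:
    """Positional alignment: pad the shorter morpheme list with NULL (Section 3.5)."""
    width = max(len(gold), len(pred))
    return gold + [NULL] * (width - len(gold)), pred + [NULL] * (width - len(pred))

def evaluate(gold_segs: list[str], pred_segs: list[str]) -> dict[str, float]:
    correct = total = exact = 0
    for g_seg, p_seg in zip(gold_segs, pred_segs):
        g, p = pad(g_seg.split("+"), p_seg.split("+"))
        correct += sum(gm == pm for gm, pm in zip(g, p) if gm != NULL)
        total += sum(gm != NULL for gm in g)   # count gold morphemes only
        exact += int(g_seg == p_seg)           # token-level exact match
    return {"morpheme_acc": correct / total,
            "token_exact": exact / len(gold_segs)}

# Toy example using canonical forms from Table 8 (the predictions are invented).
gold = ["ل+يستغفر+وا", "ف+ان+هم", "او+ب+ال+حق"]
pred = ["ل+يستغفر+وا", "فان+هم", "او+ب+ال+حق"]
print(evaluate(gold, pred))   # {'morpheme_acc': 0.7, 'token_exact': 0.666...}
```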
3.4 Segmentation Schemes and Clitic Representation
The three evaluated corpora—NAFIS (Modern Standard Arabic), the Qur’anic Corpus (Classical Arabic), and Noor–Ghateh (Classical Hadith Arabic)—all segment orthographic words into clitic-like morphemes. However, they differ substantially in (i) segmentation depth, (ii) consistency of prefix–stem–suffix structure, and (iii) treatment of multi-clitic stacks.
Table 5 provides a compressed comparison of how major Arabic clitics are represented across the three datasets.
Table 5
Compressed comparison of clitic representation across datasets.
| CLITIC | TYPE | FUNCTION | NAFIS | QUR’ANIC CORPUS | NOOR–GHATEH |
|---|---|---|---|---|---|
| و | Conjunction | “and” | Proclitic: و+قال | Split or attached morphologically | Always separated: و+يفعلون |
| ف | Conj./resultative | “then/so” | ف+قال | Attached in token; split in morphology | Explicit split: ف+إنهم |
| ب/ك/ل | Prepositional proclitics | “with/by”, “as”, “for” | ب+ال+بيت | Separated morphologically | ب+كلمة |
| ال | Definite article | Definiteness marking | Separated after preps | Morphological morpheme | Always explicit |
| Pronominal suffixes | Enclitics | Possession/object | كتاب+هم | Morphologically split | قول+كم |
| Multi-clitic stacks | Agglutinative forms | Complex tokens | Rare | Frequent in classical text | Dense stacks in legal style |
These systematic differences necessitated a unified normalization and alignment procedure, described in the following subsection.
3.5 Normalization and Alignment Pipeline
Because the three corpora differ in tokenization depth, clitic treatment, and orthographic conventions, we implemented a deterministic normalization and alignment pipeline to ensure comparability across domains and analyzers.
Orthographic Normalization (Gold and Input). All corpora were first normalized using a shared set of rules, including:
Mapping of all alif variants {أ، إ، آ} to bare alif (ا);
Removal of Arabic diacritics, including shadda and tanwīn;
Removal of tatwīl (ـ) and other non-semantic marks;
Unicode normalization and cleanup of non-Arabic control characters;
Standardization of punctuation spacing and whitespace collapsing.
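These rules can be approximated with a handful of substitutions; the minimal sketch below illustrates them (the character ranges and helper name are assumptions, not the released normalization script).

```python
import re
import unicodedata

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanwin, harakat, shadda, sukun, dagger alif
TATWEEL = "\u0640"                                 # tatwil character

def normalize(text: str) -> str:
    """Approximate the orthographic normalization rules listed above."""
    text = unicodedata.normalize("NFKC", text)     # Unicode cleanup
    text = re.sub(r"[أإآ]", "ا", text)              # map alif variants to bare alif
    text = DIACRITICS.sub("", text)                 # strip diacritics incl. shadda/tanwin
    text = text.replace(TATWEEL, "")                # remove tatwil
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(normalize("إيمانهم"))   # -> ايمانهم (cf. Table 6)
```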
Representative normalization examples are shown in Table 6.
Table 6
Representative orthographic normalization examples.
| ORIGINAL | NORMALIZED | NOTE |
|---|---|---|
| إيمانهم | ايمانهم | Hamzated alif unified |
| مسئولية | مسؤولية | Hamza normalization |
| هديٰهم | هداهم | Dagger alif removed |
Segmentation Normalization (Gold). Gold-standard segmentations were mapped into a unified prefix–stem–suffix representation: dataset-specific conventions were harmonized by explicitly separating all proclitics and enclitics. Examples of clitic-boundary harmonization are provided in Table 7.
Table 7
Representative clitic-boundary harmonization.
| TOKEN | ALIGNED | NOTE |
|---|---|---|
| وبالكتاب | و+ب+ال+كتاب | Standardized proclitics |
| فزادهم | ف+زاد+هم | Qur’anic ف split |
| تجارتهم | تجارت+هم | Unified enclitic rule |
Canonical Token Mapping. After boundary normalization, tokens were mapped to a canonical segmentation form. Table 8 shows one representative example per domain.
Table 8
Canonical segmentation examples across domains.
| SURFACE | CANONICAL | DOMAIN |
|---|---|---|
| ليستغفروا | ل+يستغفر+وا | Qur’anic |
| فانهم | ف+ان+هم | Noor–Ghateh |
| اوبالحق | او+ب+ال+حق | NAFIS |
Gold–Prediction Alignment. Gold and predicted morphemes were aligned by splitting on the “+” symbol. When systems over- or under-segmented a token, NULL placeholders were inserted to preserve positional correspondence. Representative alignment outcomes are shown in Table 9.
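As a simple illustration of this procedure, the sketch below splits both segmentations on "+" and pads the shorter side with NULL placeholders; the prediction is invented, and the released script may resolve over- and under-segmentation differently.

```python
def align(gold_seg: str, pred_seg: str, null: str = "<NULL>"):
    """Positionally align gold and predicted morphemes of one token."""
    gold, pred = gold_seg.split("+"), pred_seg.split("+")
    width = max(len(gold), len(pred))
    gold += [null] * (width - len(gold))
    pred += [null] * (width - len(pred))
    return list(zip(gold, pred))

# Invented under-segmentation of the Table 7 example و+ب+ال+كتاب:
# the prediction has three morphemes, so the final gold position pairs with NULL.
for g, p in align("و+ب+ال+كتاب", "وب+ال+كتاب"):
    print(f"gold={g}\tpred={p}")
```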
3.6 Cross-Domain Comparative Framework
To summarize domain sensitivity, we measure the relative performance difference between Modern Standard Arabic (MSA, represented by NAFIS) and the two Classical domains. The decline in accuracy for a given domain is defined as:

Δ_domain = ACC_MSA − ACC_domain,

where ACC_MSA denotes token-level exact-match accuracy on NAFIS, and ACC_domain is the corresponding score on either Noor–Ghateh or the Qur’anic corpus. Negative values indicate improved performance on Classical text relative to MSA.
In addition to this scalar summary, we qualitatively analyze common error types (e.g., clitic boundary errors, suffix attachment errors, and misanalysis of rare Classical forms) in Section 4, where we present detailed error categories and their distribution across domains.
3.7 Reproducibility Materials
To ensure complete transparency and reproducibility, all resources required to replicate the benchmark will be released upon publication:
Segmentation alignment and normalization scripts;
Configuration files and command-line parameters for Farasa (Abdelali et al., 2016), CAMeL Tools, and ALP;
Token-level evaluation subsets extracted from each corpus, after normalization;
Example analyzer outputs for each domain, aligned with gold segmentations.
These materials will be deposited in a public repository (Zenodo/GitHub) with a persistent DOI, enabling independent verification and facilitating downstream reuse of the evaluation framework.
4 Results and Discussion
This section reports the quantitative and qualitative findings of the tri-domain evaluation, focusing on how morphological analyzers respond to variation across Modern Standard Arabic (MSA), Classical Hadith Arabic, and Scriptural Arabic. Results are presented at both the token and morpheme levels to provide a comprehensive view of system behavior.
4.1 Token-Level Segmentation Accuracy
Table 10 presents exact-match token-level accuracy. All three analyzers perform substantially better on the Qur’anic and Hadith datasets than on NAFIS. This pattern reflects the comparatively stable morphological structures of classical texts, where clitic attachment is more predictable and lexical borrowings are minimal. In contrast, MSA contains greater lexical diversity, modern derivations, and non-canonical clitic structures, which together introduce ambiguity for segmentation systems.
Table 10
Token-level segmentation accuracy (exact match).
| DATASET | FARASA | CAMEL | ALP |
|---|---|---|---|
| NAFIS (MSA) | 0.59 | 0.68 | 0.65 |
| Qur’anic Corpus | 0.76 | 0.80 | 0.81 |
| Noor–Ghateh (Hadith) | 0.81 | 0.81 | 0.79 |
Token-level accuracy is intentionally reported alongside the morpheme-level results that follow. Token-level scores highlight holistic segmentation success, while morpheme-level accuracy (Tables 13–15) captures fine-grained clitic and stem boundary behavior.
4.2 Domain Sensitivity
Domain sensitivity is measured as the difference between MSA performance and performance on each classical dataset. Negative values correspond to improved accuracy in classical Arabic. The measured domain sensitivity values and relative performance improvements for each analyzer are reported in Table 11.
Table 11
Domain sensitivity (Δdomain) and relative improvements.
| ANALYZER | ΔQuran | ΔHadith | IMPROVEMENT (%) |
|---|---|---|---|
| Farasa | –0.17 | –0.22 | 37.3 |
| CAMeL | –0.12 | –0.13 | 19.1 |
| ALP | –0.16 | –0.14 | 21.5 |
All systems improve markedly on classical data. These gains should be interpreted cautiously, as classical Arabic is morphologically regular but lexically distinct; predictable templatic patterns can ease segmentation even when the lexicon differs from modern usage.
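As a worked check, the snippet below recomputes Δ_domain and the improvement column from the Table 10 exact-match scores; the improvement percentage appears to be the relative gain on Noor–Ghateh over NAFIS, which reproduces the Table 11 values (this interpretation is our assumption).

```python
# Token-level exact-match accuracy from Table 10.
acc = {
    "Farasa": {"MSA": 0.59, "Quran": 0.76, "Hadith": 0.81},
    "CAMeL":  {"MSA": 0.68, "Quran": 0.80, "Hadith": 0.81},
    "ALP":    {"MSA": 0.65, "Quran": 0.81, "Hadith": 0.79},
}

for system, a in acc.items():
    d_quran = a["MSA"] - a["Quran"]                        # Δ_Quran
    d_hadith = a["MSA"] - a["Hadith"]                      # Δ_Hadith
    improvement = 100 * (a["Hadith"] - a["MSA"]) / a["MSA"]
    print(f"{system}: ΔQuran={d_quran:+.2f}  ΔHadith={d_hadith:+.2f}  "
          f"improvement={improvement:.1f}%")
# Farasa: ΔQuran=-0.17  ΔHadith=-0.22  improvement=37.3%
```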
4.3 Component-Level Results
Table 12 reports prefix, stem, and suffix accuracy. Prefix segmentation is consistently strong due to the limited inventory of Arabic proclitics. Suffix segmentation is more difficult, reflecting the richness of Arabic pronominal and inflectional endings. Stem accuracy shows the greatest domain-driven variation, especially for Farasa on NAFIS, where non-canonical MSA forms and loanwords are more common.
Table 12
Per-component segmentation accuracy across domains.
| COMPONENT/DATASET | FARASA | CAMEL | ALP |
|---|---|---|---|
| Prefix | |||
| NAFIS | 0.689 | 0.892 | 0.865 |
| Qur’anic | 0.894 | 0.939 | 0.939 |
| Noor–Ghateh | 0.776 | 0.835 | 0.812 |
| Stem | |||
| NAFIS | 0.189 | 0.730 | 0.811 |
| Qur’anic | 0.879 | 0.970 | 0.936 |
| Noor–Ghateh | 0.552 | 0.601 | 0.583 |
| Suffix | |||
| NAFIS | 0.270 | 0.838 | 0.973 |
| Qur’anic | 0.813 | 0.939 | 0.939 |
| Noor–Ghateh | 0.628 | 0.742 | 0.701 |
Stem errors in classical datasets frequently involve templatic forms such as Form VIII and Form X verbs, hollow verbs, or quadriliteral roots. These patterns appear frequently in the Qur’anic and Hadith corpora and can diverge from analyzers’ MSA-trained lexicons.
4.4 Uncertainty and Statistical Significance
Tables 13–15 show that bootstrap confidence intervals are narrow for large datasets (e.g., the Qur’anic subset) and wider for small datasets (e.g., NAFIS), as expected. All statistical comparisons use paired Wilcoxon signed-rank tests applied only to tokens for which both gold and predicted segmentations could be aligned.
Table 13
Morpheme-level segmentation accuracy and 95% confidence intervals on the Noor–Ghateh dataset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.563 | [0.508, 0.617] | 0.028 | 311 |
| CAMeL | 0.634 | [0.579, 0.688] | 0.027 | 311 |
| ALP | 0.374 | [0.321, 0.432] | 0.028 | 308 |
Table 14
Morpheme-level segmentation accuracy and 95% confidence intervals on the Qur’anic subset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.785 | [0.776, 0.795] | 0.005 | 6,839 |
| CAMeL | 0.826 | [0.817, 0.835] | 0.005 | 6,837 |
| ALP | 0.840 | [0.832, 0.849] | 0.004 | 6,828 |
Table 15
Morpheme-level segmentation accuracy and 95% confidence intervals on the NAFIS (MSA) dataset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.667 | [0.585, 0.741] | 0.040 | 135 |
| CAMeL | 0.793 | [0.726, 0.859] | 0.035 | 135 |
| ALP | 0.830 | [0.763, 0.889] | 0.032 | 135 |
ALP significantly outperforms CAMeL and Farasa on the Qur’anic corpus (p < 0.01), and CAMeL significantly outperforms Farasa across all domains. These results confirm that observed differences reflect genuine model behavior rather than sampling noise.
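For readers who want to reproduce the uncertainty estimates, the sketch below shows one standard way to obtain a percentile bootstrap confidence interval over per-token scores and a paired Wilcoxon signed-rank test between two systems. It is a generic illustration, not the released analysis code; the resampling parameters and the invented per-token vectors are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean per-token accuracy.
    `scores` is a 0/1 vector: 1 if the token was segmented correctly."""
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    return scores.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Invented per-token correctness vectors for two systems on the same tokens;
# in the study these come from the alignment script of Section 3.5.
sys_a = (rng.random(300) < 0.83).astype(int)
sys_b = (rng.random(300) < 0.79).astype(int)

acc, lo, hi = bootstrap_ci(sys_a)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

# Paired Wilcoxon signed-rank test on per-token score differences.
stat, p = wilcoxon(sys_a, sys_b, zero_method="wilcox")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.4f}")
```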
4.5 Clitic Density
Accuracy decreases as the number of attached clitics increases. Farasa shows the strongest decline, consistent with its optimization for ATB-style segmentation. CAMeL and ALP remain relatively stable across clitic counts, suggesting better modeling of multi-morpheme tokens.
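A simple way to reproduce this breakdown is to bin tokens by the number of gold clitics (non-stem morphemes) and average token-level correctness within each bin; the sketch below is illustrative, and the clitic-count definition is an assumption based on the prefix/stem/suffix convention used here.

```python
from collections import defaultdict

def clitic_count(gold_seg: str) -> int:
    """Number of non-stem morphemes, assuming exactly one stem per token
    in the prefix* + stem + suffix* convention used here."""
    return max(len(gold_seg.split("+")) - 1, 0)

def accuracy_by_clitic_count(gold_segs, pred_segs):
    buckets = defaultdict(list)
    for g, p in zip(gold_segs, pred_segs):
        buckets[clitic_count(g)].append(int(g == p))   # token-level exact match
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

# Toy example with invented predictions:
gold = ["فجر", "ال+فجر", "ف+ان+هم", "و+ب+ال+كتاب"]
pred = ["فجر", "الفجر", "ف+ان+هم", "و+ب+ال+كتاب"]
print(accuracy_by_clitic_count(gold, pred))   # {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0}
```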
4.6 Error Analysis
Common error classes include:
Clitic boundary errors involving common proclitics (e.g., و+, ف+, ب+)
Archaic or templatic forms absent from modern lexicons
Named entities, especially foreign-origin names
Multi-clitic sequences, which trigger cascading boundary errors
Misclassification patterns (Table 16) show that the most common error is over-splitting of prefixes or under-identification of suffixes—typical behaviors of systems trained primarily on MSA.
4.7 Reuse Potential
The multi-domain nature of this benchmark allows controlled comparison across linguistically distinct registers. Beyond tool evaluation, the resources support:
Extension to new corpora following the unified segmentation schema.
Comparative studies of templatic morphology across genres.
All reproducibility materials, including code and gold subsets, will be publicly released to facilitate such extensions.
4.8 Summary of Findings
Across domains, CAMeL achieves the highest overall accuracy, particularly on the Qur’anic corpus. ALP performs competitively, especially on suffix segmentation, while Farasa shows the greatest variability and is most sensitive to clitic density and non-MSA lexical forms. The Noor–Ghateh dataset reveals fundamental challenges associated with classical morphology and provides a strong benchmark for future research.
5 Implications and Applications
The results of this study offer methodological, technological, and linguistic insights with direct relevance to the development and evaluation of Arabic morphological analysis systems. By combining a tri-domain experimental design with transparent normalization and alignment procedures, this work demonstrates how domain variation shapes segmentation accuracy and highlights the practical importance of evaluating systems beyond a single register of Arabic.
5.1 Methodological Implications
The findings underscore the value of multi-domain benchmarking for morphologically rich languages. Evaluating systems only on Modern Standard Arabic risks masking systematic weaknesses that become salient in classical or scriptural text. The integration of uncertainty quantification through bootstrap confidence intervals and paired Wilcoxon tests further illustrates how statistical rigor can clarify the stability of performance estimates, especially when dataset sizes differ substantially.
The alignment and normalization procedures developed for this study—particularly the unified prefix–stem–suffix representation—provide a reusable protocol for future evaluations. As morphological analyzers vary in their tokenization conventions, explicit alignment ensures that comparative studies remain fair and replicable. Making these resources publicly available strengthens reproducibility and provides a methodological template for future domain-sensitivity research.
5.2 Applications in AI and Language Technology
The Noor–Ghateh dataset contributes a linguistically rich and manually validated resource for training and evaluating morphological models. Its fine-grained segmentation and extensive coverage of classical jurisprudential text make it suitable for:
supervised training of neural and hybrid models (e.g., BiLSTM, Transformer, BERT-based systems);
improving rule-based analyzers requiring explicit affix and templatic structure annotation;
diagnostic evaluation of clitic handling, stem recognition, and orthographic normalization.
The error patterns revealed in this study—especially failures on multi-clitic tokens and archaic templatic forms—can guide the design of more morphology-aware architectures. In practical applications such as information extraction, semantic search, and text classification for classical Arabic sources, such improvements are essential for model reliability.
5.3 Linguistic and Educational Applications
Beyond computational settings, Noor–Ghateh offers a structured resource for linguistic inquiry and pedagogy. Its explicit annotation of roots, patterns, clitics, and grammatical functions enables:
corpus-based instruction in Arabic morphology and syntax,
empirical studies of templatic variation in Classical Arabic,
comparative analyses across Quranic, Hadith, and Modern Standard Arabic,
visualization of morphological segmentation for teaching purposes.
The dataset’s coverage of 52 thematic chapters further supports investigations of intra-domain linguistic variation, including how morphological patterns correlate with genre, topic, and stylistic register.
5.4 Future Directions
The methodological framework established here can be extended to additional Arabic domains, including dialectal corpora, historical prose, and contemporary web text. Future work may explore domain-adaptive pretraining and fine-tuning strategies to reduce sensitivity to genre and register. Integrating Noor–Ghateh with established benchmarks such as PATB (Maamouri et al., 2004), SAMA (Maamouri et al., 2010), and NAFIS would help consolidate a unified evaluation suite for Arabic morphological analysis.
In summary, this study connects resource creation with systematic evaluation by introducing a reproducible framework for analyzing domain sensitivity in Arabic morphological tools. The Noor–Ghateh dataset not only enriches the landscape of Classical Arabic resources but also provides a diagnostic lens for building more robust, domain-aware Arabic NLP systems.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Huda AlShuhayeb: Conceptualization; Methodology; Data curation; Formal analysis; Validation; Visualization; Writing – original draft; Writing – review & editing.
Dr. Behrouz Minaei-Bidgoli: Supervision; Methodology; Conceptualization; Writing – review & editing.
Dr. Sayyed-Ali Hossayni: Supervision; Project administration; Resources; Validation; Writing – review & editing.
