
Domain Sensitivity in Arabic Morphological Analysis: A Multi-Corpus Evaluation of Farasa, CAMeL, and ALP Across Modern, Classical Religious, and Classical Jurisprudential Domains

Open Access | Jan 2026

1 Context and Motivation

Arabic presents a uniquely challenging setting for computational morphology due to its non-concatenative root-and-pattern system, high clitic density, and historically variable orthography. These properties make accurate segmentation a prerequisite for most downstream NLP tasks, including parsing, IR, and machine translation. Yet the interaction of orthographic variation, sparse annotated resources, and domain-specific morphological patterns often leads to substantial word-boundary ambiguity and system instability.

Current morphological analyzers such as Farasa (Abdelali et al., 2016), CAMeL Tools (Obeid et al., 2020), and ALP (Freihat et al., 2018) perform competitively on standard Modern Standard Arabic (MSA) benchmarks. However, widely used resources—including PATB (Maamouri et al., 2004), SAMA (Maamouri et al., 2010), and NAFIS (Namly et al., 2016)—are drawn primarily from contemporary newswire and expository prose. As a result, they underrepresent the templatic regularity, archaic lexical items, multi-clitic constructions, and stylistic conservatism characteristic of Classical Arabic domains such as Quranic exegesis and jurisprudential literature. This domain mismatch frequently results in systematic degradation when analyzers trained on MSA are applied to historically or religiously significant corpora.

This phenomenon reflects the broader problem of domain sensitivity, whereby models internalize domain-specific statistical regularities that fail to generalize to new textual environments. Extensive research documents this behavior in both classical and neural NLP models (Ramponi & Plank, 2020), with foundational work demonstrating limited cross-domain transferability (Blitzer et al., 2006; Daumé III, 2007; Pan & Yang, 2010). Even large language models benefit noticeably from domain-adaptive pretraining (Gururangan et al., 2020). For Arabic, diglossia, genre diversity, and diachronic linguistic evolution intensify these effects. Studies on dialect detection (Zaidan & Callison-Burch, 2012), social media processing (Darwish, 2014), and genre comparisons (Alharbi & Lee, 2022; Obeid et al., 2020) consistently show substantial performance drops outside MSA-trained domains. Furthermore, Classical and Quranic Arabic preserve morphological and lexical features—such as archaic templates, heavy clitic stacking, and rare stem patterns—not present in modern corpora (Aljumaily, 2022; Dukes & Habash, 2010a).

Despite these findings, prior work has not offered a systematic, multi-domain evaluation spanning Modern, Classical Scriptural, and Classical Jurisprudential Arabic. Such a comparison is methodologically valuable because these domains differ not only in vocabulary and orthography but also in morphological structure: Quranic Arabic tends toward templatic regularity, Hadith and legal prose exhibit dense clitic concatenation, and MSA displays greater lexical diversity and syntactic flexibility. Evaluating analyzers across these distinct registers therefore provides a more complete picture of robustness and error sources.

This study fills this gap by introducing a unified tri-domain evaluation framework that assesses Farasa (Abdelali et al., 2016), CAMeL Tools, and ALP on three representative corpora: NAFIS for MSA, the Qur’anic Corpus for Scriptural Arabic, and Noor–Ghateh for Classical Hadith and jurisprudential text. The framework incorporates a consistent normalization and segmentation-alignment pipeline, and later sections introduce bootstrap confidence intervals and paired non-parametric tests to quantify uncertainty and accommodate the substantial size disparities among datasets.

The contributions of this discussion paper are as follows:

  • A tri-domain evaluation framework that integrates unified normalization, segmentation alignment, and statistical uncertainty estimation.

  • A systematic comparison of three widely used morphological analyzers across Modern, Classical Scriptural, and Classical Jurisprudential Arabic.

  • Empirical identification of domain-specific weaknesses in clitic boundary resolution, stem analysis, and handling of archaic morphological patterns.

  • Methodological insights motivating the development of domain-balanced resources and standardized benchmarking protocols for Arabic NLP.

This work is presented as a JOHD Discussion Paper and is designed to complement a separate Noor–Ghateh data paper, which contains full documentation of the dataset’s XML schema and annotation process. Section 2 outlines the dataset characteristics, Section 3 describes the evaluation methodology, Section 4 presents the findings, and Section 5 discusses broader implications for domain-aware Arabic NLP.

2 Dataset Description

This section provides the complete metadata specification for the primary contribution of this paper: the Noor-Ghateh dataset, a novel benchmark for Arabic morphological segmentation in the Classical Hadith domain. The description follows the recommended schema for data papers. Details regarding the comparison datasets (NAFIS (Namly et al., 2016) and Qur’anic Corpus v0.4 (Dukes & Habash, 2010a)) used in our tri-domain evaluation framework are provided in Section 3 (Method).

Repository Location: The dataset is openly available on Zenodo: https://zenodo.org/records/18138582

Repository Name: Zenodo

Object Name: NoorGhateh_v3.csv

Format Names and Versions: The dataset is distributed as a UTF-8 encoded CSV file converted from the original XML annotation source. Each row corresponds to a single annotated morpheme and preserves the full set of XML attributes. The file contains right-to-left Arabic text and is comma-delimited, making it directly compatible with standard NLP preprocessing tools.

Creation Dates: 2023-07-01 to 2024-05-01

Dataset Creators:

  • Computer Research Center for Islamic Sciences (Noor) – data preparation and annotation

  • Behrouz Minaei-Bidgoli – academic supervision

  • Sayyed-Ali Hossayni – institutional supervision and validation

  • Huda AlShuhayeb – dataset validation, analysis, and paper preparation

Language: Arabic (Classical Hadith and Jurisprudential style)

License: Creative Commons Attribution 4.0 International (CC-BY 4.0)

Publication Date: 2025-10-17

Dataset Composition and Annotation Schema. The Noor-Ghateh dataset comprises 223,690 manually annotated words extracted from the classical jurisprudential text Sharāyeʿ al-Islām, organized across 52 thematic chapters. This represents, to our knowledge, the largest morphologically annotated resource specifically dedicated to the Hadith domain. A curated subset of 313 tokens is publicly released to facilitate immediate benchmarking and tool evaluation.

The statistical composition of the corpus is summarized in Table 1. At a finer granularity, the distribution of prefix and suffix morphemes, along with average morpheme density per word, is reported in Table 2.

Table 1

Overall statistics of the Noor-Ghateh dataset.

PROPERTY | COUNT
Sentences | 10,160
Word tokens (annotated) | 223,690
Unique lemmas | 16,420
Unique roots | 8,150
Table 2

Morpheme-level composition in the Noor-Ghateh dataset.

MORPHEME TYPE | OCCURRENCES
Prefix morphemes | 74,242
Suffix morphemes | 18,617
Average morphemes per word | 1.42
Estimated clitic density (per 100 tokens) | 33.1
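As a consistency check, the average morpheme count in Table 2 can be recomputed from the published counts in Tables 1 and 2, treating each annotated word as one stem plus its prefix and suffix morphemes:

```python
# Recompute Table 2's "average morphemes per word" from the published counts:
# every word contributes one stem; each prefix/suffix adds one morpheme.
word_tokens = 223_690  # Table 1: word tokens (annotated)
prefixes = 74_242      # Table 2: prefix morphemes
suffixes = 18_617      # Table 2: suffix morphemes

avg_morphemes = (word_tokens + prefixes + suffixes) / word_tokens
print(round(avg_morphemes, 2))  # 1.42, matching Table 2
```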

2.1 Corpus Format

The Noor–Ghateh corpus is authored natively in XML and distributed in UTF-8. Each <word> element encodes a single segment (prefix, stem, or suffix) with a consistent set of attributes capturing orthographic, grammatical, and morphological information. A representative example of the XML-based annotation structure is shown in Figure 1.

Figure 1

Sample XML annotation from the Noor–Ghateh dataset.

Attributes used in the XML are summarized in Table 3. The segmentation follows a linear convention: words are decomposed into sequences of <word> elements corresponding to clitics and stems in order (prefix* → stem → suffix*). For example, the Arabic word “الفجر” is segmented into the prefix “ال” and the stem “فجر”.

Table 3

Annotation schema in the Noor-Ghateh dataset.

FIELD | DESCRIPTION
Seq | Sequential index of the word or morpheme within its sentence
Slice | Surface form of the token as it appears in the original text
Entry | Canonical or normalized form of the morpheme
Affix | Morphological category (prefix, stem, or suffix)
POS | Part-of-speech tag (e.g., noun, verb, particle)
Case | Grammatical case or syntactic role
Kol | Functional classification (e.g., conjunction, preposition)
Lemma | Base lemma associated with the stem
Categ | Morphological pattern type (e.g., triliteral root pattern)
DervT | Derivational type (e.g., adjective, participle)
Num | Grammatical number (singular, dual, plural)
Root | Underlying triliteral or quadriliteral Arabic root
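Given the field inventory in Table 3, the XML can be consumed with the standard library. The element layout below is an assumption based on the described <word> structure, not the exact Noor–Ghateh markup; the companion data paper documents the authoritative schema.

```python
# Hedged sketch: reading Noor–Ghateh-style <word> elements. Attribute names
# come from Table 3, but the element layout here is illustrative only.
import xml.etree.ElementTree as ET

sample = """
<sentence>
  <word Seq="1" Slice="ال" Affix="prefix" POS="particle"/>
  <word Seq="1" Slice="فجر" Affix="stem" POS="noun" Root="فجر"/>
</sentence>
"""

root = ET.fromstring(sample)
segments = [(w.get("Affix"), w.get("Slice")) for w in root.findall("word")]
print(segments)  # [('prefix', 'ال'), ('stem', 'فجر')]
```

This reproduces the الفجر example above: the article ال is a prefix segment and فجر the stem.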

CSV export. For ease of benchmarking, the XML is provided as a flat CSV (NoorGhateh_v3.csv; UTF-8). Each row corresponds to one XML <word> element and preserves all attributes as columns. Right-to-left text is retained; fields are comma-delimited and quoted for safe parsing in Python, R, or other data-analysis environments.
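Assuming a comma-separated export whose header row reuses Table 3's field names (the actual header may differ), the file loads with the standard library alone:

```python
# Hedged sketch: loading the flat CSV export. The header below reuses
# Table 3's field names; the real file's header may differ.
import csv
import io

sample_csv = io.StringIO(
    "Seq,Slice,Affix,POS\n"
    "1,ال,prefix,particle\n"
    "1,فجر,stem,noun\n"
)
rows = list(csv.DictReader(sample_csv))
stems = [r["Slice"] for r in rows if r["Affix"] == "stem"]
print(stems)  # ['فجر']
```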

The dataset’s creation involved a custom-developed annotation tool and rigorous validation protocol, with technical implementation details provided in Section 3. This resource fills a critical gap in Arabic NLP by providing a high-quality, domain-specific benchmark for evaluating morphological analysis systems on classical religious texts.

2.2 Segmentation Schemes and Tokenization Conventions

The three datasets used in this study follow distinct segmentation conventions shaped by their linguistic registers and underlying orthographic traditions. To enable a valid cross-domain comparison, we identify the specific morphological boundaries recognized by each segmentation scheme.

NAFIS (MSA-Oriented). NAFIS (Namly et al., 2016) utilizes a standard Modern Standard Arabic (MSA) morphological decomposition. It follows a prefix–stem–suffix structure, where common proclitics (e.g., w+, f+, b+, l+, al+) and enclitic pronouns (e.g., +h, +hm, +na) are explicitly separated from the stem.

Quranic Corpus (Dukes & Habash, 2010a) (Classical Scriptural). Reflecting its historical orthography, the Quranic segmentation scheme is comparatively conservative. While it prioritizes morphological patterns specific to Classical Arabic, clitic separation is less aggressive than in MSA-oriented schemes. Some tokens preserve complex internal morphological structures that are specific to the Quranic register and differ from contemporary orthographic norms.

Noor–Ghateh (Classical Jurisprudence). Noor–Ghateh applies a templatic segmentation approach that emphasizes full clitic decomposition. Given the agglutinative nature of legal and Hadith texts, this scheme is designed to handle stacked morphemes, where multiple prefixes or suffixes are attached to a single root–pattern structure.

Relation to the NoorGhateh Data Paper

This discussion paper is complemented by a separate data paper dedicated to the Noor–Ghateh dataset, currently submitted to JOHD. That companion paper provides the full technical documentation of the annotation schema, including the XML-based <word> structure, field definitions, segmentation rules, and examples. To avoid redundancy, the present discussion paper provides only a high-level description of the dataset and focuses instead on the comparative evaluation of morphological analyzers across domains. Readers interested in the complete schema, attribute definitions, and annotation methodology are referred to the Noor–Ghateh data paper.

3 Method

This study investigates the cross-domain robustness of Arabic morphological analyzers by evaluating three state-of-the-art systems—Farasa, CAMeL, and ALP—across three distinct textual domains: Modern Standard Arabic (MSA), Classical Hadith Arabic, and Quranic Arabic. The methodology combines a controlled tri-domain benchmark, a harmonized annotation scheme, and consistent evaluation and statistical procedures to enable a fair and reproducible comparison.

3.1 Experimental Design: The Tri-Domain Benchmark

To examine the effect of domain variation on Arabic morphological analysis, we employ a tri-domain evaluation framework. The benchmark consists of three corpora, each representing a distinct register of written Arabic: Modern Standard Arabic, Classical Scriptural Arabic, and Classical Jurisprudential Arabic. An overview of the datasets is provided in Table 4.

Table 4

Dataset Overview and Linguistic Register.

DATASET | REGISTER | KEY CHARACTERISTICS
NAFIS | Modern Standard Arabic | High morphological ambiguity; MSA gold standard
Quranic Corpus | Classical Scriptural Arabic | Conservative segmentation; historical orthography
Noor–Ghateh | Classical Jurisprudential Arabic | Dense clitic stacking; templatic morphology

NAFIS (Namly et al., 2016). NAFIS is a gold-standard corpus developed for the evaluation of Arabic stemmers and segmentation systems. In this study, we use a released subset consisting of 172 manually annotated word forms. The selected tokens exhibit high levels of morphological ambiguity and clitic attachment, providing a compact test set for evaluating segmentation behavior in Modern Standard Arabic.

Quranic Corpus (Dukes & Habash, 2010b). The Quranic Corpus is a linguistically annotated resource representing Classical Scriptural Arabic. Our evaluation uses a subset covering the chapters Al-Fātiḥa and Al-Baqarah. This subset has been widely used in prior work and reflects conservative segmentation practices grounded in historical orthography.

Noor–Ghateh. Noor–Ghateh is derived from the jurisprudential text Sharāyeʿ al-Islām and represents Classical Hadith and legal Arabic. The corpus is characterized by extensive clitic agglutination and templatic morphological structures. Detailed documentation of the annotation scheme, XML schema, and attribute inventory is provided in a companion JOHD data paper. The present study limits its description to Noor–Ghateh’s role within the cross-domain evaluation.

Data Harmonization. All corpora were converted into a unified representation containing two fields: Token, corresponding to the surface form, and Segmentation, representing the gold-standard morpheme sequence. This harmonization enables consistent evaluation across domains while preserving the original segmentation conventions of each dataset.
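A toy sketch of the unified two-field record (the helper name is ours, not the paper's):

```python
# Toy sketch of the harmonized representation: each record is reduced to a
# surface Token and a '+'-joined gold Segmentation.
def harmonize(token: str, morphemes: list[str]) -> dict:
    """Map a gold morpheme analysis to the unified (Token, Segmentation) record."""
    assert "".join(morphemes) == token, "segments must concatenate to the token"
    return {"Token": token, "Segmentation": "+".join(morphemes)}

record = harmonize("وبالكتاب", ["و", "ب", "ال", "كتاب"])
print(record)  # {'Token': 'وبالكتاب', 'Segmentation': 'و+ب+ال+كتاب'}
```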

Relation to the Companion Data Paper. The Noor–Ghateh XML schema, attribute inventory, and detailed annotation methodology are fully documented in a separate JOHD data paper. The present discussion paper therefore restricts itself to a high-level description of Noor–Ghateh and focuses on the cross-domain evaluation framework and findings.

3.2 Systems Under Evaluation

Three widely used Arabic morphological analyzers were selected based on accessibility, popularity, and linguistic coverage:

  • Farasa. Farasa (Abdelali et al., 2016) is a high-speed segmenter based on a rank-SVM model trained on the Penn Arabic Treebank. It is optimized for MSA newswire and has been deployed in information retrieval and machine translation pipelines.

  • CAMeL Tools. CAMeL Tools (Obeid et al., 2020) is an open-source Python toolkit that integrates tokenization, morphological analysis, disambiguation, POS tagging, sentiment analysis, and named entity recognition. In this study, we use its deep morphological analyzer with the CALIMA-MSA database and default clitic-segmentation settings.

  • ALP. ALP (Freihat et al., 2018) is a single-model pipeline for segmentation, POS tagging, and named-entity recognition built on top of OpenNLP. It employs a maximum entropy POS tagger and draws on training data from the Aljazeera and Altibbi corpora, making it competitive in both formal and domain-specific settings.

All systems were run using their publicly available implementations and default configurations, to reflect realistic research usage. Where necessary, only input normalization (Section 3.5) was applied, without altering the internal behavior of the analyzers.

3.3 Evaluation Metric

Each analyzer was evaluated on each dataset using a unified segmentation alignment script. Because the analyzers differ in tokenization conventions and clitic attachment, the script ensures that each predicted morpheme sequence is aligned and compared directly with the corresponding gold-standard segmentation.

The primary evaluation metric is:

  • Morpheme-level Segmentation Accuracy: the proportion of correctly predicted morphemes relative to the total number of gold-standard morphemes. Accuracy is computed separately for prefixes, stems, and suffixes, and then aggregated across all tokens. This component-based metric captures how precisely each system reproduces gold morphological boundaries within tokens.

In addition, we report token-level exact-match accuracy, where a token is counted as correct only if its entire morpheme sequence exactly matches the gold segmentation. This complementary metric provides a stricter view of segmentation performance across domains, and the corresponding results are reported in Table 10.
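The two metrics can be sketched as follows over already-aligned morpheme sequences; the alignment here is positional for brevity, whereas the paper's script pads with NULL on over- and under-segmentation (Section 3.5):

```python
# Minimal sketch of morpheme-level accuracy and token-level exact match.
def evaluate(pairs):
    """pairs: list of (gold_morphemes, predicted_morphemes)."""
    m_correct = m_total = exact = 0
    for gold, pred in pairs:
        m_correct += sum(g == p for g, p in zip(gold, pred))
        m_total += len(gold)
        exact += (gold == pred)
    return m_correct / m_total, exact / len(pairs)

pairs = [
    (["ف", "استغفر", "وا"], ["ف", "استغفر", "وا"]),  # exact match
    (["و", "كتاب", "هم"], ["و", "كتابهم"]),          # under-segmentation
]
morpheme_acc, token_acc = evaluate(pairs)
print(morpheme_acc, token_acc)  # 4/6 morphemes correct; 1/2 tokens exact
```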

3.4 Segmentation Schemes and Clitic Representation

The three evaluated corpora—NAFIS (Modern Standard Arabic), the Qur’anic Corpus (Classical Arabic), and Noor–Ghateh (Classical Hadith Arabic)—all segment orthographic words into clitic-like morphemes. However, they differ substantially in (i) segmentation depth, (ii) consistency of prefix–stem–suffix structure, and (iii) treatment of multi-clitic stacks.

Table 5 provides a compressed comparison of how major Arabic clitics are represented across the three datasets.

Table 5

Compressed comparison of clitic representation across datasets.

CLITIC | TYPE | FUNCTION | NAFIS | QUR’ANIC CORPUS | NOOR–GHATEH
و | Conjunction | “and” | Proclitic: و+قال | Split or attached morphologically | Always separated: و+يفعلون
ف | Conj./resultative | “then/so” | ف+قال | Attached in token; split in morphology | Explicit split: ف+إنهم
ب/ك/ل | Prepositional proclitics | “with/by”, “as”, “for” | ب+ال+بيت | Separated morphologically | ب+كلمة
ال | Definite article | Definiteness marking | Separated after preps | Morphological morpheme | Always explicit
Pronominal suffixes | Enclitics | Possession/object | كتاب+هم | Morphologically split | قول+كم
Multi-clitic stacks | Agglutinative forms | Complex tokens | Rare | Frequent in classical text | Dense stacks in legal style

These systematic differences necessitated a unified normalization and alignment procedure, described in the following subsection.

3.5 Normalization and Alignment Pipeline

Because the three corpora differ in tokenization depth, clitic treatment, and orthographic conventions, we implemented a deterministic normalization and alignment pipeline to ensure comparability across domains and analyzers.

Orthographic Normalization (Gold and Input). All corpora were first normalized using a shared set of rules, including:

  1. Mapping of all alif variants {أ، إ، آ} to bare alif (ا);

  2. Removal of Arabic diacritics, including shadda and tanwīn;

  3. Removal of tatwīl (ـ, U+0640) and other non-semantic marks;

  4. Unicode normalization and cleanup of non-Arabic control characters;

  5. Standardization of punctuation spacing and whitespace collapsing.
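The rules above can be sketched as follows; the character ranges are illustrative, and the paper's exact rule set may be broader:

```python
# Minimal sketch of the orthographic normalization rules (alif unification,
# diacritic and tatwil stripping, Unicode cleanup, whitespace collapsing).
import re
import unicodedata

DIACRITICS = re.compile("[\u064B-\u0652\u0670]")  # tanwin, harakat, shadda, sukun, dagger alif

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # rule 4: Unicode normalization
    text = re.sub("[أإآ]", "ا", text)           # rule 1: unify alif variants
    text = DIACRITICS.sub("", text)             # rule 2: strip diacritics
    text = text.replace("\u0640", "")           # rule 3: strip tatwil
    text = re.sub(r"\s+", " ", text).strip()    # rule 5: collapse whitespace
    return text

print(normalize("إيمانهم"))  # ايمانهم (Table 6, first row)
```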

Representative normalization examples are shown in Table 6.

Table 6

Representative orthographic normalization examples.

ORIGINAL | NORMALIZED | NOTE
إيمانهم | ايمانهم | Hamzated alif unified
مسئولية | مسؤولية | Hamza normalization
هديٰهم | هداهم | Dagger alif removed

Segmentation Normalization (Gold). Gold-standard segmentations were mapped into a unified prefix–stem–suffix representation:

PREFIX_1 + … + PREFIX_n + STEM + SUFFIX_1 + … + SUFFIX_m.

Dataset-specific conventions were harmonized by explicitly separating all proclitics and enclitics. Examples of clitic-boundary harmonization are provided in Table 7.

Table 7

Representative clitic-boundary harmonization.

TOKEN | ALIGNED | NOTE
وبالكتاب | و+ب+ال+كتاب | Standardized proclitics
فزادهم | ف+زاد+هم | Qur’anic ف split
تجارتهم | تجارت+هم | Unified enclitic rule

Canonical Token Mapping. After boundary normalization, tokens were mapped to a canonical segmentation form. Table 8 shows one representative example per domain.

Table 8

Canonical segmentation examples across domains.

SURFACE | CANONICAL | DOMAIN
ليستغفروا | ل+يستغفر+وا | Qur’anic
فانهم | ف+ان+هم | Noor–Ghateh
اوبالحق | او+ب+ال+حق | NAFIS

Gold–Prediction Alignment. Gold and predicted morphemes were aligned by splitting on the “+” symbol. When systems over- or under-segmented a token, NULL placeholders were inserted to preserve positional correspondence. Representative alignment outcomes are shown in Table 9.

Table 9

Representative gold–prediction alignment outcomes.

TOKEN | GOLD | PREDICTION | OUTCOME
فاستغفروا | ف+استغفر+وا | CAMeL correct | Perfect match
وكتابهم | و+كتاب+هم | Farasa: و+كتابهم | Under-segmentation (1 error)
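The NULL-padding alignment described above can be sketched as a positional simplification (the paper's actual script may align more carefully):

```python
# Sketch of the NULL-padding alignment: morphemes are split on '+' and the
# shorter side is padded so positions stay comparable.
def align(gold: str, pred: str) -> list[tuple[str, str]]:
    g, p = gold.split("+"), pred.split("+")
    width = max(len(g), len(p))
    g += ["NULL"] * (width - len(g))
    p += ["NULL"] * (width - len(p))
    return list(zip(g, p))

# The Farasa under-segmentation case from Table 9:
print(align("و+كتاب+هم", "و+كتابهم"))
# [('و', 'و'), ('كتاب', 'كتابهم'), ('هم', 'NULL')]
```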

3.6 Cross-Domain Comparative Framework

To summarize domain sensitivity, we measure the relative performance difference between Modern Standard Arabic (MSA, represented by NAFIS) and the two Classical domains. The decline in accuracy for a given domain is defined as:

Δ_domain = ACC_MSA − ACC_domain,    (1)

where ACC_MSA denotes token-level exact-match accuracy on NAFIS, and ACC_domain is the corresponding score on either Noor–Ghateh or the Qur’anic corpus. Negative values indicate improved performance on Classical text relative to MSA.

In addition to this scalar summary, we qualitatively analyze common error types (e.g., clitic boundary errors, suffix attachment errors, and misanalysis of rare Classical forms) in Section 4, where we present detailed error categories and their distribution across domains.

3.7 Reproducibility Materials

To ensure complete transparency and reproducibility, all resources required to replicate the benchmark will be released upon publication:

  • Segmentation alignment and normalization scripts;

  • Configuration files and command-line parameters for Farasa (Abdelali et al., 2016), CAMeL Tools, and ALP;

  • Token-level evaluation subsets extracted from each corpus, after normalization;

  • Example analyzer outputs for each domain, aligned with gold segmentations.

These materials will be deposited in a public repository (Zenodo/GitHub) with a persistent DOI, enabling independent verification and facilitating downstream reuse of the evaluation framework.

4 Results and Discussion

This section reports the quantitative and qualitative findings of the tri-domain evaluation, focusing on how morphological analyzers respond to variation across Modern Standard Arabic (MSA), Classical Hadith Arabic, and Scriptural Arabic. Results are presented at both the token and morpheme levels to provide a comprehensive view of system behavior.

4.1 Token-Level Segmentation Accuracy

Table 10 presents exact-match token-level accuracy. All three analyzers perform substantially better on the Qur’anic and Hadith datasets than on NAFIS. This pattern reflects the comparatively stable morphological structures of classical texts, where clitic attachment is more predictable and lexical borrowings are minimal. In contrast, MSA contains greater lexical diversity, modern derivations, and non-canonical clitic structures, which introduce ambiguity for segmentation systems.

Table 10

Token-level segmentation accuracy (exact match).

DATASET | FARASA | CAMEL | ALP
NAFIS (MSA) | 0.59 | 0.68 | 0.65
Qur’anic Corpus | 0.76 | 0.80 | 0.81
Noor–Ghateh (Hadith) | 0.81 | 0.81 | 0.79

Token-level accuracy is intentionally reported alongside the morpheme-level results that follow. Token-level scores highlight holistic segmentation success, while morpheme-level accuracy (Tables 13–15) captures fine-grained clitic and stem boundary behavior.

4.2 Domain Sensitivity

Domain sensitivity is measured as the difference between MSA performance and performance on each classical dataset. Negative values correspond to improved accuracy in classical Arabic. The measured domain sensitivity values and relative performance improvements for each analyzer are reported in Table 11.

Table 11

Domain sensitivity (Δ_domain) and relative improvements.

ANALYZER | Δ_Quran | Δ_Hadith | IMPROVEMENT (%)
Farasa | –0.17 | –0.22 | 37.3
CAMeL | –0.12 | –0.13 | 19.1
ALP | –0.16 | –0.14 | 21.5
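As a sanity check, the Δ values in Table 11 follow directly from Equation (1) in Section 3.6 and the token-level accuracies in Table 10:

```python
# Reproduce Table 11's deltas via Equation (1): delta = ACC_MSA - ACC_domain.
acc = {  # token-level exact-match accuracy, Table 10
    "Farasa": {"NAFIS": 0.59, "Quran": 0.76, "Hadith": 0.81},
    "CAMeL":  {"NAFIS": 0.68, "Quran": 0.80, "Hadith": 0.81},
    "ALP":    {"NAFIS": 0.65, "Quran": 0.81, "Hadith": 0.79},
}

deltas = {
    name: (round(s["NAFIS"] - s["Quran"], 2), round(s["NAFIS"] - s["Hadith"], 2))
    for name, s in acc.items()
}
print(deltas["Farasa"])  # (-0.17, -0.22), matching Table 11
```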

All systems improve markedly on classical data. These gains should be interpreted cautiously, as classical Arabic is morphologically regular but lexically distinct; predictable templatic patterns can ease segmentation even when the lexicon differs from modern usage.

4.3 Component-Level Results

Table 12 reports prefix, stem, and suffix accuracy. Prefix segmentation is consistently strong due to the limited inventory of Arabic proclitics. Suffix segmentation is more difficult, reflecting the richness of Arabic pronominal and inflectional endings. Stem accuracy shows the greatest domain-driven variation, especially for Farasa on NAFIS, where non-canonical MSA forms and loanwords are more common.

Table 12

Per-component segmentation accuracy across domains.

COMPONENT / DATASET | FARASA | CAMEL | ALP
Prefix
NAFIS | 0.689 | 0.892 | 0.865
Qur’anic | 0.894 | 0.939 | 0.939
Noor–Ghateh | 0.776 | 0.835 | 0.812
Stem
NAFIS | 0.189 | 0.730 | 0.811
Qur’anic | 0.879 | 0.970 | 0.936
Noor–Ghateh | 0.552 | 0.601 | 0.583
Suffix
NAFIS | 0.270 | 0.838 | 0.973
Qur’anic | 0.813 | 0.939 | 0.939
Noor–Ghateh | 0.628 | 0.742 | 0.701

Stem errors in classical datasets frequently involve templatic forms such as Form VIII and Form X verbs, hollow verbs, or quadriliteral roots. These patterns appear frequently in the Qur’anic and Hadith corpora and can diverge from analyzers’ MSA-trained lexicons.

4.4 Uncertainty and Statistical Significance

Tables 13, 14, and 15 show that bootstrap confidence intervals are narrow for large datasets (e.g., Qur’anic) and wider for small datasets (e.g., NAFIS), as expected. All statistical comparisons use paired Wilcoxon signed-rank tests applied only to tokens where both gold and predicted segmentations were aligned.
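A percentile bootstrap over per-token correctness scores can be sketched as follows; the resampling settings here are illustrative, not the paper's exact configuration:

```python
# Illustrative percentile bootstrap CI over per-token correctness scores
# (1 = exact match, 0 = miss). Parameters are ours, not the paper's.
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

toy_scores = [1] * 80 + [0] * 20  # toy sample with 0.80 token accuracy
lo, hi = bootstrap_ci(toy_scores)
print(lo, hi)  # bounds bracket the 0.80 sample mean; wider for smaller samples
```

With only 100 toy tokens the interval is wide, mirroring the broad CIs reported for the small NAFIS subset in Table 15.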

Table 13

Morpheme-level segmentation accuracy and 95% confidence intervals on the Noor–Ghateh dataset.

SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS
Farasa | 0.563 | [0.508, 0.617] | 0.028 | 311
CAMeL | 0.634 | [0.579, 0.688] | 0.027 | 311
ALP | 0.374 | [0.321, 0.432] | 0.028 | 308
Table 14

Morpheme-level segmentation accuracy and 95% confidence intervals on the Qur’anic subset.

SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS
Farasa | 0.785 | [0.776, 0.795] | 0.005 | 6,839
CAMeL | 0.826 | [0.817, 0.835] | 0.005 | 6,837
ALP | 0.840 | [0.832, 0.849] | 0.004 | 6,828
Table 15

Morpheme-level segmentation accuracy and 95% confidence intervals on the NAFIS (MSA) dataset.

SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS
Farasa | 0.667 | [0.585, 0.741] | 0.040 | 135
CAMeL | 0.793 | [0.726, 0.859] | 0.035 | 135
ALP | 0.830 | [0.763, 0.889] | 0.032 | 135

ALP significantly outperforms CAMeL and Farasa on the Qur’anic corpus (p < 0.01), and CAMeL significantly outperforms Farasa across all domains. These results indicate that the observed differences reflect genuine model behavior rather than sampling noise.

4.5 Clitic Density

Accuracy decreases as the number of attached clitics increases. Farasa shows the strongest decline, consistent with its optimization for ATB-style segmentation. CAMeL and ALP remain relatively stable across clitic counts, suggesting better modeling of multi-morpheme tokens.
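The breakdown described here can be computed by bucketing tokens on clitic count; a minimal sketch with toy data:

```python
# Sketch of the clitic-density breakdown: bucket tokens by clitic count
# (morphemes minus one stem) and compute exact-match accuracy per bucket.
from collections import defaultdict

def accuracy_by_clitic_count(pairs):
    """pairs: list of (gold_segmentation, predicted_segmentation) strings."""
    buckets = defaultdict(lambda: [0, 0])  # clitic count -> [correct, total]
    for gold, pred in pairs:
        clitics = gold.count("+")  # '+' count = morphemes minus the stem
        buckets[clitics][0] += (gold == pred)
        buckets[clitics][1] += 1
    return {k: c / t for k, (c, t) in sorted(buckets.items())}

toy = [
    ("قال", "قال"),                  # 0 clitics, correct
    ("و+قال", "و+قال"),              # 1 clitic, correct
    ("و+ب+ال+كتاب", "و+بالكتاب"),    # 3 clitics, under-segmented
]
print(accuracy_by_clitic_count(toy))  # {0: 1.0, 1: 1.0, 3: 0.0}
```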

4.6 Error Analysis

Common error classes include:

  • Clitic boundary errors (e.g., the proclitics و+, ف+, and ب+)

  • Archaic or templatic forms absent from modern lexicons

  • Named entities, especially foreign-origin names

  • Multi-clitic sequences, which trigger cascading boundary errors

Misclassification patterns (Table 16) show that the most common error is over-splitting of prefixes or under-identification of suffixes—typical behaviors of systems trained primarily on MSA.

Table 16

Component-level misclassification patterns on the Noor–Ghateh dataset, reported as proportions of tokens.

SYSTEM | PREFIX→STEM | STEM→PREFIX | SUFFIX→STEM | OTHER
Farasa | 0.14 | 0.02 | 0.18 | 0.09
CAMeL | 0.09 | 0.01 | 0.11 | 0.06
ALP | 0.11 | 0.01 | 0.13 | 0.07

4.7 Reuse Potential

The multi-domain nature of this benchmark allows controlled comparison across linguistically distinct registers. Beyond tool evaluation, the resources support:

  • Extension to new corpora following the unified segmentation schema.

  • Comparative studies of templatic morphology across genres.

All reproducibility materials, including code and gold subsets, will be publicly released to facilitate such extensions.

4.8 Summary of Findings

Across domains, CAMeL achieves the highest overall accuracy, particularly on the Qur’anic corpus. ALP performs competitively, especially on suffix segmentation, while Farasa shows the greatest variability and is most sensitive to clitic density and non-MSA lexical forms. The Noor–Ghateh dataset reveals fundamental challenges associated with classical morphology and provides a strong benchmark for future research.

5 Implications and Applications

The results of this study offer methodological, technological, and linguistic insights with direct relevance to the development and evaluation of Arabic morphological analysis systems. By combining a tri-domain experimental design with transparent normalization and alignment procedures, this work demonstrates how domain variation shapes segmentation accuracy and highlights the practical importance of evaluating systems beyond a single register of Arabic.

5.1 Methodological Implications

The findings underscore the value of multi-domain benchmarking for morphologically rich languages. Evaluating systems only on Modern Standard Arabic risks masking systematic weaknesses that become salient in classical or scriptural text. The integration of uncertainty quantification through bootstrap confidence intervals and paired Wilcoxon tests further illustrates how statistical rigor can clarify the stability of performance estimates, especially when dataset sizes differ substantially.

The alignment and normalization procedures developed for this study—particularly the unified prefix–stem–suffix representation—provide a reusable protocol for future evaluations. As morphological analyzers vary in their tokenization conventions, explicit alignment ensures that comparative studies remain fair and replicable. Making these resources publicly available strengthens reproducibility and provides a methodological template for future domain-sensitivity research.

5.2 Applications in AI and Language Technology

The Noor–Ghateh dataset contributes a linguistically rich and manually validated resource for training and evaluating morphological models. Its fine-grained segmentation and extensive coverage of classical jurisprudential text make it suitable for:

  • supervised training of neural and hybrid models (e.g., BiLSTM, Transformer, BERT-based systems);

  • improving rule-based analyzers requiring explicit affix and templatic structure annotation;

  • diagnostic evaluation of clitic handling, stem recognition, and orthographic normalization.

The error patterns revealed in this study—especially failures on multi-clitic tokens and archaic templatic forms—can guide the design of more morphology-aware architectures. In practical applications such as information extraction, semantic search, and text classification for classical Arabic sources, such improvements are essential for model reliability.

5.3 Linguistic and Educational Applications

Beyond computational settings, Noor–Ghateh offers a structured resource for linguistic inquiry and pedagogy. Its explicit annotation of roots, patterns, clitics, and grammatical functions enables:

  • corpus-based instruction in Arabic morphology and syntax,

  • empirical studies of templatic variation in Classical Arabic,

  • comparative analyses across Quranic, Hadith, and Modern Standard Arabic,

  • visualization of morphological segmentation for teaching purposes.

The dataset’s coverage of 52 thematic chapters further supports investigations of intra-domain linguistic variation, including how morphological patterns correlate with genre, topic, and stylistic register.

5.4 Future Directions

The methodological framework established here can be extended to additional Arabic domains, including dialectal corpora, historical prose, and contemporary web text. Future work may explore domain-adaptive pretraining and fine-tuning strategies to reduce sensitivity to genre and register. Integrating Noor–Ghateh with established benchmarks such as PATB (Maamouri et al., 2004), SAMA (Maamouri et al., 2010), and NAFIS would help consolidate a unified evaluation suite for Arabic morphological analysis.

In summary, this study connects resource creation with systematic evaluation by introducing a reproducible framework for analyzing domain sensitivity in Arabic morphological tools. The Noor–Ghateh dataset not only enriches the landscape of Classical Arabic resources but also provides a diagnostic lens for building more robust, domain-aware Arabic NLP systems.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Huda AlShuhayeb: Conceptualization; Methodology; Data curation; Formal analysis; Validation; Visualization; Writing – original draft; Writing – review & editing.

Dr. Behrouz Minaei-Bidgoli: Supervision; Methodology; Conceptualization; Writing – review & editing.

Dr. Sayyed-Ali Hossayni: Supervision; Project administration; Resources; Validation; Writing – review & editing.

DOI: https://doi.org/10.5334/johd.418 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 18, 2025
|
Accepted on: Jan 5, 2026
|
Published on: Jan 30, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Behrouz Minaei-Bidgoli, Huda AlShuhayeb, Sayyed-Ali Hossayni, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.