1 Context and Motivation
Arabic presents a uniquely challenging setting for computational morphology due to its non-concatenative root–and–pattern system, high clitic density, and historically variable orthography. These properties make accurate segmentation a prerequisite for most downstream NLP tasks, including parsing, IR, and machine translation. Yet the interaction of orthographic variation, sparse annotated resources, and domain-specific morphological patterns often leads to substantial word-boundary ambiguity and system instability.
Current morphological analyzers such as Farasa (Abdelali et al., 2016), CAMeL Tools (Obeid et al., 2020), and ALP (Freihat et al., 2018) perform competitively on standard Modern Standard Arabic (MSA) benchmarks. However, widely used resources—including PATB (Maamouri et al., 2004), SAMA (Maamouri et al., 2010), and NAFIS (Namly et al., 2016)—are drawn primarily from contemporary newswire and expository prose. As a result, they underrepresent the templatic regularity, archaic lexical items, multi-clitic constructions, and stylistic conservatism characteristic of Classical Arabic domains such as Quranic exegesis and jurisprudential literature. This domain mismatch frequently results in systematic degradation when analyzers trained on MSA are applied to historically or religiously significant corpora.
This phenomenon reflects the broader problem of domain sensitivity, whereby models internalize domain-specific statistical regularities that fail to generalize to new textual environments. Extensive research documents this behavior in both classical and neural NLP models (Ramponi & Plank, 2020), with foundational work demonstrating limited cross-domain transferability (Blitzer et al., 2006; Daumé III, 2007; Pan & Yang, 2010). Even large language models benefit noticeably from domain-adaptive pretraining (Gururangan et al., 2020). For Arabic, diglossia, genre diversity, and diachronic linguistic evolution intensify these effects. Studies on dialect detection (Zaidan & Callison-Burch, 2012), social media processing (Darwish, 2014), and genre comparisons (Alharbi & Lee, 2022; Obeid et al., 2020) consistently show substantial performance drops outside MSA-trained domains. Furthermore, Classical and Quranic Arabic preserve morphological and lexical features—such as archaic templates, heavy clitic stacking, and rare stem patterns—not present in modern corpora (Aljumaily, 2022; Dukes & Habash, 2010a).
Despite these findings, prior work has not offered a systematic, multi-domain evaluation spanning Modern, Classical Scriptural, and Classical Jurisprudential Arabic. Such a comparison is methodologically valuable because these domains differ not only in vocabulary and orthography but also in morphological structure: Quranic Arabic tends toward templatic regularity, Hadith and legal prose exhibit dense clitic concatenation, and MSA displays greater lexical diversity and syntactic flexibility. Evaluating analyzers across these distinct registers therefore provides a more complete picture of robustness and error sources.
This study fills this gap by introducing a unified tri-domain evaluation framework that assesses Farasa (Abdelali et al., 2016), CAMeL Tools, and ALP on three representative corpora: NAFIS for MSA, the Qur’anic Corpus for Scriptural Arabic, and Noor–Ghateh for Classical Hadith and jurisprudential text. The framework incorporates a consistent normalization and segmentation-alignment pipeline, and later sections introduce bootstrap confidence intervals and paired non-parametric tests to quantify uncertainty and accommodate the substantial size disparities among datasets.
The contributions of this discussion paper are as follows:
A tri-domain evaluation framework that integrates unified normalization, segmentation alignment, and statistical uncertainty estimation.
A systematic comparison of three widely used morphological analyzers across Modern, Classical Scriptural, and Classical Jurisprudential Arabic.
Empirical identification of domain-specific weaknesses in clitic boundary resolution, stem analysis, and handling of archaic morphological patterns.
Methodological insights motivating the development of domain-balanced resources and standardized benchmarking protocols for Arabic NLP.
This work is presented as a JOHD Discussion Paper and is designed to complement a separate Noor–Ghateh data paper, which contains full documentation of the dataset’s XML schema and annotation process. Section 2 outlines the dataset characteristics, Section 3 describes the evaluation methodology, Section 4 presents the findings, and Section 5 discusses broader implications for domain-aware Arabic NLP.
2 Dataset Description
This section provides the complete metadata specification for the primary contribution of this paper: the Noor-Ghateh dataset, a novel benchmark for Arabic morphological segmentation in the Classical Hadith domain. The description follows the recommended schema for data papers. Details regarding the comparison datasets (NAFIS (Namly et al., 2016) and Qur’anic Corpus v0.4 (Dukes & Habash, 2010a)) used in our tri-domain evaluation framework are provided in Section 4 (Method).
Repository Location: The dataset is openly available on Zenodo: https://zenodo.org/records/18138582
Repository Name: Zenodo
Object Name: NoorGhateh_v3.csv
Format Names and Versions: The dataset is distributed as a UTF-8 encoded, comma-delimited CSV file converted from the original XML annotation source. Each row corresponds to a single annotated morpheme and preserves the full set of XML attributes. The Arabic right-to-left text is preserved as-is, making the data directly compatible with standard NLP preprocessing tools.
Creation Dates: 2023-07-01 to 2024-05-01
Dataset Creators:
Computer Research Center for Islamic Sciences (Noor) – data preparation and annotation
Behrouz Minaei-Bidgoli – academic supervision
Sayyed-Ali Hossayni – institutional supervision and validation
Huda AlShuhayeb – dataset validation, analysis, and paper preparation
Language: Arabic (Classical Hadith and Jurisprudential style)
License: Creative Commons Attribution 4.0 International (CC-BY 4.0)
Publication Date: 2025-10-17
Dataset Composition and Annotation Schema. The Noor-Ghateh dataset comprises 223,690 manually annotated words extracted from the classical jurisprudential text Shariat al-Islam, organized across 52 thematic chapters. This represents, to our knowledge, the largest morphologically annotated resource specifically dedicated to the Hadith domain. A curated subset of 313 tokens is publicly released to facilitate immediate benchmarking and tool evaluation.
The statistical composition of the corpus is summarized in Table 1. At a finer granularity, the distribution of prefix and suffix morphemes, along with average morpheme density per word, is reported in Table 2.
Table 1
Overall statistics of the Noor-Ghateh dataset.
| PROPERTY | COUNT |
|---|---|
| Sentences | 10,160 |
| Word tokens (annotated) | 223,690 |
| Unique lemmas | 16,420 |
| Unique roots | 8,150 |
Table 2
Morpheme-level composition in the Noor-Ghateh dataset.
| MORPHEME TYPE | OCCURRENCES |
|---|---|
| Prefix morphemes | 74,242 |
| Suffix morphemes | 18,617 |
| Average morphemes per word | 1.42 |
| Estimated clitic density (per 100 tokens) | 33.1 |
2.1 Corpus Format
The Noor–Ghateh corpus is authored natively in XML and distributed in UTF-8. Each <word> element encodes a single segment (prefix, stem, or suffix) with a consistent set of attributes capturing orthographic, grammatical, and morphological information. A representative example of the XML-based annotation structure is shown in Figure 1.

Figure 1
Sample XML annotation from the Noor–Ghateh dataset.
Attributes used in the XML are summarized in Table 3. The segmentation follows a linear convention: words are decomposed into sequences of <word> elements corresponding to clitics and stems in order (prefix* → stem → suffix*). For example, the Arabic word “الفجر” is segmented into the prefix “ال” and the stem “فجر”.
Table 3
Annotation schema in the Noor-Ghateh dataset.
| FIELD | DESCRIPTION |
|---|---|
| Seq | Sequential index of the word or morpheme within its sentence |
| Slice | Surface form of the token as it appears in the original text |
| Entry | Canonical or normalized form of the morpheme |
| Affix | Morphological category (prefix, stem, or suffix) |
| POS | Part-of-speech tag (e.g., noun, verb, particle) |
| Case | Grammatical case or syntactic role |
| Kol | Functional classification (e.g., conjunction, preposition) |
| Lemma | Base lemma associated with the stem |
| Categ | Morphological pattern type (e.g., triliteral root pattern) |
| DervT | Derivational type (e.g., adjective, participle) |
| Num | Grammatical number (singular, dual, plural) |
| Root | Underlying triliteral or quadriliteral Arabic root |
CSV export. For ease of benchmarking, the XML is provided as a flat CSV (NoorGhateh_v3.csv; UTF-8). Each row corresponds to one XML <word> element and preserves all attributes as columns. Right-to-left text is retained; fields are comma-delimited and quoted for safe parsing in Python, R, or other data-analysis environments.
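As a quick illustration of the distribution format, the following sketch loads the CSV with pandas and inspects the morpheme-level columns. The file and column names follow the metadata and Table 3 above, but readers should verify them against the released file; the category label used for filtering is an assumption.

```python
import pandas as pd

# Minimal loading sketch (file and column names per the metadata and Table 3;
# verify against the Zenodo release before use).
df = pd.read_csv("NoorGhateh_v3.csv", encoding="utf-8")

# Each row is one morpheme (prefix, stem, or suffix) of an orthographic word.
print(df.columns.tolist())          # Seq, Slice, Entry, Affix, POS, Case, ...
print(df["Affix"].value_counts())   # distribution of prefix / stem / suffix rows

# Example: inspect suffix morphemes together with their POS tags and lemmas
# ("suffix" is an assumed value of the Affix column; adjust to the actual labels).
suffixes = df[df["Affix"] == "suffix"][["Slice", "Entry", "POS", "Lemma"]]
print(suffixes.head())
```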
The dataset’s creation involved a custom-developed annotation tool and rigorous validation protocol, with technical implementation details provided in Section 4. This resource fills a critical gap in Arabic NLP by providing a high-quality, domain-specific benchmark for evaluating morphological analysis systems on classical religious texts.
2.2 Segmentation Schemes and Tokenization Conventions
The three datasets used in this study follow distinct segmentation conventions shaped by their linguistic registers and underlying orthographic traditions. To enable a valid cross-domain comparison, we identify the specific morphological boundaries recognized by each segmentation scheme.
NAFIS (MSA-Oriented). NAFIS (Namly et al., 2016) utilizes a standard Modern Standard Arabic (MSA) morphological decomposition. It follows a prefix–stem–suffix structure, where common proclitics (e.g., w+, f+, b+, l+, al+) and enclitic pronouns (e.g., +h, +hm, +na) are explicitly separated from the stem.
Quranic Corpus (Classical Scriptural; Dukes & Habash, 2010a). Reflecting its historical orthography, the Quranic segmentation scheme is comparatively conservative. While it prioritizes morphological patterns specific to Classical Arabic, clitic separation is less aggressive than in MSA-oriented schemes. Some tokens preserve complex internal morphological structures that are specific to the Quranic register and differ from contemporary orthographic norms.
Noor–Ghateh (Classical Jurisprudence). Noor–Ghateh applies a templatic segmentation approach that emphasizes full clitic decomposition. Given the agglutinative nature of legal and Hadith texts, this scheme is designed to handle stacked morphemes, where multiple prefixes or suffixes are attached to a single root–pattern structure.
Relation to the Noor–Ghateh Data Paper
This discussion paper is complemented by a separate data paper dedicated to the Noor–Ghateh dataset, currently submitted to JOHD. That companion paper provides the full technical documentation of the annotation schema, including the XML-based <word> structure, field definitions, segmentation rules, and examples. To avoid redundancy, the present discussion paper provides only a high-level description of the dataset and focuses instead on the comparative evaluation of morphological analyzers across domains. Readers interested in the complete schema, attribute definitions, and annotation methodology are referred to the Noor–Ghateh data paper.
3 Method
This study investigates the cross-domain robustness of Arabic morphological analyzers by evaluating three state-of-the-art systems—Farasa, CAMeL, and ALP—across three distinct textual domains: Modern Standard Arabic (MSA), Classical Hadith Arabic, and Quranic Arabic. The methodology combines a controlled tri-domain benchmark, a harmonized annotation scheme, and consistent evaluation and statistical procedures to enable a fair and reproducible comparison.
3.1 Experimental Design: The Tri-Domain Benchmark
To examine the effect of domain variation on Arabic morphological analysis, we employ a tri-domain evaluation framework. The benchmark consists of three corpora, each representing a distinct register of written Arabic: Modern Standard Arabic, Classical Scriptural Arabic, and Classical Jurisprudential Arabic. An overview of the datasets is provided in Table 4.
Table 4
Dataset Overview and Linguistic Register.
| DATASET | REGISTER | KEY CHARACTERISTICS |
|---|---|---|
| NAFIS | Modern Standard Arabic | High morphological ambiguity; MSA gold standard |
| Quranic Corpus | Classical Scriptural Arabic | Conservative segmentation; historical orthography |
| Noor–Ghateh | Classical Jurisprudential Arabic | Dense clitic stacking; templatic morphology |
NAFIS (Namly et al., 2016). NAFIS is a gold-standard corpus developed for the evaluation of Arabic stemmers and segmentation systems. In this study, we use a released subset consisting of 172 manually annotated word forms. The selected tokens exhibit high levels of morphological ambiguity and clitic attachment, providing a compact test set for evaluating segmentation behavior in Modern Standard Arabic.
Quranic Corpus (Dukes & Habash, 2010b). The Quranic Corpus is a linguistically annotated resource representing Classical Scriptural Arabic. Our evaluation uses a subset covering the chapters Al-Fātiḥa and Al-Baqarah. This subset has been widely used in prior work and reflects conservative segmentation practices grounded in historical orthography.
Noor–Ghateh. Noor–Ghateh is derived from the jurisprudential text Sharāyeʿ al-Islām and represents Classical Hadith and legal Arabic. The corpus is characterized by extensive clitic agglutination and templatic morphological structures. Detailed documentation of the annotation scheme, XML schema, and attribute inventory is provided in a companion JOHD data paper. The present study limits its description to Noor–Ghateh’s role within the cross-domain evaluation.
Data Harmonization. All corpora were converted into a unified representation containing two fields: Token, corresponding to the surface form, and Segmentation, representing the gold-standard morpheme sequence. This harmonization enables consistent evaluation across domains while preserving the original segmentation conventions of each dataset.
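As a small illustration of this unified representation, the sketch below builds the Token and Segmentation fields from a word's ordered morphemes, using the الفجر example from Section 2.1; the function name is illustrative and not part of the released pipeline.

```python
def harmonize(morphemes: list[str]) -> dict[str, str]:
    """Build the unified two-field representation from the ordered morphemes
    (prefix* -> stem -> suffix*) of one orthographic word."""
    return {
        "Token": "".join(morphemes),         # surface form
        "Segmentation": "+".join(morphemes)  # gold-standard morpheme sequence
    }

print(harmonize(["ال", "فجر"]))
# {'Token': 'الفجر', 'Segmentation': 'ال+فجر'}
```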
Relation to the Companion Data Paper. The Noor–Ghateh XML schema, attribute inventory, and detailed annotation methodology are fully documented in a separate JOHD data paper. The present discussion paper therefore restricts itself to a high-level description of Noor–Ghateh and focuses on the cross-domain evaluation framework and findings.
3.2 Systems Under Evaluation
Three widely used Arabic morphological analyzers were selected based on accessibility, popularity, and linguistic coverage:
Farasa. Farasa (Abdelali et al., 2016) is a high-speed segmenter based on a rank-SVM model trained on the Penn Arabic Treebank. It is optimized for MSA newswire and has been deployed in information retrieval and machine translation pipelines.
CAMeL Tools. CAMeL Tools (Obeid et al., 2020) is an open-source Python toolkit that integrates tokenization, morphological analysis, disambiguation, POS tagging, sentiment analysis, and named entity recognition. In this study, we use its deep morphological analyzer with the CALIMA-MSA database and default clitic-segmentation settings.
ALP. ALP (Freihat et al., 2018) is a single-model pipeline for segmentation, POS tagging, and named-entity recognition built on top of OpenNLP. It employs a maximum entropy POS tagger and draws on training data from the Aljazeera and Altibbi corpora, making it competitive in both formal and domain-specific settings.
All systems were run using their publicly available implementations and default configurations, to reflect realistic research usage. Where necessary, only input normalization (Section 3.5) was applied, without altering the internal behavior of the analyzers.
3.3 Evaluation Metric
Each analyzer was evaluated on each dataset using a unified segmentation alignment script. Because the analyzers differ in tokenization conventions and clitic attachment, the script ensures that each predicted morpheme sequence is aligned and compared directly with the corresponding gold-standard segmentation.
The primary evaluation metric is:
Morpheme-level Segmentation Accuracy: the proportion of correctly predicted morphemes relative to the total number of gold-standard morphemes. Accuracy is computed separately for prefixes, stems, and suffixes, and then aggregated across all tokens. This component-based metric captures how precisely each system reproduces gold morphological boundaries within tokens.
In addition, we report token-level exact-match accuracy, where a token is counted as correct only if its entire morpheme sequence exactly matches the gold segmentation. This complementary metric provides a stricter view of segmentation performance across domains, and the corresponding results are reported in Table 10.
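To make the two metrics concrete, the sketch below computes morpheme-level accuracy and token-level exact match from gold and predicted segmentations in the "+"-delimited form used throughout this paper. It is an illustrative re-implementation under stated assumptions (positional NULL padding as described in Section 3.5), not the released evaluation script.

```python
NULL = "<NULL>"

def pad(gold: list[str], pred: list[str]) -> tuple[list[str], list[str]]:
    """Positional alignment: pad the shorter morpheme list with NULL (Section 3.5)."""
    width = max(len(gold), len(pred))
    return gold + [NULL] * (width - len(gold)), pred + [NULL] * (width - len(pred))

def evaluate(gold_segs: list[str], pred_segs: list[str]) -> dict[str, float]:
    correct = total = exact = 0
    for g_seg, p_seg in zip(gold_segs, pred_segs):
        g, p = pad(g_seg.split("+"), p_seg.split("+"))
        correct += sum(gm == pm for gm, pm in zip(g, p) if gm != NULL)
        total += sum(gm != NULL for gm in g)   # count gold morphemes only
        exact += int(g_seg == p_seg)           # token-level exact match
    return {"morpheme_acc": correct / total,
            "token_exact": exact / len(gold_segs)}

# Toy example using canonical forms from Table 8 (the predictions are invented).
gold = ["ل+يستغفر+وا", "ف+ان+هم", "او+ب+ال+حق"]
pred = ["ل+يستغفر+وا", "فان+هم", "او+ب+ال+حق"]
print(evaluate(gold, pred))   # {'morpheme_acc': 0.7, 'token_exact': 0.666...}
```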
3.4 Segmentation Schemes and Clitic Representation
The three evaluated corpora—NAFIS (Modern Standard Arabic), the Qur’anic Corpus (Classical Arabic), and Noor–Ghateh (Classical Hadith Arabic)—all segment orthographic words into clitic-like morphemes. However, they differ substantially in (i) segmentation depth, (ii) consistency of prefix–stem–suffix structure, and (iii) treatment of multi-clitic stacks.
Table 5 provides a compressed comparison of how major Arabic clitics are represented across the three datasets.
Table 5
Compressed comparison of clitic representation across datasets.
| CLITIC | TYPE | FUNCTION | NAFIS | QUR’ANIC CORPUS | NOOR–GHATEH |
|---|---|---|---|---|---|
| و | Conjunction | “and” | Proclitic: و+قال | Split or attached morphologically | Always separated: و+يفعلون |
| ف | Conj./resultative | “then/so” | ف+قال | Attached in token; split in morphology | Explicit split: ف+إنهم |
| ب/ك/ل | Prepositional proclitics | “with/by”, “as”, “for” | ب+ال+بيت | Separated morphologically | ب+كلمة |
| ال | Definite article | Definiteness marking | Separated after preps | Morphological morpheme | Always explicit |
| Pronominal suffixes | Enclitics | Possession/object | كتاب+هم | Morphologically split | قول+كم |
| Multi-clitic stacks | Agglutinative forms | Complex tokens | Rare | Frequent in classical text | Dense stacks in legal style |
These systematic differences necessitated a unified normalization and alignment procedure, described in the following subsection.
3.5 Normalization and Alignment Pipeline
Because the three corpora differ in tokenization depth, clitic treatment, and orthographic conventions, we implemented a deterministic normalization and alignment pipeline to ensure comparability across domains and analyzers.
Orthographic Normalization (Gold and Input). All corpora were first normalized using a shared set of rules, including:
Mapping of all alif variants {أ، إ، آ} to bare alif (ا);
Removal of Arabic diacritics, including shadda and tanwīn;
Removal of tatwīl (ـ) and other non-semantic marks;
Unicode normalization and cleanup of non-Arabic control characters;
Standardization of punctuation spacing and whitespace collapsing.
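These rules can be approximated with a handful of substitutions; the minimal sketch below illustrates them (the character ranges and helper name are assumptions, not the released normalization script).

```python
import re
import unicodedata

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanwin, harakat, shadda, sukun, dagger alif
TATWEEL = "\u0640"                                 # tatwil character

def normalize(text: str) -> str:
    """Approximate the orthographic normalization rules listed above."""
    text = unicodedata.normalize("NFKC", text)     # Unicode cleanup
    text = re.sub(r"[أإآ]", "ا", text)              # map alif variants to bare alif
    text = DIACRITICS.sub("", text)                 # strip diacritics incl. shadda/tanwin
    text = text.replace(TATWEEL, "")                # remove tatwil
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(normalize("إيمانهم"))   # -> ايمانهم (cf. Table 6)
```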
Representative normalization examples are shown in Table 6.
Table 6
Representative orthographic normalization examples.
| ORIGINAL | NORMALIZED | NOTE |
|---|---|---|
| إيمانهم | ايمانهم | Hamzated alif unified |
| مسئولية | مسؤولية | Hamza normalization |
| هديٰهم | هداهم | Dagger alif removed |
Segmentation Normalization (Gold). Gold-standard segmentations were mapped into a unified prefix–stem–suffix representation: dataset-specific conventions were harmonized by explicitly separating all proclitics and enclitics. Examples of clitic-boundary harmonization are provided in Table 7.
Table 7
Representative clitic-boundary harmonization.
| TOKEN | ALIGNED | NOTE |
|---|---|---|
| وبالكتاب | و+ب+ال+كتاب | Standardized proclitics |
| فزادهم | ف+زاد+هم | Qur’anic ف split |
| تجارتهم | تجارت+هم | Unified enclitic rule |
Canonical Token Mapping. After boundary normalization, tokens were mapped to a canonical segmentation form. Table 8 shows one representative example per domain.
Table 8
Canonical segmentation examples across domains.
| SURFACE | CANONICAL | DOMAIN |
|---|---|---|
| ليستغفروا | ل+يستغفر+وا | Qur’anic |
| فانهم | ف+ان+هم | Noor–Ghateh |
| اوبالحق | او+ب+ال+حق | NAFIS |
Gold–Prediction Alignment. Gold and predicted morphemes were aligned by splitting on the “+” symbol. When systems over- or under-segmented a token, NULL placeholders were inserted to preserve positional correspondence. Representative alignment outcomes are shown in Table 9.
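As a simple illustration of this procedure, the sketch below splits both segmentations on "+" and pads the shorter side with NULL placeholders; the prediction is invented, and the released script may resolve over- and under-segmentation differently.

```python
def align(gold_seg: str, pred_seg: str, null: str = "<NULL>"):
    """Positionally align gold and predicted morphemes of one token."""
    gold, pred = gold_seg.split("+"), pred_seg.split("+")
    width = max(len(gold), len(pred))
    gold += [null] * (width - len(gold))
    pred += [null] * (width - len(pred))
    return list(zip(gold, pred))

# Invented under-segmentation of the Table 7 example و+ب+ال+كتاب:
# the prediction has three morphemes, so the final gold position pairs with NULL.
for g, p in align("و+ب+ال+كتاب", "وب+ال+كتاب"):
    print(f"gold={g}\tpred={p}")
```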
3.6 Cross-Domain Comparative Framework
To summarize domain sensitivity, we measure the relative performance difference between Modern Standard Arabic (MSA, represented by NAFIS) and the two Classical domains. The decline in accuracy for a given domain is defined as:

Δ_domain = ACC_MSA − ACC_domain,

where ACC_MSA denotes token-level exact-match accuracy on NAFIS, and ACC_domain is the corresponding score on either Noor–Ghateh or the Qur’anic corpus. Negative values indicate improved performance on Classical text relative to MSA.
In addition to this scalar summary, we qualitatively analyze common error types (e.g., clitic boundary errors, suffix attachment errors, and misanalysis of rare Classical forms) in Section 4, where we present detailed error categories and their distribution across domains.
3.7 Reproducibility Materials
To ensure complete transparency and reproducibility, all resources required to replicate the benchmark will be released upon publication:
Segmentation alignment and normalization scripts;
Configuration files and command-line parameters for Farasa (Abdelali et al., 2016), CAMeL Tools, and ALP;
Token-level evaluation subsets extracted from each corpus, after normalization;
Example analyzer outputs for each domain, aligned with gold segmentations.
These materials will be deposited in a public repository (Zenodo/GitHub) with a persistent DOI, enabling independent verification and facilitating downstream reuse of the evaluation framework.
4 Results and Discussion
This section reports the quantitative and qualitative findings of the tri-domain evaluation, focusing on how morphological analyzers respond to variation across Modern Standard Arabic (MSA), Classical Hadith Arabic, and Scriptural Arabic. Results are presented at both the token and morpheme levels to provide a comprehensive view of system behavior.
4.1 Token-Level Segmentation Accuracy
Table 10 presents exact-match token-level accuracy. All three analyzers perform substantially better on the Qur’anic and Hadith datasets than on NAFIS. This pattern reflects the comparatively stable morphological structures of classical texts, where clitic attachment is more predictable and lexical borrowings are minimal. In contrast, MSA contains greater lexical diversity, modern derivations, and non-canonical clitic structures, which together introduce ambiguity for segmentation systems.
Table 10
Token-level segmentation accuracy (exact match).
| DATASET | FARASA | CAMEL | ALP |
|---|---|---|---|
| NAFIS (MSA) | 0.59 | 0.68 | 0.65 |
| Qur’anic Corpus | 0.76 | 0.80 | 0.81 |
| Noor–Ghateh (Hadith) | 0.81 | 0.81 | 0.79 |
Token-level accuracy is intentionally reported alongside the morpheme-level results that follow. Token-level scores highlight holistic segmentation success, while morpheme-level accuracy (Tables 13–15) captures fine-grained clitic and stem boundary behavior.
4.2 Domain Sensitivity
Domain sensitivity is measured as the difference between MSA performance and performance on each classical dataset. Negative values correspond to improved accuracy in classical Arabic. The measured domain sensitivity values and relative performance improvements for each analyzer are reported in Table 11.
Table 11
Domain sensitivity (Δdomain) and relative improvements.
| ANALYZER | ΔQuran | ΔHadith | IMPROVEMENT (%) |
|---|---|---|---|
| Farasa | –0.17 | –0.22 | 37.3 |
| CAMeL | –0.12 | –0.13 | 19.1 |
| ALP | –0.16 | –0.14 | 21.5 |
All systems improve markedly on classical data. These gains should be interpreted cautiously, as classical Arabic is morphologically regular but lexically distinct; predictable templatic patterns can ease segmentation even when the lexicon differs from modern usage.
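As a worked check, the snippet below recomputes Δ_domain and the improvement column from the Table 10 exact-match scores; the improvement percentage appears to be the relative gain on Noor–Ghateh over NAFIS, which reproduces the Table 11 values (this interpretation is our assumption).

```python
# Token-level exact-match accuracy from Table 10.
acc = {
    "Farasa": {"MSA": 0.59, "Quran": 0.76, "Hadith": 0.81},
    "CAMeL":  {"MSA": 0.68, "Quran": 0.80, "Hadith": 0.81},
    "ALP":    {"MSA": 0.65, "Quran": 0.81, "Hadith": 0.79},
}

for system, a in acc.items():
    d_quran = a["MSA"] - a["Quran"]                        # Δ_Quran
    d_hadith = a["MSA"] - a["Hadith"]                      # Δ_Hadith
    improvement = 100 * (a["Hadith"] - a["MSA"]) / a["MSA"]
    print(f"{system}: ΔQuran={d_quran:+.2f}  ΔHadith={d_hadith:+.2f}  "
          f"improvement={improvement:.1f}%")
# Farasa: ΔQuran=-0.17  ΔHadith=-0.22  improvement=37.3%
```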
4.3 Component-Level Results
Table 12 reports prefix, stem, and suffix accuracy. Prefix segmentation is consistently strong due to the limited inventory of Arabic proclitics. Suffix segmentation is more difficult, reflecting the richness of Arabic pronominal and inflectional endings. Stem accuracy shows the greatest domain-driven variation, especially for Farasa on NAFIS, where non-canonical MSA forms and loanwords are more common.
Table 12
Per-component segmentation accuracy across domains.
| COMPONENT/DATASET | FARASA | CAMEL | ALP |
|---|---|---|---|
| Prefix | |||
| NAFIS | 0.689 | 0.892 | 0.865 |
| Qur’anic | 0.894 | 0.939 | 0.939 |
| Noor–Ghateh | 0.776 | 0.835 | 0.812 |
| Stem | |||
| NAFIS | 0.189 | 0.730 | 0.811 |
| Qur’anic | 0.879 | 0.970 | 0.936 |
| Noor–Ghateh | 0.552 | 0.601 | 0.583 |
| Suffix | |||
| NAFIS | 0.270 | 0.838 | 0.973 |
| Qur’anic | 0.813 | 0.939 | 0.939 |
| Noor–Ghateh | 0.628 | 0.742 | 0.701 |
Stem errors in classical datasets frequently involve templatic forms such as Form VIII and Form X verbs, hollow verbs, or quadriliteral roots. These patterns appear frequently in the Qur’anic and Hadith corpora and can diverge from analyzers’ MSA-trained lexicons.
4.4 Uncertainty and Statistical Significance
Tables 13–15 show that bootstrap confidence intervals are narrow for large datasets (e.g., the Qur’anic subset) and wider for small datasets (e.g., NAFIS), as expected. All statistical comparisons use paired Wilcoxon signed-rank tests applied only to tokens for which both gold and predicted segmentations could be aligned.
Table 13
Morpheme-level segmentation accuracy and 95% confidence intervals on the Noor–Ghateh dataset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.563 | [0.508, 0.617] | 0.028 | 311 |
| CAMeL | 0.634 | [0.579, 0.688] | 0.027 | 311 |
| ALP | 0.374 | [0.321, 0.432] | 0.028 | 308 |
Table 14
Morpheme-level segmentation accuracy and 95% confidence intervals on the Qur’anic subset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.785 | [0.776, 0.795] | 0.005 | 6,839 |
| CAMeL | 0.826 | [0.817, 0.835] | 0.005 | 6,837 |
| ALP | 0.840 | [0.832, 0.849] | 0.004 | 6,828 |
Table 15
Morpheme-level segmentation accuracy and 95% confidence intervals on the NAFIS (MSA) dataset.
| SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS |
|---|---|---|---|---|
| Farasa | 0.667 | [0.585, 0.741] | 0.040 | 135 |
| CAMeL | 0.793 | [0.726, 0.859] | 0.035 | 135 |
| ALP | 0.830 | [0.763, 0.889] | 0.032 | 135 |
ALP significantly outperforms CAMeL and Farasa on the Qur’anic corpus (p < 0.01), and CAMeL significantly outperforms Farasa across all domains. These results confirm that observed differences reflect genuine model behavior rather than sampling noise.
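For readers who want to reproduce the uncertainty estimates, the sketch below shows one standard way to obtain a percentile bootstrap confidence interval over per-token scores and a paired Wilcoxon signed-rank test between two systems. It is a generic illustration, not the released analysis code; the resampling parameters and the invented per-token vectors are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean per-token accuracy.
    `scores` is a 0/1 vector: 1 if the token was segmented correctly."""
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    return scores.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Invented per-token correctness vectors for two systems on the same tokens;
# in the study these come from the alignment script of Section 3.5.
sys_a = (rng.random(300) < 0.83).astype(int)
sys_b = (rng.random(300) < 0.79).astype(int)

acc, lo, hi = bootstrap_ci(sys_a)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

# Paired Wilcoxon signed-rank test on per-token score differences.
stat, p = wilcoxon(sys_a, sys_b, zero_method="wilcox")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.4f}")
```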
4.5 Clitic Density
Accuracy decreases as the number of attached clitics increases. Farasa shows the strongest decline, consistent with its optimization for ATB-style segmentation. CAMeL and ALP remain relatively stable across clitic counts, suggesting better modeling of multi-morpheme tokens.
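A simple way to reproduce this breakdown is to bin tokens by the number of gold clitics (non-stem morphemes) and average token-level correctness within each bin; the sketch below is illustrative, and the clitic-count definition is an assumption based on the prefix/stem/suffix convention used here.

```python
from collections import defaultdict

def clitic_count(gold_seg: str) -> int:
    """Number of non-stem morphemes, assuming exactly one stem per token
    in the prefix* + stem + suffix* convention used here."""
    return max(len(gold_seg.split("+")) - 1, 0)

def accuracy_by_clitic_count(gold_segs, pred_segs):
    buckets = defaultdict(list)
    for g, p in zip(gold_segs, pred_segs):
        buckets[clitic_count(g)].append(int(g == p))   # token-level exact match
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

# Toy example with invented predictions:
gold = ["فجر", "ال+فجر", "ف+ان+هم", "و+ب+ال+كتاب"]
pred = ["فجر", "الفجر", "ف+ان+هم", "و+ب+ال+كتاب"]
print(accuracy_by_clitic_count(gold, pred))   # {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0}
```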
4.6 Error Analysis
Common error classes include:
Clitic boundary errors involving common proclitics (e.g., و+, ف+, ب+)
Archaic or templatic forms absent from modern lexicons
Named entities, especially foreign-origin names
Multi-clitic sequences, which trigger cascading boundary errors
Misclassification patterns (Table 16) show that the most common error is over-splitting of prefixes or under-identification of suffixes—typical behaviors of systems trained primarily on MSA.
4.7 Reuse Potential
The multi-domain nature of this benchmark allows controlled comparison across linguistically distinct registers. Beyond tool evaluation, the resources support:
Extension to new corpora following the unified segmentation schema.
Comparative studies of templatic morphology across genres.
All reproducibility materials, including code and gold subsets, will be publicly released to facilitate such extensions.
4.8 Summary of Findings
Across domains, CAMeL achieves the highest overall accuracy, particularly on the Qur’anic corpus. ALP performs competitively, especially on suffix segmentation, while Farasa shows the greatest variability and is most sensitive to clitic density and non-MSA lexical forms. The Noor–Ghateh dataset reveals fundamental challenges associated with classical morphology and provides a strong benchmark for future research.
5 Implications and Applications
The results of this study offer methodological, technological, and linguistic insights with direct relevance to the development and evaluation of Arabic morphological analysis systems. By combining a tri-domain experimental design with transparent normalization and alignment procedures, this work demonstrates how domain variation shapes segmentation accuracy and highlights the practical importance of evaluating systems beyond a single register of Arabic.
5.1 Methodological Implications
The findings underscore the value of multi-domain benchmarking for morphologically rich languages. Evaluating systems only on Modern Standard Arabic risks masking systematic weaknesses that become salient in classical or scriptural text. The integration of uncertainty quantification through bootstrap confidence intervals and paired Wilcoxon tests further illustrates how statistical rigor can clarify the stability of performance estimates, especially when dataset sizes differ substantially.
The alignment and normalization procedures developed for this study—particularly the unified prefix–stem–suffix representation—provide a reusable protocol for future evaluations. As morphological analyzers vary in their tokenization conventions, explicit alignment ensures that comparative studies remain fair and replicable. Making these resources publicly available strengthens reproducibility and provides a methodological template for future domain-sensitivity research.
5.2 Applications in AI and Language Technology
The Noor–Ghateh dataset contributes a linguistically rich and manually validated resource for training and evaluating morphological models. Its fine-grained segmentation and extensive coverage of classical jurisprudential text make it suitable for:
supervised training of neural and hybrid models (e.g., BiLSTM, Transformer, BERT-based systems);
improving rule-based analyzers requiring explicit affix and templatic structure annotation;
diagnostic evaluation of clitic handling, stem recognition, and orthographic normalization.
The error patterns revealed in this study—especially failures on multi-clitic tokens and archaic templatic forms—can guide the design of more morphology-aware architectures. In practical applications such as information extraction, semantic search, and text classification for classical Arabic sources, such improvements are essential for model reliability.
5.3 Linguistic and Educational Applications
Beyond computational settings, Noor–Ghateh offers a structured resource for linguistic inquiry and pedagogy. Its explicit annotation of roots, patterns, clitics, and grammatical functions enables:
corpus-based instruction in Arabic morphology and syntax,
empirical studies of templatic variation in Classical Arabic,
comparative analyses across Quranic, Hadith, and Modern Standard Arabic,
visualization of morphological segmentation for teaching purposes.
The dataset’s coverage of 52 thematic chapters further supports investigations of intra-domain linguistic variation, including how morphological patterns correlate with genre, topic, and stylistic register.
5.4 Future Directions
The methodological framework established here can be extended to additional Arabic domains, including dialectal corpora, historical prose, and contemporary web text. Future work may explore domain-adaptive pretraining and fine-tuning strategies to reduce sensitivity to genre and register. Integrating Noor–Ghateh with established benchmarks such as PATB (Maamouri et al., 2004), SAMA (Maamouri et al., 2010), and NAFIS would help consolidate a unified evaluation suite for Arabic morphological analysis.
In summary, this study connects resource creation with systematic evaluation by introducing a reproducible framework for analyzing domain sensitivity in Arabic morphological tools. The Noor–Ghateh dataset not only enriches the landscape of Classical Arabic resources but also provides a diagnostic lens for building more robust, domain-aware Arabic NLP systems.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Huda AlShuhayeb: Conceptualization; Methodology; Data curation; Formal analysis; Validation; Visualization; Writing – original draft; Writing – review & editing.
Dr. Behrouz Minaei-Bidgoli: Supervision; Methodology; Conceptualization; Writing – review & editing.
Dr. Sayyed-Ali Hossayni: Supervision; Project administration; Resources; Validation; Writing – review & editing.
