Domain Sensitivity in Arabic Morphological Analysis: A Multi-Corpus Evaluation of Farasa, CAMeL, and ALP Across Modern, Classical Religious, and Classical Jurisprudential Domains

Open Access | Jan 2026

Figures & Tables

Table 1

Overall statistics of the Noor-Ghateh dataset.

PROPERTY | COUNT
Sentences | 10,160
Word tokens (annotated) | 223,690
Unique lemmas | 16,420
Unique roots | 8,150
Table 2

Morpheme-level composition in the Noor-Ghateh dataset.

MORPHEME TYPE | OCCURRENCES
Prefix morphemes | 74,242
Suffix morphemes | 18,617
Average morphemes per word | 1.42
Estimated clitic density (per 100 tokens) | 33.1
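
The average in Table 2 can be cross-checked against the counts in Tables 1 and 2. The sketch below assumes that every annotated word token contributes exactly one stem plus its prefix and suffix morphemes; the source does not state this formula, and the function name is hypothetical. Under that assumption the result rounds to the reported 1.42.

```python
# Counts copied from Tables 1 and 2.
WORD_TOKENS = 223_690
PREFIX_MORPHEMES = 74_242
SUFFIX_MORPHEMES = 18_617


def average_morphemes_per_word(tokens: int, prefixes: int, suffixes: int) -> float:
    """Assumption (not from the source): one stem per token plus its affixes."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return (tokens + prefixes + suffixes) / tokens


average = average_morphemes_per_word(WORD_TOKENS, PREFIX_MORPHEMES, SUFFIX_MORPHEMES)
print(f"{average:.2f}")
```
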
Figure 1

Sample XML annotation from the Noor–Ghateh dataset.

Table 3

Annotation schema in the Noor-Ghateh dataset.

FIELD | DESCRIPTION
Seq | Sequential index of the word or morpheme within its sentence
Slice | Surface form of the token as it appears in the original text
Entry | Canonical or normalized form of the morpheme
Affix | Morphological category (prefix, stem, or suffix)
POS | Part-of-speech tag (e.g., noun, verb, particle)
Case | Grammatical case or syntactic role
Kol | Functional classification (e.g., conjunction, preposition)
Lemma | Base lemma associated with the stem
Categ | Morphological pattern type (e.g., triliteral root pattern)
DervT | Derivational type (e.g., adjective, participle)
Num | Grammatical number (singular, dual, plural)
Root | Underlying triliteral or quadriliteral Arabic root
Table 4

Dataset Overview and Linguistic Register.

DATASET | REGISTER | KEY CHARACTERISTICS
NAFIS | Modern Standard Arabic | High morphological ambiguity; MSA gold standard
Qur’anic Corpus | Classical Scriptural Arabic | Conservative segmentation; historical orthography
Noor–Ghateh | Classical Jurisprudential Arabic | Dense clitic stacking; templatic morphology
Table 5

Compressed comparison of clitic representation across datasets.

CLITIC | TYPE | FUNCTION | NAFIS | QUR’ANIC CORPUS | NOOR–GHATEH
و | Conjunction | “and” | Proclitic: و+قال | Split or attached morphologically | Always separated: و+يفعلون
ف | Conj./resultative | “then/so” | ف+قال | Attached in token; split in morphology | Explicit split: ف+إنهم
ب/ك/ل | Prepositional proclitics | “with/by”, “as”, “for” | ب+ال+بيت | Separated morphologically | ب+كلمة
ال | Definite article | Definiteness marking | Separated after preps | Morphological morpheme | Always explicit
Pronominal suffixes | Enclitics | Possession/object | كتاب+هم | Morphologically split | قول+كم
Multi-clitic stacks | Agglutinative forms | Complex tokens | Rare | Frequent in classical text | Dense stacks in legal style
Table 6

Representative orthographic normalization examples.

ORIGINAL | NORMALIZED | NOTE
إيمانهم | ايمانهم | Hamzated alif unified
مسئولية | مسؤولية | Hamza normalization
هديٰهم | هداهم | Dagger alif removed
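
Two of the three rules in Table 6 are purely character-level and can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' normalizer. The function and constant names are hypothetical. Only alif with hamza above or below is unified. A ya or alif maqsura carrying a dagger alif is rewritten as a plain alif, which matches the third row. The hamza-seat change in the second row depends on the word and is left out.

```python
import re
import unicodedata

ALIF = "\u0627"                      # bare alif
HAMZATED_ALIFS = "\u0623\u0625"      # alif with hamza above, alif with hamza below
DAGGER_ALIF = "\u0670"               # superscript (dagger) alif
# Ya or alif maqsura followed by a dagger alif, as in the third row of Table 6.
YA_WITH_DAGGER = re.compile("[\u0649\u064a]" + DAGGER_ALIF)


def normalize(token: str) -> str:
    """Apply the character-level rules illustrated in Table 6 (a sketch)."""
    token = unicodedata.normalize("NFC", token)   # compose any decomposed hamza forms
    token = YA_WITH_DAGGER.sub(ALIF, token)       # ya + dagger alif becomes alif
    token = token.replace(DAGGER_ALIF, "")        # drop any remaining dagger alif
    for hamzated in HAMZATED_ALIFS:
        token = token.replace(hamzated, ALIF)     # unify hamzated alif
    return token


print(normalize("\u0625\u064a\u0645\u0627\u0646\u0647\u0645"))
```

Folding ya with a dagger alif into a plain alif is a design choice made to reproduce the table row; a production normalizer may prefer to keep the Qur’anic spelling and strip only the diacritic.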
Table 7

Representative clitic-boundary harmonization.

TOKEN | ALIGNED | NOTE
وبالكتاب | و+ب+ال+كتاب | Standardized proclitics
فزادهم | ف+زاد+هم | Qur’anic ف split
تجارتهم | تجارت+هم | Unified enclitic rule
Table 8

Canonical segmentation examples across domains.

SURFACE | CANONICAL | DOMAIN
ليستغفروا | ل+يستغفر+وا | Qur’anic
فانهم | ف+ان+هم | Noor–Ghateh
اوبالحق | او+ب+ال+حق | NAFIS
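
The canonical forms in Table 8 use "+" as the morpheme delimiter, so a simple sanity check is that the morphemes concatenate back to the surface token. The sketch below shows that check on the three rows of the table; the helper names are hypothetical and the source does not describe such a validation step.

```python
def split_canonical(canonical: str) -> list:
    """Split a '+'-delimited canonical segmentation into its morphemes."""
    return canonical.split("+")


def matches_surface(surface: str, canonical: str) -> bool:
    """True when the morphemes concatenate back to the surface token."""
    return "".join(split_canonical(canonical)) == surface


# (surface, canonical) pairs copied from Table 8.
EXAMPLES = [
    ("ليستغفروا", "ل+يستغفر+وا"),
    ("فانهم", "ف+ان+هم"),
    ("اوبالحق", "او+ب+ال+حق"),
]

all_consistent = all(matches_surface(s, c) for s, c in EXAMPLES)
print(all_consistent)
```

A check of this kind only holds after orthographic normalization: a segmentation written with a hamzated alif would not concatenate to a normalized surface form.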
Table 9

Representative gold–prediction alignment outcomes.

TOKEN | GOLD | PREDICTION | OUTCOME
فاستغفروا | ف+استغفر+وا | CAMeL correct | Perfect match
وكتابهم | و+كتاب+هم | Farasa: و+كتابهم | Under-segmentation (1 error)
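
One way to arrive at outcomes like those in Table 9 is to compare the character offsets of the morpheme boundaries in the gold and predicted segmentations. This is a minimal sketch assuming that an "error" is a missed or spurious boundary. The source does not spell out its error definition, and the "Over-segmentation" and "Mixed" labels are hypothetical additions for completeness.

```python
def boundaries(segmentation: str) -> set:
    """Character offsets in the unsegmented token where a '+' boundary falls."""
    offsets, position = set(), 0
    for morpheme in segmentation.split("+")[:-1]:
        position += len(morpheme)
        offsets.add(position)
    return offsets


def compare(gold: str, predicted: str) -> tuple:
    """Return (outcome label, number of boundary errors)."""
    gold_b, pred_b = boundaries(gold), boundaries(predicted)
    missed = len(gold_b - pred_b)      # gold boundaries the analyzer did not produce
    spurious = len(pred_b - gold_b)    # predicted boundaries absent from the gold
    if missed == 0 and spurious == 0:
        outcome = "Perfect match"
    elif spurious == 0:
        outcome = "Under-segmentation"
    elif missed == 0:
        outcome = "Over-segmentation"  # hypothetical label, not in Table 9
    else:
        outcome = "Mixed"              # hypothetical label, not in Table 9
    return outcome, missed + spurious


print(compare("و+كتاب+هم", "و+كتابهم"))
```
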
Table 10

Token-level segmentation accuracy (exact match).

DATASET | FARASA | CAMEL | ALP
NAFIS (MSA) | 0.59 | 0.68 | 0.65
Qur’anic Corpus | 0.76 | 0.80 | 0.81
Noor–Ghateh (Hadith) | 0.81 | 0.81 | 0.79
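
Token-level exact match, the metric named in the caption of Table 10, counts a token as correct only when the whole predicted segmentation equals the gold one. The sketch below illustrates the metric on the two tokens of Table 9; the function name is hypothetical and the two-token list is far too small to say anything about the analyzers.

```python
def exact_match_accuracy(gold: list, predicted: list) -> float:
    """Fraction of tokens whose predicted segmentation equals the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one token is required")
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)


# Gold and predicted segmentations taken from the rows of Table 9.
gold = ["ف+استغفر+وا", "و+كتاب+هم"]
predicted = ["ف+استغفر+وا", "و+كتابهم"]

score = exact_match_accuracy(gold, predicted)
print(score)
```
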
Table 11

Domain sensitivity (Δdomain) and relative improvements.

ANALYZER | ΔQuran | ΔHadith | IMPROVEMENT (%)
Farasa | -0.17 | -0.22 | 37.3
CAMeL | -0.12 | -0.13 | 19.1
ALP | -0.16 | -0.14 | 21.5
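
The values in Table 11 are consistent with the token-level accuracies in Table 10 under a simple reading: each Δ is the MSA accuracy minus the domain accuracy, and the improvement is the relative gain of the Noor–Ghateh (Hadith) score over the MSA score. The sketch below reproduces the table under that reading. The formulas are inferred from the numbers rather than quoted from the source, and the names are hypothetical.

```python
# Token-level exact-match accuracies copied from Table 10.
ACCURACY = {
    "Farasa": {"NAFIS": 0.59, "Quran": 0.76, "Hadith": 0.81},
    "CAMeL": {"NAFIS": 0.68, "Quran": 0.80, "Hadith": 0.81},
    "ALP": {"NAFIS": 0.65, "Quran": 0.81, "Hadith": 0.79},
}


def domain_deltas(scores: dict) -> tuple:
    """Return (delta_quran, delta_hadith, improvement_percent) for one analyzer."""
    msa = scores["NAFIS"]
    delta_quran = msa - scores["Quran"]
    delta_hadith = msa - scores["Hadith"]
    improvement = 100 * (scores["Hadith"] - msa) / msa   # relative gain over MSA
    return round(delta_quran, 2), round(delta_hadith, 2), round(improvement, 1)


for analyzer, scores in ACCURACY.items():
    print(analyzer, domain_deltas(scores))
```

Under this sign convention a negative Δ means the analyzer scores higher on the classical domain than on MSA.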
Table 12

Per-component segmentation accuracy across domains.

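COMPONENT | DATASET | FARASA | CAMEL | ALP
Prefix | NAFIS | 0.689 | 0.892 | 0.865
Prefix | Qur’anic | 0.894 | 0.939 | 0.939
Prefix | Noor–Ghateh | 0.776 | 0.835 | 0.812
Stem | NAFIS | 0.189 | 0.730 | 0.811
Stem | Qur’anic | 0.879 | 0.970 | 0.936
Stem | Noor–Ghateh | 0.552 | 0.601 | 0.583
Suffix | NAFIS | 0.270 | 0.838 | 0.973
Suffix | Qur’anic | 0.813 | 0.939 | 0.939
Suffix | Noor–Ghateh | 0.628 | 0.742 | 0.701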
Table 13

Morpheme-level segmentation accuracy and 95% confidence intervals on the Noor–Ghateh dataset.

SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS
Farasa | 0.563 | [0.508, 0.617] | 0.028 | 311
CAMeL | 0.634 | [0.579, 0.688] | 0.027 | 311
ALP | 0.374 | [0.321, 0.432] | 0.028 | 308
Table 14

Morpheme-level segmentation accuracy and 95% confidence intervals on the Qur’anic subset.

SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS
Farasa | 0.785 | [0.776, 0.795] | 0.005 | 6,839
CAMeL | 0.826 | [0.817, 0.835] | 0.005 | 6,837
ALP | 0.840 | [0.832, 0.849] | 0.004 | 6,828
Table 15

Morpheme-level segmentation accuracy and 95% confidence intervals on the NAFIS (MSA) dataset.

SYSTEM | ACCURACY | 95% CI | STD. DEV. | N TOKENS
Farasa | 0.667 | [0.585, 0.741] | 0.040 | 135
CAMeL | 0.793 | [0.726, 0.859] | 0.035 | 135
ALP | 0.830 | [0.763, 0.889] | 0.032 | 135
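
Tables 13 to 15 do not say how the intervals were obtained. As a rough cross-check, the sketch below computes the binomial standard error and a normal-approximation (Wald) interval from an accuracy and a token count. This is an assumed method, not the authors' procedure, and the function name is hypothetical. For the Farasa row of Table 13 it gives a standard error that rounds to the reported 0.028 and bounds within about 0.001 of the reported interval, so the published figures may come from a different estimator such as a bootstrap.

```python
import math


def wald_interval(accuracy: float, n_tokens: int, z: float = 1.96) -> tuple:
    """Return (standard_error, lower, upper) for a proportion (normal approximation)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_tokens)
    return standard_error, accuracy - z * standard_error, accuracy + z * standard_error


# Farasa on Noor–Ghateh (Table 13): accuracy 0.563 over 311 tokens.
se, lower, upper = wald_interval(0.563, 311)
print(f"{se:.3f} [{lower:.3f}, {upper:.3f}]")
```
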
Table 16

Component-level misclassification patterns on the Noor–Ghateh dataset, reported as proportions of tokens.

SYSTEM | PREFIX→STEM | STEM→PREFIX | SUFFIX→STEM | OTHER
Farasa | 0.14 | 0.02 | 0.18 | 0.09
CAMeL | 0.09 | 0.01 | 0.11 | 0.06
ALP | 0.11 | 0.01 | 0.13 | 0.07
DOI: https://doi.org/10.5334/johd.418 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 18, 2025 | Accepted on: Jan 5, 2026 | Published on: Jan 30, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Behrouz Minaei-Bidgoli, Huda AlShuhayeb, Sayyed-Ali Hossayni, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.