
NoorGhateh: A Benchmark Dataset for Training and Evaluating Arabic Morphological Analysis Systems

Open Access
| Feb 2026

1 Overview

The NoorGhateh dataset is openly available via Zenodo (DOI: 10.5281/zenodo.18138582). This repository provides a curated subset of the corpus together with detailed documentation of the annotation schema and data structure, ensuring transparency and reproducibility.

1.1 Context

The NoorGhateh dataset was developed through a collaborative effort between researchers at the Computer Research Center for Islamic Sciences (Noor) and the School of Computer Engineering, Iran University of Science and Technology (IUST), under the supervision of Prof. Behrouz Minaei-Bidgoli.

The dataset is derived from Sharaye al-Islam, a classical Arabic jurisprudential text composed in the 13th century (7th century AH). As one of the foundational works in Islamic legal scholarship, this source provides a linguistically rich and domain-specific foundation, making the resulting corpus particularly valuable for Classical Arabic morphological analysis and related Natural Language Processing (NLP) tasks.

The complete annotated corpus comprises 223,690 manually segmented and morphologically annotated tokens. For public release, a subset of 313 words has been carefully selected to demonstrate the annotation schema, structural design, and range of linguistic features captured in the full dataset. This openly available subset enables researchers to examine the annotation methodology, test preprocessing workflows, and conduct small-scale benchmarking experiments. Access to the full dataset is provided upon reasonable request for academic research purposes.

The NoorGhateh dataset was designed to address the scarcity of high-quality, domain-specific morphological resources for Arabic, particularly those suited for training and evaluating AI-based linguistic systems. By providing a carefully curated, expert-annotated corpus, it offers a reliable gold-standard benchmark for assessing rule-based, machine-learning, and deep-learning morphological analyzers across a wide range of Arabic NLP applications.

2 Literature Review

2.1 Comparison of Arabic Morphological Datasets

A wide range of Arabic morphological corpora have been developed, differing in scale, linguistic variety, annotation depth, and topical domain. Table 1 situates the proposed Noor–Ghateh dataset within this landscape. The comparison includes major benchmarks such as the Penn Arabic Treebank (PATB), Standard Arabic Morphological Analyzer (SAMA) and Buckwalter Arabic Morphological Analyzer (BAMA), the Quranic Arabic Corpus, Zeroual’s Quranic corpus, the Prague Arabic Dependency Treebank, Qatar Arabic Language Bank (QALB), Tashkeela, Arabic Gigaword, and Masader.

Table 1

Comparison of major Arabic morphological datasets by size, variety, annotation scope, and domain.

DATASET | WORDS | VARIETY | ANNOTATION | DOMAIN
Noor-Ghateh Dataset (this work) | 223,690 | Classical Arabic (CA) | Morphological segmentation; part-of-speech (POS) tags; lemmas; roots; derivational patterns; clitic segmentation | Hadith/Jurisprudence
Penn Arabic Treebank (PATB) (Maamouri et al., 2004) | 37M | Modern Standard Arabic (MSA) | Tokenization; segmentation; POS tags; lemmas; diacritics; syntactic trees | News
BAMA/SAMA (Buckwalter, 2002; Maamouri et al., 2010) | N/A | MSA | Lexicon-based morphological feature bundles; stems; roots | General/Lexicon
Quranic Arabic Corpus (Leeds) (Dukes & Habash, 2010) | 77K | CA | Morphological segmentation; POS tags; dependency grammar; semantic ontology | Quran
Zeroual Quranic Corpus (Zeroual & Lakhouaja, 2016) | 1.3M | CA | Stems; patterns; lemmas; roots | Quran
Prague Arabic Dependency Treebank (Hajič et al., 2004) | 114K | MSA | Morphology; syntax; dependency relations | News
QALB Corpus (Habash et al., 2013) | 2M | MSA + Dialectal Arabic (DA) | Tokenization; POS tags; lemmatization; diacritization; error annotation | Essays/Web
Tashkeela (Zerrouki & Balla, 2017) | 75M | MSA + CA | Diacritization; morphological features; syntactic context | Mixed
Arabic Gigaword Fifth Edition (Parker et al., 2011) | 1.1B | MSA | Morphology; lexical features; syntactic parsing; named entities | News
Masader (Alyafeai et al., 2022) | N/A | MSA + DA | Dialect identification; named entity recognition (NER); sentiment; morphology | Multi-domain

As the comparison illustrates, existing corpora overwhelmingly focus on MSA or Qur’anic Classical Arabic. Very few resources cover Classical fiqh prose—a domain characterized by specialized terminology, dense grammatical constructions, and non-narrative syntax. The Noor–Ghateh dataset addresses this gap by offering a morphology-rich, domain-specific benchmark tailored to Islamic jurisprudence, thereby expanding the landscape of Classical Arabic NLP.

2.2 Existing Morphological Analyzers

A number of influential Arabic morphological analyzers have shaped the development of Arabic NLP over the past two decades. Among the most prominent systems are MADAMIRA, CAMeL Tools, Farasa, and the more recent Arabic Linguistic Pipeline (ALP) analyzer. Each system embodies a distinct methodological approach and has achieved strong performance on MSA, particularly on corpora derived from the Penn Arabic Treebank (PATB). However, despite their maturity, their performance declines considerably when applied to Classical Arabic—especially jurisprudential texts—due to fundamental differences in vocabulary, morphology, and syntactic style.

MADAMIRA (Pasha et al., 2014) integrates a hybrid design that combines the rule-based lexicons of BAMA/SAMA with statistical disambiguation models trained on Linguistic Data Consortium (LDC) resources. This architecture enables strong performance on MSA tasks such as segmentation, POS tagging, and lemmatization. Nonetheless, MADAMIRA’s dependence on modern lexicons and contemporary usage patterns limits its coverage of archaic stems, case marking conventions, and specialized legal terminology characteristic of fiqh literature.

CAMeL Tools (Obeid et al., 2020) provides a modular suite for segmentation, morphological analysis, and lemmatization. Its analyzer is grounded in extensive lexicons extracted primarily from PATB and in automatically induced inflectional paradigms, for example those relating surface forms such as kataba ‘he wrote’, yaktubu ‘he writes’, and kutiba ‘it was written’ to a shared morphological analysis.

While supporting multiple dialects and outperforming legacy systems on modern genres, CAMeL Tools encounters difficulties with Classical Arabic structures, including complex derivational patterns, ambiguous clitic sequences, and non-standard affixes common in pre-modern scholarly prose.

Farasa (Abdelali et al., 2016) adopts a lightweight, data-driven approach optimized for speed and ease of deployment. Its segmentation and POS-tagging modules rely on discriminative learning models that perform strongly on newswire and general-domain MSA. However, Farasa’s lexicon and training data were not designed for Classical Arabic, making the system less effective in handling the dense morphology, rare lexical forms, and deep syntactic cues required for correctly analyzing fiqh texts.

ALP (Freihat et al., 2018) represents a streamlined, rule-based morphological analyzer designed around a single-core engine. By relying on compact templatic rules rather than large lexicons or multi-stage disambiguation, ALP offers interpretable outputs and efficient processing. Nevertheless, the rule base of ALP is optimized for contemporary Arabic and does not fully capture the morphological complexity of Classical jurisprudential writing. As a result, it struggles with archaic affixes, uncommon derivational patterns, and case-governed constructions central to Classical syntax.

Collectively, these systems highlight a critical limitation in the Arabic NLP landscape: although Arabic morphological processing is well-established for MSA, there is a lack of high-quality, domain-adapted resources for Classical Arabic, particularly in fields such as fiqh. This gap underscores the importance of datasets such as Noor–Ghateh, whose fine-grained, manually curated annotations provide a robust foundation for evaluating and improving existing analyzers—and for developing new models capable of handling the linguistic richness of Classical jurisprudential texts.

2.3 Limitations of Existing Resources and Need for Domain-Specific Datasets

Although Arabic morphological analysis has benefited from decades of tool development and the availability of several modern corpora, most existing resources remain limited in their ability to support deep analysis of Classical Arabic, particularly in specialized scholarly domains such as fiqh. Tools like MADAMIRA, CAMeL Tools, Farasa, and ALP are primarily trained on MSA datasets—mainly the Penn Arabic Treebank—and therefore reflect the lexical, syntactic, and morphological patterns of contemporary usage rather than the structures typical of pre-modern texts. As a result, they frequently struggle with archaic vocabulary, rare derivational forms, complex affixation patterns, and case-governed constructions that play a central role in Classical Arabic writing.

Existing corpora provide valuable contributions, but none offers the depth, domain-specificity, or fine-grained morphological details required for precise analysis of jurisprudential literature. Jurisprudential texts feature dense scholarly terminology, nuanced syntactic dependencies, and morphological richness that exceed the representational capacity of many modern corpora. The lack of specialized resources leads to inconsistent lemma assignments, incorrect affix segmentation, inaccurate POS predictions, and reduced model transferability across domains.

This gap underscores the need for domain-adapted, manually curated datasets that reflect the linguistic and conceptual style of Islamic scholarly writing. A dataset designed specifically for fiqh can capture the recurrent morphological constructions, formulaic expressions, derivational regularities, and case-based syntactic cues essential for accurate Classical Arabic processing. Such datasets enable not only more reliable benchmarking of existing tools but also support the development and fine-tuning of new models capable of handling the linguistic complexity of this genre.

Within this context, Noor–Ghateh provides a critical contribution. Through its fine-grained, 15-attribute schema and manual verification process, it offers the level of morphological detail necessary for robust analysis of jurisprudential texts. The dataset fills a long-standing gap by supplying a gold-standard resource tailored to Classical Arabic, enabling researchers to evaluate domain sensitivity, perform detailed error analysis, and build more effective language models for pre-modern Islamic scholarly corpora.

3 Method

3.1 Steps

The Noor-Ghateh dataset was derived from Sharaye al-Islam, a classical Arabic jurisprudential text noted for its precise grammatical construction and rich morphological patterns. The source material was digitized, normalized, and tokenized under the supervision of linguistic experts at the Computer Research Center for Islamic Sciences (Noor).

The annotation process proceeded through several structured stages:

  1. Text normalization: Orthographic variations were unified, including the treatment of alif maqṣūra, alif mamdūda, and hamza forms. Inconsistent diacritics and punctuation were removed to ensure a uniform input representation.

  2. Tokenization: The text was segmented into tokens following linguistic rules consistent with Arabic morphological conventions.

  3. Manual segmentation: Human annotators identified clitic boundaries for each word, marking prefixes, suffixes, conjunctions, and prepositions.

  4. Morphological annotation: Each token was assigned its corresponding lemma, part of speech, affix type (prefix/suffix/infix), and associated morphological features such as gender, number, case, and tense.

  5. Verification and export: All annotations underwent review by a separate linguistic specialist before being exported in UTF-8 encoded CSV format.
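
Step 1 of the pipeline can be sketched as a small Unicode-level routine. The code below is only an illustrative assumption: the regex ranges and the hamza mapping are common choices in Arabic preprocessing, not the Noor Center's exact rules, and the treatment of alif maqṣūra and alif mamdūda is omitted here.

```python
import re

# Short vowels, tanwin, shadda, sukun (U+064B–U+0652), dagger alif
# (U+0670), and tatweel (U+0640) — the marks typically stripped during
# normalization. (Assumed ranges; the dataset's exact rules may differ.)
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(text: str) -> str:
    """Illustrative normalization: strip diacritics, unify hamza-bearing
    alif variants to bare alif."""
    text = DIACRITICS.sub("", text)
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # آ / أ / إ -> ا
    return text

print(normalize("الطَّهَارَة"))  # -> الطهارة
```

Running the function on a diacritized token such as الطَّهَارَة yields the undiacritized form الطهارة that appears in the released sample.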

The annotation platform and quality-assurance procedures were developed internally at Noor Center to ensure precision, transparency, and reproducibility. No external morphological analyzers were used during dataset construction; all analyses were performed manually by trained linguists.

An overview of the full dataset preparation and annotation workflow is illustrated in Figure 1. The diagram summarizes the major stages, beginning with the selection of a verified digital edition of Sharaye al-Islam, followed by normalization, lexical refinement, and the design of a segmentation-oriented tagging schema. It also visualizes the manual annotation process conducted using a custom tool developed at Noor Center, the supervisory validation phase, and the final export of the curated dataset to UTF-8 CSV format for open-access publication on Zenodo. This visual representation offers readers—particularly those unfamiliar with Arabic morphological workflows—an accessible summary of the methodological pipeline and the interdependence of its components.

Figure 1

Overview of the Noor-Ghateh dataset preparation workflow, illustrating the sequential stages from text selection and normalization to manual segmentation, verification, and final export.

3.2 Sampling Strategy

The publicly released subset of 313 words was selected using a stratified random sampling approach designed to preserve the linguistic diversity of the full 223,690-token corpus. The sample includes:

  • A balanced distribution of nouns, verbs, and particles;

  • Tokens exhibiting different clitic structures (from zero to three clitics);

  • Morphologically complex forms, including derived, inflected, and root-based patterns.

This strategy ensures that the open sample reflects the full dataset’s morphological richness and complexity while remaining compact enough for demonstration, educational purposes, and reproducibility-focused experimentation.
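
The stratified selection described above could be sketched as follows, assuming proportional allocation across strata keyed by linguistic properties such as POS (the paper does not specify the exact allocation scheme, so this is a sketch, not the authors' procedure):

```python
import random
from collections import defaultdict

def stratified_sample(tokens, key, k, seed=0):
    """Draw roughly k items while preserving each stratum's share.
    `tokens` is a list of annotation records; `key` maps a record to its
    stratum label (e.g. POS, or a (POS, clitic-count) pair)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in tokens:
        strata[key(t)].append(t)
    sample = []
    for label, members in strata.items():
        # Proportional allocation, at least one item per stratum.
        n = max(1, round(k * len(members) / len(tokens)))
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample

# Toy corpus with a 5:3:2 mix of nouns, verbs, and particles
toy = [{"pos": "NOUN"}] * 50 + [{"pos": "VERB"}] * 30 + [{"pos": "PART"}] * 20
picked = stratified_sample(toy, key=lambda t: t["pos"], k=10)
print(len(picked))  # -> 10
```

With proportional allocation, the toy sample of 10 contains 5 nouns, 3 verbs, and 2 particles, mirroring the corpus mix.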

3.3 Quality Control

Multiple layers of validation were implemented to guarantee the accuracy and internal consistency of the annotations:

  1. Double annotation: Each token was annotated independently by two trained annotators.

  2. Adjudication: Any disagreements were resolved through discussion under the supervision of a senior linguist.

  3. Normalization checks: Automated scripts ensured unified Unicode encoding for Arabic characters, hamza positions, and diacritic usage.

  4. Random verification: A random 10% subset of the exported CSV file was cross-checked against the internal annotation database to confirm consistency.

These procedures resulted in full agreement for segmentation boundaries and lemma assignments, ensuring that Noor-Ghateh functions as a reliable gold-standard reference dataset for Classical Arabic morphological research.
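
The spot check in step 4 can be sketched against the documented ID/Token/AnnotationXML CSV layout. The `db_lookup` callable and the toy values below are hypothetical; only the column names come from the dataset description.

```python
import csv
import io
import random

def spot_check(csv_text, db_lookup, fraction=0.10, seed=0):
    """Re-check a random fraction of exported rows against the internal
    annotation store. `db_lookup` maps a row ID to the expected
    AnnotationXML string; returns the IDs of any mismatching rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)
    picked = rng.sample(rows, max(1, int(len(rows) * fraction)))
    return [r["ID"] for r in picked if db_lookup(r["ID"]) != r["AnnotationXML"]]

# Toy export with one deliberately corrupted row
export = "ID,Token,AnnotationXML\n1,tok1,A1\n2,tok2,WRONG\n"
print(spot_check(export, {"1": "A1", "2": "A2"}.get, fraction=1.0))  # -> ['2']
```

An empty return list corresponds to the full consistency reported for the released data.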

The methodological steps detailed above produced a linguistically consistent and fully reproducible dataset that captures the morphological characteristics of Classical Arabic. The following section presents a comprehensive description of the Noor-Ghateh dataset—its metadata schema, file structure, and publication details—to illustrate how these methodological principles are reflected in the final resource.

The annotation work was conducted using a custom-built software environment developed at the Computer Research Center for Islamic Sciences (Noor). As shown in Figure 2, the upper panel provided access to annotation fields and drop-down menus for selecting affix types and grammatical categories, while the lower panel displayed the color-coded tokenized text, enabling immediate visual verification of each annotation. This environment streamlined the manual workflow and ensured both consistency and accuracy throughout the annotation and review process.

Figure 2

Screenshot of the annotation environment used in the creation of the Noor-Ghateh dataset. Annotators segmented tokens, assigned lemmas, and specified morphological features such as part of speech, case, number, and gender. The interface displays the tokenized Arabic text (lower panel), the feature categories (left panel), and the annotation fields (top panel).

4 Dataset Description

Repository name

Zenodo.

Object name

NoorGhateh_v3.csv, NoorGhateh_wordsxml.json, NoorGhateh_wordsxml.xml.

Format

CSV, XML, and JSON (all UTF-8 encoded).

Creation dates

2023-07-01 to 2024-05-01.

Dataset creators

Computer Research Center for Islamic Sciences (Noor) — data preparation and annotation; Behrouz Minaei-Bidgoli (School of Computer Engineering, Iran University of Science and Technology) — academic supervision; Huda AlShuhayeb (School of Computer Engineering, Iran University of Science and Technology) — dataset validation, analysis, and paper preparation.

Language

Arabic.

License

Creative Commons Attribution 4.0 International (CC BY 4.0).

Publication date

2025-10-17.

4.1 Sample Data Excerpt

Table 2 provides an illustrative excerpt from the released 313-token sample. Each entry shows the original Arabic token, its segmentation, lemma, part-of-speech tag (POS), and an English gloss. This bilingual presentation helps readers from different disciplinary backgrounds—linguistics, philology, and computational NLP—quickly understand the dataset’s annotation format.

Table 2

Sample entries from the Noor-Ghateh dataset showing Arabic tokens, segmentations, lemmas, POS tags, and English glosses.

TOKEN | SEGMENTATION | LEMMA | POS | ENGLISH GLOSS
الطهارة | ال+طهارة | طهارة | NOUN | Purification
ويعتمد | و+يعتمد | اعتماد | VERB | relies on
اربعة | أربعة | أربعة | NOUN | four
للوضوء | ل+ال+وضوء | وضوء | NOUN | for ablution
فالواجب | ف+ال+واجب | واجب | PART | thus, the obligation is

To complement the tabular representation, the dataset is also released in a machine-readable XML format, which preserves the full hierarchical structure of each token’s morphological attributes.

4.2 Source Documentation

The annotated sample is extracted from the classical Imami legal compendium Sharaye al-Islam fi Masail al-Halal wa-l-Haram by al-Muḥaqqiq al-Ḥillī (d. 676 AH).

Representativeness of the Jurisprudential Domain. The selected text belongs to the core genre of Classical Arabic religious scholarship: fiqh (Islamic jurisprudence). This domain is characterized by a highly formulaic style, dense clitic usage, conditional legal phrasing, and domain-specific terminology. Such features introduce systematic morphological challenges distinct from MSA newswire or narrative classical prose. Including fiqh data therefore broadens the typological and stylistic coverage of Arabic NLP resources and enables segmentation systems to be evaluated across sharply differing linguistic registers.

Socio-Cultural and Scientific Value. The jurisprudential domain also holds significant socio-cultural and academic value. Works such as Sharaye al-Islam remain foundational teaching texts in Shīʿī seminaries (Hawzas) and document centuries of legal reasoning, classification, and intellectual history. A digitized, morphologically annotated sample from such a canonical text enables:

  • computational access to Islamic legal heritage;

  • development of NLP tools for processing religious-legal discourse;

  • cross-domain robustness testing for segmentation and morphological analysis systems;

  • comparative studies between classical and modern varieties of Arabic;

  • future work in syntactic, semantic, or argument-mining analysis of legal texts.

Format Accessibility. To maximize reusability, the dataset is released in CSV, XML, and JSON formats. In the CSV version, morphological annotations are encoded as embedded XML strings within a dedicated column, whereas the standalone XML and JSON versions explicitly preserve the full attribute hierarchies. These machine-readable representations facilitate integration with NLP pipelines and support downstream tasks such as lemmatization, POS tagging, and rule-based or neural morphological parsing.

Together, the dataset’s domain specificity, philological precision, and multi-format release make it a valuable resource for Arabic computational linguistics, digital humanities, and cross-domain NLP evaluation.

4.3 Schema and Attribute Documentation

For transparency and interoperability, all morphological attributes used in Noor-Ghateh are formally defined in Table 3. Each attribute corresponds to a distinct linguistic or structural feature annotated at the token level.

Table 3

Formal schema definitions and valid values for the 15 morphological attributes used in the Noor-Ghateh dataset.

TAG | DEFINITION | VALUES/RANGE | EXAMPLE FROM DATASET
Seq | Sequential index of the word or morpheme | Natural numbers (1–n) | 1, 2, 3
Slice | Surface form of the morpheme | Arabic string | وضوء, ال, كتاب
Entry | Canonical or normalized form | Arabic string | وضوء, ال, طهارة
Affix | Morphological category (prefix, stem, suffix) | پيشوند (prefix), هسته (stem), پسوند (suffix) | پيشوند (prefix)
Pos | Part-of-speech tag | اسم (noun), فعل (verb), حرف (particle) | فعل (verb)
Lemma | Base lemma of the stem | Arabic lemma string | وضوء, طهارة, اعتماد
Case | Grammatical case | مرفوع (nominative), منصوب (accusative), مجرور (genitive), مبني بر كسر (indeclinable, ending in kasra) | مبني بر كسر
Categ | Abstract derivational pattern | فَعَلَ يَفْعُلُ, اِسْتِفْعَال, اِفْعِلَّال, اِفْتِعَال, إِفعَال, تَفَعُّل, فَعْلَلَة, ثلاثي مجرد (bare triliteral), تهي (empty) | ثلاثي مجرد
DervT | Derivational type | جامد غير مصدري (non-derived, non-masdar), اسم مفعول (passive participle), اسم فاعل (active participle), مصدر (verbal noun) | جامد غير مصدري
Num | Grammatical number | مفرد (singular), مثنى (dual), جمع (plural) | مفرد
Root | Underlying triliteral or quadriliteral root | Arabic root sequence | وجب, وضء, كتب
TOV | Verb type (aspect/transitivity) | 1 = active; 2 = passive; 3 = reflexive; 4 = intensive | 1
Time | Tense/temporal category | ماض (past), مضارع (present), أمر (imperative) | ماض
Voic | Verb voice | معلوم (active), مجهول (passive) | معلوم
Kol | Functional classification | جاره (prepositional), تعريف (definite article), عاطفه (conjunctive), استينافيه (resumptive), شرطيه (conditional) | جاره

The dataset is provided in UTF-8 CSV format with three columns: (1) ID, a unique row identifier; (2) Token, the surface form extracted from the source text; and (3) AnnotationXML, an embedded XML string encoding all 15 attributes. The XML schema is fully documented on Zenodo (DOI: 10.5281/zenodo.18138582) and can be processed using standard XML libraries such as ElementTree, lxml, or equivalent parsers.
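
A minimal sketch of reading the embedded AnnotationXML column with the standard-library ElementTree parser is shown below. The two excerpt rows and their attribute values are invented for illustration, and only a few of the 15 attributes are shown; only the column names come from the documentation above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-row excerpt in the documented ID/Token/AnnotationXML
# layout. CSV doubles the inner quotes of the embedded XML string.
CSV_TEXT = (
    'ID,Token,AnnotationXML\n'
    '1,الطهارة,"<word Seq=""1"" Slice=""الطهارة"" Pos=""اسم"" Lemma=""طهارة"" />"\n'
    '2,و,"<word Seq=""2"" Slice=""و"" Pos=""حرف"" Lemma="""" />"\n'
)

records = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    word = ET.fromstring(row["AnnotationXML"])
    # Particles and clitics legitimately carry empty Lemma/Root fields.
    records.append(
        (row["ID"], word.get("Slice"), word.get("Pos"), word.get("Lemma") or None)
    )

print(records)
```

Mapping empty Lemma fields to `None` matches the dataset's convention that functional morphemes intentionally lack lemmas and roots.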

Missing Lemmas and Roots. Absence of lemma or root values does not indicate incomplete annotation; it reflects valid linguistic properties. Several token types—especially particles, clitics, and functional prefixes such as لـ, الـ, and و—do not possess dictionary lemmas or triliteral roots. These morphemes serve syntactic functions rather than lexical ones; hence, lemma and root fields are intentionally left blank for such tokens.

4.4 XML Representation Example

The dataset is also released in a structured XML format to support integration with corpus-linguistic workflows and NLP tools. Each token and its attributes are represented as an independent <word> element, facilitating parsing and feature extraction. The authoritative XML schema corresponds to the Zenodo release (version 3).

For typographic robustness in the manuscript PDF, the snippet below illustrates the XML structure and attribute layout with values omitted. Fully instantiated XML examples (with Arabic surface forms and complete attribute values) are provided in the Zenodo distribution.


<word Seq="" Slice="" Entry="" Affix="" Pos="" Lemma="" Root="" TOV="" Time="" Voic="" Num="" Kol="" Lang="" />

This hierarchical representation enables precise and reproducible processing and supports downstream tasks such as morphological disambiguation, lemmatization, and the training of machine-learning models for Classical Arabic.
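
Assuming an enclosing <words> wrapper (the release's actual container element may differ) and invented attribute values, the per-word attributes can be consumed directly for feature extraction:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment following the documented <word> attribute layout;
# the wrapper element and all values below are assumptions.
XML_TEXT = """<words>
  <word Seq="1" Slice="ف" Pos="حرف" Kol="عاطفه" Lemma="" />
  <word Seq="2" Slice="ال" Pos="حرف" Kol="تعريف" Lemma="" />
  <word Seq="3" Slice="واجب" Pos="اسم" Kol="" Lemma="واجب" />
</words>"""

# Tally the POS distribution over the morphemes — a typical first step
# in corpus profiling or feature extraction.
pos_counts = Counter(w.get("Pos") for w in ET.fromstring(XML_TEXT).iter("word"))
print(pos_counts)
```

The same iteration pattern extends to any of the other documented attributes (Root, Time, Voic, and so on).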

5 Reuse Potential

The NoorGhateh dataset provides an important resource for researchers in computational linguistics, machine learning, and Arabic natural language processing. Its fine-grained manual annotations and carefully curated structure make it suitable for both training and evaluation across multiple applications.

5.1 Benchmarking and Evaluation

Researchers can use NoorGhateh as a gold-standard benchmark to evaluate the performance of morphological analyzers, segmentation systems, lemmatizers, and part-of-speech taggers for Classical Arabic. Previous benchmark studies have relied on resources such as the Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002), the Standard Arabic Morphological Analyzer (SAMA) (Maamouri et al., 2010), and the Penn Arabic Treebank (PATB) (Maamouri et al., 2004).

The manual annotations in NoorGhateh support precise accuracy measurements and detailed error analyses across different system architectures, offering a domain-specific complement to existing resources while enabling rigorous comparison of rule-based, statistical, and neural approaches.
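
As a sketch of such an accuracy measurement, token-level segmentation accuracy against the gold standard can be computed as below; the gold segmentations are taken from Table 2, while the predicted outputs are hypothetical.

```python
def segmentation_accuracy(gold, pred):
    """Token-level exact-match accuracy between aligned lists of
    '+'-joined gold and predicted segmentations."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["ال+طهارة", "و+يعتمد", "ل+ال+وضوء"]   # gold segmentations (Table 2)
pred = ["ال+طهارة", "ويعتمد", "ل+ال+وضوء"]    # hypothetical system output
print(round(segmentation_accuracy(gold, pred), 3))  # -> 0.667
```

Finer-grained metrics (boundary precision/recall, per-feature accuracy) follow the same alignment logic.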

5.2 Machine Learning and Model Development

The dataset provides high-quality training material for neural architectures such as BiLSTM, Transformer, and BERT-based models aimed at Arabic morphological disambiguation, segmentation, and feature prediction. Prior research has shown that consistent morphological preprocessing substantially improves translation and NLP performance (Habash & Sadat, 2006). NoorGhateh extends this insight by offering fully verified, domain-specific annotations that support supervised learning, reproducible experiments, and meaningful cross-model evaluations.

5.3 Cross-domain and Diachronic Studies

By complementing modern Arabic corpora such as PATB (Maamouri et al., 2004) and SAMA (Maamouri et al., 2010), NoorGhateh enables research on domain adaptation, transfer learning, and diachronic linguistic analysis. Its Classical Arabic annotation style supports comparative investigations into how morphological patterns shift between medieval jurisprudential prose and contemporary MSA, building on the comparative tradition of earlier morphological frameworks (Buckwalter, 2002).

5.4 Digital Humanities and Corpus Linguistics

For scholars in digital humanities, NoorGhateh offers structured access to morphological phenomena, lexical variation, and syntactic patterns in pre-modern Islamic texts. Its design parallels the structured annotation layers found in SAMA (Maamouri et al., 2010) and PATB (Maamouri et al., 2004), making it suitable for integration with broader Classical Arabic corpora and enabling large-scale quantitative analyses of legal and juristic writing.

5.5 Educational and Demonstration Applications

The publicly released 313-token sample serves as an effective pedagogical resource for illustrating Arabic morphological annotation, clitic segmentation, and feature extraction in classroom settings or workshop demonstrations. It can be used alongside established resources (Buckwalter, 2002; Maamouri et al., 2004, 2010) to teach the evolution of annotation practices and to demonstrate differences between Classical and Modern Arabic.

5.6 Enhancing AI-Based Approaches in Arabic NLP

Recent advances in Arabic NLP have been driven largely by large-scale pretrained language models and toolkits designed for MSA and dialectal varieties. Models such as AraBERT (Antoun et al., 2021a), QARiB (Abdelali et al., 2021), MARBERT (Abdul-Mageed et al., 2021), and AraGPT2 (Antoun et al., 2021b) demonstrate substantial improvements in understanding and generation tasks. Frameworks such as CAMeL Tools (Obeid et al., 2020) and datasets from the MADAR project (Bouamor et al., 2018) further support morphology, dialect identification, and language modeling.

However, these systems are trained predominantly on contemporary or social-media text. AraBERT relies on approximately 77 GB of MSA data from OSCAR, Arabic Wikipedia, and news corpora, while MARBERT and QARiB are pretrained largely on dialectal Twitter collections. Classical jurisprudential prose is essentially absent from these pretraining corpora, which limits such models on texts like Sharaye al-Islam and motivates domain-specific resources such as NoorGhateh for evaluation and fine-tuning.

5.7 Limitations and Challenges

While the Noor-Ghateh dataset provides a verified benchmark for Classical Arabic morphological analysis, several limitations should be acknowledged to guide its effective use. First, the dataset originates entirely from a single source text—Sharaye al-Islam—which reflects the linguistic and stylistic characteristics of a specific scholarly domain. As a result, certain lexical items, syntactic structures, and morphological features common in other genres (e.g., literary, Qurʾānic, or modern texts) are underrepresented. This domain specificity may affect the generalizability of models trained exclusively on this dataset to other forms of Arabic.

Second, the open-access sample contains 313 tokens, designed primarily for demonstration, benchmarking, and methodological replication rather than for large-scale model training. Researchers intending to develop data-intensive systems are therefore encouraged to request access to the full 223,690-token corpus.

Third, as with all manually annotated resources, minor inconsistencies or subjective decisions may persist despite the multi-stage verification process. Users should exercise caution when extending the annotation schema to new data or comparing results across different morphological frameworks.

Finally, the dataset is presented in normalized Arabic orthography without diacritics, which may introduce ambiguity for tasks that rely on phonological or syntactic disambiguation. These factors should be considered when applying the dataset in downstream applications such as translation, lemmatization, or morphological disambiguation.

Acknowledgements

The authors acknowledge the Computer Research Center for Islamic Sciences (Noor) for preparing and annotating the dataset, and the School of Computer Engineering at Iran University of Science and Technology (IUST) for providing academic supervision and computational infrastructure. Huda AlShuhayeb verified the dataset’s integrity, prepared the data paper, and conducted analytical evaluation to support future deep-learning and segmentation research.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Behrouz Minaei-Bidgoli: Conceptualization, Supervision, Methodology, Validation, Review.

Huda AlShuhayeb: Methodology, Investigation, Formal analysis, Validation, Visualization, Writing – original draft, Writing – review & editing.

DOI: https://doi.org/10.5334/johd.409 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 12, 2025 | Accepted on: Dec 11, 2025 | Published on: Feb 10, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Behrouz Minaei-Bidgoli, Huda AlShuhayeb, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.