OpenITI MAKHZAN: An Open Annotated Dataset of Arabic, Persian, Ottoman Turkish, and Urdu Print and Manuscript Data

Jonathan Parkes Allen; John Mullan; Lorenz Nigst; Mathew Barber; Taimoor Shahid-Khan; Masoumeh Seydi; Danlu Chen; Yufei Weng; Nikolai Vogler; Jacob Murel; Osama Eshera; Taylor Berg-Kirkpatrick; David Smith; Sarah Bowen Savant; Matthew Thomas Miller

doi:10.5334/johd.465

(1) Context and motivation

The Open Islamicate Texts Initiative (OpenITI) was founded in 2017 with the goal of building the digital infrastructure for the study of the premodern Islamicate world. The underperformance of automatic transcription via Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for Arabic-script languages long stymied the production of large-scale corpora in Islamicate languages such as Persian, Arabic, Ottoman Turkish, and Urdu. As late as 2017, researchers evaluating the performance of the most prominent proprietary and open-source OCR systems found that they all had “low performance accuracy rates,” typically “below 75 per cent” (Alghamdi & Teahan, 2017). The character accuracy of these engines on handwritten documents was, not surprisingly, no better and often considerably worse, especially in the case of the open-source engines tested, ranging from 60% to the mid-70% range on in-domain data (Clausner et al., 2018; Keinan-Schoonbaert, 2019, 2020; Vogler et al., 2022).

OpenITI has sought to address these barriers to high-quality automatic transcription by collaborating with computer scientists to help optimize the Kraken OCR system (Kiessling, 2019) for Arabic-script languages and to develop two new machine learning techniques, unsupervised pretraining by lacuna reconstruction (Vogler et al., 2022) and automatic collation-driven distant supervision (Smith et al., 2023), that reduce the reliance on human-annotated training data for model improvement. These efforts have yielded the highest transcription accuracy reported to date for both print and handwritten Arabic-script documents (Kiessling et al., 2017; Kiessling et al., 2021; Vogler et al., 2022; Smith et al., 2023).

But, as Alghamdi and Teahan (2017) note, the barriers to improving Arabic-script automatic transcription are not only technical in nature; rather, there is also a lack of large, high-quality, open-access Arabic-script datasets to train and evaluate on. The OpenITI MAKHZAN dataset aims to help address this desideratum by publishing all the ground truth data we have produced in our projects from 2018–2025, which is composed of the following (Table 1):

Table 1

Distribution of documents by language and type.

	ARABIC	PERSIAN	URDU	OTTOMAN TURKISH	MALAY	MULTI-LANGUAGE	SUM
Manuscript	121	44	0	26	1	8	200
Publication	0	0	8	0	0	0	8
Sum	121	44	8	26	1	8	208

[i] Note: The multi-language category includes documents tagged as Arabic-Persian, Ottoman Turkish-Arabic-Persian, Ottoman Turkish-Baleybelen, Ottoman Turkish-Persian, Arabic-Javanese, and Malay-Arabic. The eight documents contain pages in multiple languages.

(2) Dataset Description

Repository location

10.5281/zenodo.19861912

Repository name

Zenodo

Object name

OpenITI MAKHZAN (Version 2026.1.2)

Format names and versions

png or jpg, xml, alto

Creation dates

2019-07-01–2025-05-01

Dataset creators

Jonathan Parkes Allen, John Mullan, Taimoor Shahid-Khan, Şaban Ağalar, Mehdy Sedaghat Payam, Gennady Kurin, Matthew Thomas Miller, Asad Zaman (UMD); Lorenz Nigst, Mathew Barber, Shireen Zaineb, Yusuf Umrethwala, Masoumeh Seydi, Sarah Bowen Savant (AKU); Furrukh Shahzad, Mirza Sohail Liaqat, Moqueet Afzaal, Muhammad Imtiaz Ahmad, Abdul Majeed, Madiha Asghar, Mehreen Tahir, and Hafsah Khalid (FCCU); David Smith (Northeastern); Taylor Berg-Kirkpatrick, Nikolai Vogler (UCSD)

Languages

Arabic, Persian, Ottoman Turkish, Urdu, Malay, Javanese, Baleybelen

License

CC BY-NC-SA

Publication date

April 28, 2026

(3) Method

The OpenITI MAKHZAN dataset comprises ground truth transcription and segmentation data for 208 documents (200 manuscripts and eight printed Urdu publications) encompassing nearly 1,500 page images in Arabic, Persian, Ottoman Turkish, Urdu, Malay, Javanese, and the Ottoman-period constructed language Baleybelen, produced between 2019 and 2025 using the eScriptorium platform. Across all sources, the dataset draws on manuscripts and printed texts that range chronologically across more than a millennium, from 900 CE to 1990 CE, and that are held in 30 repositories worldwide, including the Staatsbibliothek zu Berlin, the Bibliothèque nationale de France, Princeton University Library, the Bodleian Library, and the Library of Congress. A full list is provided in the metadata. The dataset provides page images paired with line-level segmentation and transcription exported in ALTO XML format. To the extent possible, information about transcription dates of manuscripts and printing dates of publications, along with the scripts represented, is provided in the accompanying metadata. Two tables in this section provide a full breakdown of the dataset’s 1,497 page images by language and document type (Table 3), and by script type and language (Table 4).

All transcriptions in this dataset followed a common workflow: digitized page images were uploaded to eScriptorium, manually segmented at the line level, and then manually transcribed. Transcriptions were subsequently reviewed by a second team member before export. While the bulk of this work was completed by core OpenITI team members, some transcription was carried out by supervised graduate students and participants in our digital paleography courses, whose work was vetted by a team member before inclusion. Whether a given page has been vetted once or twice is indicated in the metadata under the heading “Category.” The manuscript data originates from two distinct production contexts, described below as Source 1 and Source 2, and the metadata identifies the source of each page under the heading “Dataset Source.”
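As an illustration of how the paired ALTO XML files can be consumed, the following is a minimal sketch that reads line-level transcriptions using only the Python standard library. It assumes the usual ALTO layout, in which each TextLine element contains String children carrying a CONTENT attribute, and it assumes that lines carry a BASELINE attribute as is common in eScriptorium exports. The function name read_alto_lines and the two-line sample page are hypothetical and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(xml_text):
    """Return a list of (line_id, baseline, text) tuples from ALTO XML text."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Collect the CONTENT of every String child of this line.
        parts = [
            child.get("CONTENT", "")
            for child in element.iter()
            if local_name(child.tag) == "String"
        ]
        # Untranscribed (empty) lines end up with an empty text string.
        text = " ".join(p for p in parts if p.strip())
        lines.append((element.get("ID"), element.get("BASELINE"), text))
    return lines


# Hypothetical page: one transcribed line and one segmented but empty line.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock ID="b1">
    <TextLine ID="l1" BASELINE="10 20 200 22"><String CONTENT="بسم الله"/></TextLine>
    <TextLine ID="l2" BASELINE="10 60 200 61"><String CONTENT=""/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

page_lines = read_alto_lines(SAMPLE)
```

For a file on disk, the same traversal can start from ET.parse(path).getroot() instead of ET.fromstring. Matching on local tag names rather than a fixed namespace is a design choice made here so that the sketch does not depend on a particular ALTO schema version.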

Source 1: Systematic model-development and evaluation data

The first source of manuscript data consists of transcriptions produced through the Arabic-script Corpus Development and Curation (ACDC) initiative with the specific aim of developing and evaluating HTR and segmentation models for Arabic-script manuscripts. This source comprises 523 pages from 150 manuscripts and is itself divided into two complementary subsets — a training set built around canonical texts and a broader evaluation set — each described below. The metadata identifies these subsets under the heading “Dataset Source” as ACDC-Training and ACDC-Evaluation respectively.

Training data: canonical texts. To build a training corpus with both depth and diversity, we selected five widely copied canonical works (two in Persian and three in Arabic) for which numerous manuscript copies were accessible in digitized and publicly available formats. By sampling many copies of a small number of texts, we were able to hold textual content roughly constant while maximizing variation in script, hand, date, and region. Table 2 summarizes the five canonical works and their representation in the dataset.

Table 2

Canonical works in the ACDC training set.

TEXT	LANGUAGE	MANUSCRIPTS	PAGES	DATE RANGE	GENRE	SCRIPTS REPRESENTED
Dīvān of Ḥāfiẓ	Persian	7	37	1420–1587	Poetry	Nastaliq, Ta’liq
Gulistān of Saʿdī	Persian	17	157	1500–1841	Poetry	Nastaliq, Ta’liq, Eurabic, Naskh
al-Qāmūs al-Muḥīṭ of al-Fīrūzābādī	Arabic	12	43	1350–1700	Dictionaries	Naskh, Sudani, Maghribi
Dalāʾil al-Khayrāt of al-Jazūlī	Arabic	25	128	1650–1873	Devotional works	Maghribi, Naskh, Naskh-Nastaliq interlinear, Maghribi-Sudani, Naskh-Nastaliq mixed
Sharḥ al-ʿAqāʾid al-Nasafīya of al-Taftāzānī	Arabic	9	35	1441–1750	Commentary/Theology	Naskh, Ta’liq, Nastaliq
Total	2	70	400	1350–1873	4 genres	9 scripts or combinations

Table 3

Distribution of page images by language and document type.

	ARABIC	PERSIAN	URDU	OTTOMAN TURKISH	MALAY	MULTI-LANGUAGE	SUM
Manuscript	822	343	0	241	2	15	1,423
Publication	0	0	74	0	0	0	74
Sum	822	343	74	241	2	15	1,497

Table 4

Distribution of page images by script and language.³

	ARABIC	PERSIAN	URDU	OTTOMAN TURKISH	MALAY	MULTI-LANGUAGE	SUM
Naskh	732	19	1	229	0	8	989
Nastaliq	10	200	73	0	0	3	286
Eurabic	0	100	0	0	0	0	100
Maghribi	45	0	0	0	0	0	45
Ta’liq	12	9	0	5	0	0	26
Naskh/Nastaliq mixed	13	0	0	0	0	0	13
Shikaste	0	11	0	0	0	0	11
Divani	0	0	0	5	0	0	5
Jawi Naskh	0	0	0	0	2	3	5
Yemeni Naskh	4	0	0	0	0	0	4
Maghribi/Sudani	3	0	0	0	0	0	3
Other	3	4	0	2	0	1	10

The choice of these five texts was motivated by several factors: all are among the most frequently copied works in their respective literary traditions, ensuring wide availability of digitized manuscript copies; they span four distinct genres (poetry, lexicography, devotional literature, and theological commentary); and their manuscript traditions encompass a broad chronological range from the mid-14th to the late 19th century. Across the 70 manuscripts sampled, the training set captures nine distinct script types or combinations, including not only the dominant naskh and nastaliq but also Maghribi, Sudani, Ta’liq, and even a handwritten instance of “Eurabic,” a term coined by Thomas Milo to describe European Arabic script and type that erroneously combines naskh and thuluth styles.¹

Evaluation data: broad sampling. In addition to the canonical training texts, we assembled a broader evaluation set designed to test the generalization of HTR models beyond their training domain. Where the training set prioritized depth (many copies of few texts), the evaluation set prioritized breadth: 123 pages drawn from 80 manuscripts representing 75 distinct works, 39 genres, and 14 script types, with a date range extending from approximately 900 CE to 1874 CE. Most works in the evaluation set are represented by only one or two pages, ensuring that evaluation covers a wide variety of hands, text layouts, and textual traditions rather than testing repeatedly on similar material. The evaluation set includes works on geography, astrology, grammar, law, history, agriculture, philosophy, medicine, and the occult sciences, among other fields — a substantially broader generic range than the training set.

Towards the goal of achieving wide linguistic coverage, the evaluation set extends beyond Arabic and Persian to include manuscripts in Ottoman Turkish, Malay, Javanese, and even the Ottoman-period constructed language Baleybelen.
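For readers who wish to score a model against the evaluation subset, the following is a minimal sketch of one common metric, the character error rate (total edit distance divided by total reference length, so that character accuracy is one minus this value). It is a generic formulation and is not presented as the evaluation protocol used in the cited studies; the function names and the example strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """Total edits divided by total reference characters over (ref, hyp) pairs."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_chars


# Hypothetical ground-truth and predicted text for two lines.
example_pairs = [("كتاب", "كتب"), ("قلم", "قلم")]
cer = character_error_rate(example_pairs)
```

Because the distance is counted in Unicode code points, any diacritics present in the ground truth count as separate characters; whether to strip them before scoring is a choice left to the user.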

Specificities and limitations. Because the two ACDC subsets were designed for different model-development purposes, they differ in sampling depth and transcription scope. The canonical training subset was built from multiple witnesses of five frequently copied texts and therefore includes deeper sampling from selected manuscripts where this was useful for training across script, hand, date, and layout variation. In this subset, transcription generally targeted the main text body; interlinear, marginal, and other paratextual material was excluded, except in manuscripts where the main text itself is distributed across both central and marginal text blocks. The evaluation subset was prepared for broader out-of-domain testing and is therefore more comprehensively segmented: on the selected pages, all visible lines were segmented and their regions marked, with the majority of main, marginal, interlinear, and other textual material also transcribed. Across both the training and evaluation subsets, we generally favored manuscripts in more legible hands written for a readership beyond the copyist, rather than those in tight, cramped, rapid, or otherwise difficult-to-read hands intended primarily for private consumption. Some regional script variations are underrepresented due to limitations in digitization or public availability, most notably ṣīnī, the Arabic script developed in China.

Source 2: Research-driven transcriptions

The second source of manuscript data consists of 974 pages from 50 manuscripts and eight printed calligraphic editions, representing approximately 50 distinct works across roughly 15 broad genre families, ranging from poetry, narrative romance, hagiography, devotional literature, hadith, commentary, and Sufism to agriculture, medicine, philosophy, travel writing, apologetics, cosmology, literary criticism, and paratextual or bibliographic material. Unlike the systematic sampling of Source 1, Source 2 varies significantly in depth: it includes both single-page samples and substantial transcriptions — the longest being a 315-page transcription of the Riḥlat Muṣṭafā al-Laṭīfī. The chronological range is broad, extending from 1200 CE to 1990 CE. While the material primarily uses naskh and nastaliq, it also includes specialized hands like Yemeni naskh and shikaste. The selections reflect the particular interests and linguistic specializations of members of the OpenITI team. A number of these texts were transcribed as part of our ongoing multi-witness manuscript collation and edition creation project, for which participants select a given text that has numerous manuscript exemplars displaying varying degrees of internal textual variability and change over time. A subset was produced by students in our summer digital paleography and codicology course and some by colleagues during our weekly manuscript reading sessions; this data has been double-checked. However, the majority of Source 2’s transcription data has been vetted only once and is therefore relatively less reliable than the double-checked data.

Urdu print data

The eight printed Urdu publications in this dataset were produced in collaboration with Forman Christian College in Pakistan. These publications occupy a distinctive position in the dataset: typeset printing arrived relatively late in the Urdu publishing world, and most of the works included here were originally composed by professional scribes (kātibs) whose hand-written texts were then reproduced using offset printing. The result is a unique category of hand-composed scripts standardized for mass publication — technically “printed,” but produced by scribal hands rather than movable type or digital typesetting, and exhibiting less variation than manuscripts created for individual, non-standardized production. It is this close kinship with manuscript writing that makes Urdu print data a natural complement to the manuscript data in MAKHZAN, and the primary reason that print data in this release is limited to Urdu: Arabic and Persian print, produced through conventional typesetting, follows a different production logic described elsewhere (Allen, forthcoming).

Our primary motivation in selecting Urdu texts was typographic diversity: we sought to represent the range of nastaliq hands used in Urdu offset printing. The selected texts were printed between 1921 and 1990 and span diverse genres, with original compositions dating from the eighteenth through twentieth centuries. Copyright restrictions on books published in India and Pakistan, where layered copyright laws protect works for 50 to 60 years depending on the jurisdiction, prevented us from releasing images of many additional Urdu texts we have transcribed in the broader OpenITI project.²

Transcribed and empty lines

The ALTO XML files in the dataset distinguish between transcribed lines and empty, untranscribed lines. For each page, the metadata records both a “Transcribed Lines Count,” meaning lines with at least one non-blank text string, and a “Total Lines Count,” meaning all segmented lines, including empty lines with bounding boxes but no transcription. In the majority of pages (1,227 of 1,497), all segmented lines are transcribed. The remaining 270 pages contain some empty lines — typically marginal, interlinear, or other paratextual text that was segmented but not transcribed, and most of these pages belong to manuscripts from Source 1. Pages with empty lines can still be used for line-level HTR training on their transcribed portions, but users requiring complete page-level transcriptions should filter for pages where the two counts are equal.
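The filtering step described above can be sketched as follows. The column names “Transcribed Lines Count” and “Total Lines Count” are those given in the text; the assumption that the metadata can be read as CSV, the “Page ID” column, and the sample rows are hypothetical and serve only to make the example self-contained.

```python
import csv
import io

# Column names quoted in the text above.
TRANSCRIBED = "Transcribed Lines Count"
TOTAL = "Total Lines Count"


def split_by_completeness(rows):
    """Separate fully transcribed pages from pages that contain empty lines."""
    complete, partial = [], []
    for row in rows:
        if int(row[TRANSCRIBED]) == int(row[TOTAL]):
            complete.append(row)
        else:
            partial.append(row)
    return complete, partial


# Hypothetical metadata excerpt; "Page ID" is an assumed column name.
SAMPLE_CSV = """Page ID,Transcribed Lines Count,Total Lines Count
p1,15,15
p2,12,19
p3,21,21
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
complete_pages, partial_pages = split_by_completeness(rows)
```

Pages in partial_pages remain usable for line-level training on their transcribed lines; only workflows that need a full page-level transcription would restrict themselves to complete_pages.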

(4) Results and discussion

The OpenITI MAKHZAN dataset, reflecting its multi-year, multi-project origin, exhibits variation in transcription practices. These variations arise from the inherent interpretive challenges of historical Arabic script and the technical limitations of the Unicode standard, resulting in a degree of minimal normalization across the dataset.

Transcription of historical documents is an inherently interpretive act. Decisions emerge from negotiations between the transcriber’s scholarly or pedagogical purposes, the complex material realities of the documents themselves — which reflect not only the scribe’s practices but also linguistic and generic contexts, interventions by subsequent hands, and the preservation and digitization histories of the originals — and anticipated computational applications that may not have been envisioned during the transcription process. In practical terms, these questions manifest most clearly where a given transcription sits along the spectrum from fully normalized (rendering historical text into modern representation, eliding scribal and orthographic peculiarities) to strictly diplomatic (preserving orthographic variation and scribal particularities to the fullest extent allowable by the Unicode standard).

Recent scholarship in digital humanities and HTR research has demonstrated that acknowledging and documenting these interpretive choices strengthens both the scholarly integrity of a dataset and the robustness of computational models trained on it (Panagiotidou et al., 2022). In our dataset, individual transcribers made necessary judgment calls when confronting ambiguous letterforms, variant spellings, or damaged text — decisions shaped by their expertise in specific historical periods and linguistic traditions. The resulting variation across the MAKHZAN dataset reflects the authentic diversity of transcription practices encountered in real-world scholarly environments — diversity that falls within the range most HTR frameworks can handle and that, in practice, enhances model generalization. The remainder of this section documents the key normalization decisions and representative inconsistencies across the dataset. This documentation is not comprehensive, and a systematic analysis of how these transcription variations impact downstream model performance is beyond the scope of this paper. However, the documentation provided here — with transcription decisions divided into three broad sets, along with a representative image demonstrating some features (Figure 1) — is intended to enable such investigations by future research. Notably, many inconsistencies documented here arise from the dataset’s position at the intersection of Arabic and Persian orthographic traditions, where transcribers may have prioritized different regional or historical standards for the same visual features.

Figure 1

A randomly selected line from one of the transcribed pages (line 39, Doc ID: 4060, Part ID: 913829). The line illustrates various normalization decisions by highlighting each manuscript feature and the transcription decision beneath it in the same color. Reading from right to left, it shows examples of the following features described in the tables below: 1.2/1.3 (pink), 3.4 (orange), 3.4 (purple), 2.6 (green), 2.1 (blue), 1.2/1.3 (pink), 2.7 (green), 3.4 (orange), and 3.6 (blue).

Character normalization and substitution. The following table documents instances where a specific glyph or character in the manuscript is consistently or inconsistently represented by a different, typically more standard, Unicode character.

NO.	FEATURE IN MANUSCRIPT	TRANSCRIPTION DECISION	IMPLICATION/JUSTIFICATION
1.1	Alif Maqṣūra (ى)	Transcribed as standard Yāʾ (ي).	Since the difference is primarily in pronunciation, the transcription has been normalized to not differentiate between ي and ى. This is also consistent with transcribing inconsistently dotted Yāʾ in Persian, as well as transcribing the elongated nastaliq Persian Yāʾ (ے) as dotless ى. See the two entries that follow.
1.2	Persian Yāʾ (ی) (used in Nastaliq for both sounds)	Often transcribed as the standard Persian Yāʾ (ی), even when the manuscript uses the form of Urdu Barī Yāʾ (ے).	Normalization to the standard Persian keyboard/encoding for the Yāʾ sound.
1.3	Inconsistently dotted Yāʾ (in manuscripts)	Almost always transcribed without dots (ی), except in some cases involving interlinear Arabic text, where the dotted Yāʾ (ي) may be used.	Prioritizes the common Persian usage over scribal inconsistency.
1.4	Kāf-like Gāf	Inconsistently normalized; sometimes transcribed as Gāf (گ) even when written like Kāf (ک).	Normalization depends on the transcriber’s judgment based on linguistic context.
1.5	Hamza on a Yāʾ seat (e.g., ئ)	Inconsistently normalized; transcribed as either a plain Yāʾ (ی/ي) for Persian or the full Hamza-on-Yāʾ (ئ) for Arabic	Cross-Linguistic Variation: Inconsistency arises from competing orthographic norms. Transcribers following Arabic conventions often retain the Hamza (سائر), while those following modern Persian standards often simplify it to a plain Yāʾ (سایر). This reflects the hybrid nature of the dataset’s linguistic tradition.
1.6	Alif Madda (آ) used as a long vowel (e.g., غآيب)	Transcribed as a standard Alif (غايب). The Alif Madda (آ) is ignored where it is used to write /ā/ and not /ʾā/. This usage of alif madda for /ā/ instead of /ʾā/ is particularly common in words with the sequence /ā/ + hamza + /i/ such as غآيب which is transcribed as غايب	Phonetic Normalization: In these instances, the madda serves as a marker for the long vowel /ā/ rather than a glottal stop. Normalizing to a standard Alif aligns with modern Persian/Arabic conventions and ensures the HTR model associates the visual stroke with the correct semantic word, and avoids a character combination that may be non-standard or difficult to render.

Omission or contextual interpretation. The following table covers elements from the manuscript that are either omitted, transcribed based on context, or represented by an approximation due to Unicode limitations.

NO.	FEATURE IN MANUSCRIPT	TRANSCRIPTION DECISION	IMPLICATION/JUSTIFICATION
2.1	Diacritics (Tashdīd and Vowels)	Not transcribed (omitted).	A typical practice for the dataset, prioritizing text content over full vocalization/diacritization. Some transcriptions may contain diacritics, such as the tashdīd (ّ).
2.2	Hamza on the line (ء)	Omitted where absent in the manuscript; missing hamzas on the line at the end of words are not added (e.g., transcribed as إبقا instead of إبقاء).	A diplomatic choice to not normalize missing hamzas.
2.3	Catchwords	Not transcribed (omitted).	Catchwords are often excluded from the main text transcription, but some pages may have them transcribed.
2.4	Hard-to-distinguish Letters (e.g., Rāʾ (ر) vs. Dāl (د), Wāw (و) vs. Dāl (د))	Differentiated and transcribed based on linguistic context.	A necessary interpretive judgment due to the cursive and varied nature of handwriting.
2.5	Letters without dots (Iʿjām)	Dots are added contextually (e.g., ٮلال as بلال or حسٮن as حسين).	Normalization to make the text readable; characters without dots are transcribed based on linguistic context.
2.6	Rubrication Marks (overlines, underlines, color)	Not represented; transcribed as regular text.	These codicological features are not captured in the transcription layer.
2.7	Nastaliq Shorthands	Cannot be represented in Unicode; transcribed as the full, contextually correct word/letters.	Unicode limitations necessitate normalization.

Script/style-specific normalization (Nastaliq and regional variations). The following table highlights normalizations due to the modern Arabic Unicode block’s inability to perfectly render specific glyphs common in historical scripts like Nastaliq, or regional styles.

NO.	FEATURE IN MANUSCRIPT	TRANSCRIPTION DECISION	IMPLICATION/JUSTIFICATION
3.1	Sīn with teeth vs. toothless Sīn (in Nastaliq)	The standard Sīn (س) is used, as Unicode offers only one rendition.	Unicode limitation results in normalization. This applies to variations in other glyphs as well.
3.2	Sīn with an inverted ‘ب’ or dots underneath (scribal marks)	Transcribed as the standard Sīn (س) without the extra marks.	Scribal marks used only to indicate Sīn are ignored in the final transcription.
3.3	Confused three-dot clusters (e.g., Yāʾ (ي) and Bāʾ (ب) written together with three dots to resemble Pay (پی))	Transcribed according to linguistic context (e.g., Bī (بي written as پی) as بی instead of the visual پی).	Interpretation based on known words/grammar overrides a strictly diplomatic visual representation.
3.4	Do Chashmi Hāʾ (ها) vs. Nastaliq Hāʾ (ہا)	The Nastaliq Hāh (used in Persian) is transcribed using the modern Persian Hāʾ (ها).	Modern Persian keyboards lack the Nastaliq Hāh, forcing a normalization to the most functionally equivalent character. This also applies to the Tā Marbūṭa variants (e.g., vs. ).
3.5	Initial/Middle/Final Hāʾ	Most cases follow the rendering from the modern Persian keyboard.	Avoids switching between Persian and Urdu keyboard Hāʾ variants to maintain consistency.
3.6	Unusual glyphs, such as triangular three-dot misra marker	Transcribed as closely as possible, such as three consecutive dots for the triangular three-dot misra marker (ellipses).	No Unicode representation exists for the specific marker.
3.7	Unauthorized Connections	Letters that should be terminal but are connected by the scribe are transcribed as disconnected glyphs.	Unicode cannot represent the resulting non-standard connected glyphs.
3.8	Confused three-dot letters (e.g., Thāʾ (ث) vs. toothless Shīn (ش))	Differentiated and transcribed based on linguistic context.	A necessary interpretive judgment when single-toothed ث and ش are visually identical.
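Users who need a uniform character inventory, for example when comparing transcriptions produced under different conventions, may want to fold some of the variants documented in the three tables above before training or scoring. The following is a minimal sketch of such a step. The mapping is illustrative only: it is not the normalization applied by the project, it is lossy (dotted and dotless Yāʾ become indistinguishable), and it deliberately leaves hamza seats and alif madda untouched because those decisions depend on linguistic context.

```python
# Short vowels, tanwin, shadda (tashdid), and sukun: U+064B through U+0652,
# plus the superscript alef U+0670. Compare entry 2.1 in the tables above.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}

# Illustrative folding of variants discussed in entries 1.1-1.3 and 3.4.
# This is a hypothetical mapping, not the project's own normalization.
FOLD = {
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> FARSI YEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH (dotted) -> FARSI YEH
    "\u06D2": "\u06CC",  # ARABIC LETTER YEH BARREE -> FARSI YEH
    "\u06C1": "\u0647",  # ARABIC LETTER HEH GOAL -> HEH
}
FOLD_TABLE = str.maketrans(FOLD)


def harmonize(text):
    """Remove the listed diacritics, then fold yeh and heh variants."""
    stripped = "".join(ch for ch in text if ch not in DIACRITICS)
    return stripped.translate(FOLD_TABLE)


# Hypothetical input: a vocalized word ending in alif maqsura.
example = harmonize("\u0639\u064E\u0644\u064E\u0649")
```

Applying the same function to both ground truth and model output keeps a comparison insensitive to these particular variants, at the cost of discarding distinctions that a strictly diplomatic transcription would preserve.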

(5) Implications and applications

The OpenITI MAKHZAN dataset expands the resources available for developing advanced text-recognition systems for Arabic-script languages. As a large, open-access collection of ground truth spanning Arabic, Persian, Ottoman Turkish, Urdu, and additional Arabic-script languages, it enables researchers to train and rigorously evaluate new OCR and HTR models across a wide spectrum of scripts and styles. In particular, MAKHZAN is designed to work hand-in-hand with open-source OCR frameworks. For example, it can be integrated directly into the Kraken engine and the eScriptorium platform, which the project team has helped develop and optimize for Arabic-script processing. Using this dataset in combination with Kraken and eScriptorium allows for the training of high-accuracy recognition models tailored to historical Islamicate texts, thereby pushing the state of the art in multilingual OCR. By addressing the chronic shortage of large, high-quality Arabic-script training data, MAKHZAN also contributes to training more robust language models in this domain, helping future AI systems read both printed and handwritten sources.

Beyond the technical sphere, the implications of the OpenITI MAKHZAN dataset are equally relevant for digital humanities and the study of premodern Islamic cultures. Improved OCR/HTR models arising from this dataset will accelerate the digitization of Islamic manuscripts and historical publications, which has long been hampered by low recognition accuracy. MAKHZAN will therefore help scholars unlock vast textual corpora, from classical Arabic literature to Ottoman chronicles and Urdu print media, that were previously accessible only through laborious manual transcription. Researchers in history, literature, and linguistics can leverage models trained on MAKHZAN to automatically transcribe and search within new sources, opening new avenues for large-scale text analysis, comparative studies, and digital scholarship. The dataset’s inclusion of diverse scripts and calligraphic styles means that models developed with it will be versatile. Scholars can incorporate these models into user-friendly interfaces like eScriptorium to collaboratively transcribe manuscripts from a wide range of historical and linguistic contexts. In sum, MAKHZAN serves as a catalyst for cross-disciplinary research: enabling computer scientists to deal with real-world data and empowering humanists with more powerful tools for digitization and analysis of written artefacts.

While the MAKHZAN dataset supports a wide range of text-recognition use cases, it is important to note its intentional limitations. First, the Ottoman Turkish transcriptions in this dataset are in their original Arabic script form and have not been transliterated into Latin script as is often the norm in Ottoman studies. While Arabic-script transcription is the appropriate format for an OCR/HTR ground truth dataset, scholars accustomed to working with transliterated Ottoman Turkish text should be aware of this distinction. Second, the suitability of this dataset for line segmentation varies by subset. A total of 292 pages — comprising the ACDC-Evaluation subset (123 pages), the Urdu print data (74 pages), and a selection of Persian and Arabic manuscripts in the Research subset (95 pages) — include comprehensive segmentation of all text regions on each page, including main text, marginal, interlinear, and other elements. These pages are flagged as “TRUE” in the “Segmentation Complete” column of the metadata and may be suitable for training or evaluating line segmentation models. The remaining pages generally include segmentation only of the main text body; interlinears, marginalia, catchwords, and other paratextual elements were typically not segmented or transcribed, with the exception of manuscripts in which the main text appears in both main and marginal text blocks. No subset includes region-level labeling (such as higher-order document structuring zones for headers, columns, marginalia, or images), so the dataset cannot be used for region classification or reading-order determination tasks. Users aiming for end-to-end document processing can combine MAKHZAN-trained HTR models with separate layout-analysis tools, and may find the comprehensively segmented subset useful for line-segmentation work specifically. 
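Selecting the comprehensively segmented pages can be sketched as follows. The “Segmentation Complete” and “Dataset Source” columns and the “TRUE” flag are those named in the text; reading the metadata as CSV, the “Page ID” column, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical metadata excerpt; "Page ID" is an assumed column name.
SAMPLE_CSV = """Page ID,Dataset Source,Segmentation Complete
p1,ACDC-Evaluation,TRUE
p2,ACDC-Training,FALSE
p3,Research,TRUE
p4,Research,FALSE
"""


def segmentation_complete(rows):
    """Keep rows whose "Segmentation Complete" value reads TRUE."""
    return [
        row for row in rows
        if row["Segmentation Complete"].strip().upper() == "TRUE"
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
usable = segmentation_complete(rows)

# Count the selected pages per production context.
pages_per_source = Counter(row["Dataset Source"] for row in usable)
```

The comparison is made case-insensitively as a precaution, since the exact spelling of the flag in other export formats is not specified here.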
Looking forward, the release of OpenITI MAKHZAN may inspire the development of analogous open datasets that include structural and layout annotations, moving toward comprehensive solutions that handle both page layout and text content.

Notes

[1] These designs, produced from the early modern period into the 19th century, often substituted thuluth-specific serifs and strokes into naskh text, reflecting a Western typographic tradition isolated from authentic Arabic script expertise. See Milo 2013, 98–101.

[2] The trained models resulting from this data will be made publicly available even where the underlying images cannot be released due to copyright.

[3] Table 4 groups closely related scripts’ descriptions from the metadata together to show the distribution of major script traditions across languages. “Naskh” includes ordinary naskh as well as one page of varied naskh. “Naskh/Nastaliq mixed” combines naskh-nastaliq mixed and naskh-nastaliq interlinear pages, since both involve the co-presence of naskh and nastaliq. “Other” consolidates Sudani (2), Bihari (2), Naskh/Divani mixed (2), Kufic (1), Jawi (1), Riq’a (1), Naskh/Shikaste mixed (1).

[4] For more information on these grant projects, please see: https://openiti.org/projects/.

Data Accessibility Statement

The OpenITI MAKHZAN dataset is deposited on Zenodo under a CC BY-NC-SA license. The dataset, including all page images, ALTO XML transcription files, and the accompanying metadata file, is available at: https://doi.org/10.5281/zenodo.19861912. The metadata file includes a “Dataset Source” column identifying the production context (ACDC-Training, ACDC-Evaluation, or Research) for each page.

Acknowledgements

The authors would like to express their sincere gratitude to all individuals who contributed to the project through data creation, discussion, and feedback. This includes participants in the first Digital Islamicate Paleography and Codicology Summer School (June 1 to August 20, 2021) and the second such summer school (June 5 to August 25, 2023), members of our various OpenITI online reading groups held from 2022 to the present, and numerous others whose insights and efforts were invaluable.

Funding Statement

Work on the OpenITI MAKHZAN dataset was funded through grants from the Mellon Foundation (OpenITI Arabic-script OCR Catalyst Project Phase I, Bridge, and Phase II), the National Endowment for the Humanities (HAA-277203-21), and the National Science Foundation (#2200333, #2200334).⁴

Author Contributions

Matthew Thomas Miller: conceptualization; supervision; funding acquisition; project administration; writing—original draft, review, and editing.

Taimoor Shahid-Khan: data curation; investigation; methodology; resources; supervision; writing—review and editing.

Lorenz Nigst: data curation; investigation; writing—review.

Jonathan Parkes Allen: data curation; investigation; resources; writing—review and editing.

Osama Eshera: data curation; investigation; writing—review and editing.

John Mullan: software; methodology; validation; writing—review.

David Smith: methodology; conceptualization; supervision; funding acquisition; formal analysis; validation; writing—review.

Mathew Barber: data curation; writing—review.

Masoumeh Seydi: data curation.

Danlu Chen: software; validation.

Yufei Weng: software; validation.

Nikolai Vogler: software; validation.

Jacob Murel: data curation; validation.

Taylor Berg-Kirkpatrick: methodology; funding acquisition; supervision; writing—review.

Peter Verkinderen: data curation; resources.

Sarah Bowen Savant: conceptualization; funding acquisition; supervision; writing—review and editing.