OpenITI MAKHZAN: An Open Annotated Dataset of Arabic, Persian, Ottoman Turkish, and Urdu Print and Manuscript Data

[i] Note: The multi-language category includes documents tagged as Arabic-Persian, Ottoman Turkish-Arabic-Persian, Ottoman Turkish-Baleybelen, Ottoman Turkish-Persian, Arabic-Javanese, and Malay-Arabic. The eight documents in this category contain pages in more than one language.

Table 2

Canonical works in the ACDC training set.

TEXT	LANGUAGE	MANUSCRIPTS	PAGES	DATE RANGE	GENRE	SCRIPTS REPRESENTED
Dīvān of Ḥāfiẓ	Persian	7	37	1420–1587	Poetry	Nastaliq, Ta’liq
Gulistān of Saʿdī	Persian	17	157	1500–1841	Poetry	Nastaliq, Ta’liq, Eurabic, Naskh
al-Qāmūs al-Muḥīṭ of al-Fīrūzābādī	Arabic	12	43	1350–1700	Dictionaries	Naskh, Sudani, Maghribi
Dalāʾil al-Khayrāt of al-Jazūlī	Arabic	25	128	1650–1873	Devotional works	Maghribi, Naskh, Naskh-Nastaliq interlinear, Maghribi-Sudani, Naskh-Nastaliq mixed
Sharḥ al-ʿAqāʾid al-Nasafīya of al-Taftāzānī	Arabic	9	35	1441–1750	Commentary/Theology	Naskh, Ta’liq, Nastaliq
Total	2 languages	70	400	1350–1873	4 genres	9 scripts or combinations

Table 3

Distribution of page images by language and document type.

	ARABIC	PERSIAN	URDU	OTTOMAN TURKISH	MALAY	MULTI-LANGUAGE	SUM
Manuscript	822	343	0	241	2	15	1,423
Publication	0	0	74	0	0	0	74
Sum	822	343	74	241	2	15	1,497

Table 4

Distribution of page images by script and language.³

	ARABIC	PERSIAN	URDU	OTTOMAN TURKISH	MALAY	MULTI-LANGUAGE	SUM
Naskh	732	19	1	229	0	8	989
Nastaliq	10	200	73	0	0	3	286
Eurabic	0	100	0	0	0	0	100
Maghribi	45	0	0	0	0	0	45
Ta’liq	12	9	0	5	0	0	26
Naskh/Nastaliq mixed	13	0	0	0	0	0	13
Shikaste	0	11	0	0	0	0	11
Divani	0	0	0	5	0	0	5
Jawi Naskh	0	0	0	0	2	3	5
Yemeni Naskh	4	0	0	0	0	0	4
Maghribi/Sudani	3	0	0	0	0	0	3
Other	3	4	0	2	0	1	10
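
The counts in Tables 3 and 4 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the figures are copied from the two tables, while the names `TABLE_4`, `TABLE_3_SUM`, and `column_totals` are hypothetical and not part of the dataset or its tooling.

```python
# Page counts from Table 4, one row per script.
# Column order: Arabic, Persian, Urdu, Ottoman Turkish, Malay, Multi-language.
TABLE_4 = {
    "Naskh": [732, 19, 1, 229, 0, 8],
    "Nastaliq": [10, 200, 73, 0, 0, 3],
    "Eurabic": [0, 100, 0, 0, 0, 0],
    "Maghribi": [45, 0, 0, 0, 0, 0],
    "Ta'liq": [12, 9, 0, 5, 0, 0],
    "Naskh/Nastaliq mixed": [13, 0, 0, 0, 0, 0],
    "Shikaste": [0, 11, 0, 0, 0, 0],
    "Divani": [0, 0, 0, 5, 0, 0],
    "Jawi Naskh": [0, 0, 0, 0, 2, 3],
    "Yemeni Naskh": [4, 0, 0, 0, 0, 0],
    "Maghribi/Sudani": [3, 0, 0, 0, 0, 0],
    "Other": [3, 4, 0, 2, 0, 1],
}

# "Sum" row of Table 3 (pages per language, same column order).
TABLE_3_SUM = [822, 343, 74, 241, 2, 15]


def column_totals(table):
    """Add up each language column across all script rows."""
    return [sum(column) for column in zip(*table.values())]


totals = column_totals(TABLE_4)
print(totals == TABLE_3_SUM)  # per-language totals agree across the two tables
print(sum(totals))            # overall number of page images
```

Summing each script row the same way gives the SUM column of Table 4.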

A randomly selected line from one of the transcribed pages (line 39, Doc ID: 4060, Part ID: 913829). The line illustrates various normalization decisions: each manuscript feature and the transcription decision beneath it are highlighted in the same color. From right to left, it shows examples of the following features described in the normalization tables: 1.2/1.3 (pink), 3.4 (orange), 3.4 (purple), 2.6 (green), 2.1 (blue), 1.2/1.3 (pink), 2.7 (green), 3.4 (orange), and 3.6 (blue).

NO.	FEATURE IN MANUSCRIPT	TRANSCRIPTION DECISION	IMPLICATION/JUSTIFICATION
1.1	Alif Maqṣūra (ى)	Transcribed as standard Yāʾ (ي).	Since the difference is primarily in pronunciation, the transcription has been normalized to not differentiate between ي and ى. This is also consistent with transcribing inconsistently dotted Yāʾ in Persian, as well as transcribing the elongated nastaliq Persian Yāʾ (ے) as dotless ى. See the two entries that follow.
1.2	Persian Yāʾ (ی) (used in Nastaliq for both sounds)	Often transcribed as the standard Persian Yāʾ (ی), even when the manuscript uses the form of Urdu Barī Yāʾ (ے).	Normalization to the standard Persian keyboard/encoding for the Yāʾ sound.
1.3	Inconsistently dotted Yāʾ (in manuscripts)	Almost always transcribed without dots (ی), except in some cases involving interlinear Arabic text, where the dotted Yāʾ (ي) may be used.	Prioritizes the common Persian usage over scribal inconsistency.
1.4	Kāf-like Gāf	Inconsistently normalized; sometimes transcribed as Gāf (گ) even when written like Kāf (ک).	Normalization depends on the transcriber’s judgment based on linguistic context.
1.5	Hamza on a Yāʾ seat (e.g., ئ)	Inconsistently normalized; transcribed as either a plain Yāʾ (ی/ي) for Persian or the full Hamza-on-Yāʾ (ئ) for Arabic.	Cross-Linguistic Variation: Inconsistency arises from competing orthographic norms. Transcribers following Arabic conventions often retain the Hamza (سائر), while those following modern Persian standards often simplify it to a plain Yāʾ (سایر). This reflects the hybrid nature of the dataset’s linguistic tradition.
1.6	Alif Madda (آ) used as a long vowel (e.g., غآيب)	Transcribed as a standard Alif (غايب). The madda is ignored where Alif Madda (آ) writes /ā/ rather than /ʾā/. This usage is particularly common in words with the sequence /ā/ + hamza + /i/, such as غآيب, which is transcribed as غايب.	Phonetic Normalization: In these instances, the madda marks the long vowel /ā/ rather than a glottal stop. Normalizing to a standard Alif aligns with modern Persian/Arabic conventions, ensures the HTR model associates the visual stroke with the correct word, and avoids a character combination that may be non-standard or difficult to render.
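
Because rows 1.1 to 1.3 describe several Yāʾ code points being used for the same letter, anyone comparing two transcriptions (or HTR output against a transcription) may want to fold the variants before comparing. The following is a minimal sketch of such a fold, not the authors' procedure: the function name `fold_yeh` and the choice of Persian Yāʾ (U+06CC) as the target are assumptions made for illustration.

```python
# Code points for the Yāʾ variants discussed in rows 1.1-1.3.
ARABIC_YEH = "\u064A"    # ARABIC LETTER YEH (dotted)
ALEF_MAKSURA = "\u0649"  # ARABIC LETTER ALEF MAKSURA (dotless)
FARSI_YEH = "\u06CC"     # ARABIC LETTER FARSI YEH
YEH_BARREE = "\u06D2"    # ARABIC LETTER YEH BARREE (elongated Urdu form)

# Assumption: every variant is folded to FARSI_YEH for comparison purposes.
YEH_FOLD = str.maketrans({
    ARABIC_YEH: FARSI_YEH,
    ALEF_MAKSURA: FARSI_YEH,
    YEH_BARREE: FARSI_YEH,
})


def fold_yeh(text: str) -> str:
    """Replace all Yāʾ variants with one code point so strings compare equal."""
    return text.translate(YEH_FOLD)


# The same word typed with Alif Maqṣūra and with dotted Yāʾ.
with_maksura = "\u0639\u0644\u0649"
with_dotted = "\u0639\u0644\u064A"
print(fold_yeh(with_maksura) == fold_yeh(with_dotted))
```

A fold like this discards the distinction the transcribers sometimes keep for interlinear Arabic (row 1.3), so it suits evaluation or search rather than producing new ground truth.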

NO.	FEATURE IN MANUSCRIPT	TRANSCRIPTION DECISION	IMPLICATION/JUSTIFICATION
2.1	Diacritics (Tashdīd and Vowels)	Not transcribed (omitted).	A typical practice for the dataset, prioritizing text content over full vocalization/diacritization. Some transcriptions may contain diacritics, such as the tashdīd (ّ).
2.2	Hamza on the line (ء)	Omitted where absent in the manuscript; missing hamzas on the line at the end of words are not added (e.g., transcribed as إبقا instead of إبقاء).	A diplomatic choice to not normalize missing hamzas.
2.3	Catchwords	Not transcribed (omitted).	Catchwords are often excluded from the main text transcription, but some pages may have them transcribed.
2.4	Hard-to-distinguish Letters (e.g., Rāʾ (ر) vs. Dāl (د), Wāw (و) vs. Dāl (د))	Differentiated and transcribed based on linguistic context.	A necessary interpretive judgment due to the cursive and varied nature of handwriting.
2.5	Letters without dots (Iʿjām)	Dots are added contextually (e.g., ٮلال as بلال or حسٮن as حسين).	Normalization to make the text readable; characters without dots are transcribed based on linguistic context.
2.6	Rubrication Marks (overlines, underlines, color)	Not represented; transcribed as regular text.	These codicological features are not captured in the transcription layer.
2.7	Nastaliq Shorthands	Cannot be represented in Unicode; transcribed as the full, contextually correct word/letters.	Unicode limitations necessitate normalization.
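
Row 2.1 notes that diacritics are normally omitted but that some transcriptions still contain them. A user who needs uniform text could strip them after the fact. The sketch below assumes that the relevant marks are the Arabic harakat in U+064B to U+0652 (which includes the tashdīd, U+0651) plus the superscript alif (U+0670); both that range and the name `strip_diacritics` are illustrative assumptions, not part of the dataset's tooling.

```python
import re

# U+064B-U+0652: fathatan through sukun, including shadda (U+0651).
# U+0670: superscript alef. The exact set to strip is an assumption.
HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove vowel marks and tashdīd, leaving the base letters untouched."""
    return HARAKAT.sub("", text)


# A fully vocalized word: meem, damma, hah, fatha, meem, shadda, fatha, dal.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
print(strip_diacritics(vocalized))
```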

NO.	FEATURE IN MANUSCRIPT	TRANSCRIPTION DECISION	IMPLICATION/JUSTIFICATION
3.1	Sīn with teeth vs. toothless Sīn (in Nastaliq)	The standard Sīn (س) is used, as Unicode offers only one rendition.	Unicode limitation results in normalization. This applies to variations in other glyphs as well.
3.2	Sīn with an inverted ‘ب’ or dots underneath (scribal marks)	Transcribed as the standard Sīn (س) without the extra marks.	Scribal marks used only to indicate Sīn are ignored in the final transcription.
3.3	Confused three-dot clusters (e.g., Yāʾ (ي) and Bāʾ (ب) written together with three dots to resemble Pay (پی))	Transcribed according to linguistic context (e.g., Bī (بي written as پی) as بی instead of the visual پی).	Interpretation based on known words/grammar overrides a strictly diplomatic visual representation.
3.4	Do Chashmi Hāʾ (ها) vs. Nastaliq Hāʾ (ہا)	The Nastaliq Hāʾ (used in Persian) is transcribed using the modern Persian Hāʾ (ها).	Modern Persian keyboards lack the Nastaliq Hāʾ, forcing a normalization to the most functionally equivalent character. This also applies to the Tā Marbūṭa variants (e.g., vs. ).
3.5	Initial/Middle/Final Hāʾ	Most cases follow the rendering from the modern Persian keyboard.	Avoids switching between Persian and Urdu keyboard Hāʾ variants to maintain consistency.
3.6	Unusual glyphs, such as triangular three-dot misra marker	Transcribed as closely as possible, such as three consecutive dots (an ellipsis) for the triangular three-dot misra marker.	No Unicode representation exists for the specific marker.
3.7	Unauthorized Connections	Letters that should be terminal but are connected by the scribe are transcribed as disconnected glyphs.	Unicode cannot represent the resulting non-standard connected glyphs.
3.8	Confused three-dot letters (e.g., Thāʾ (ث) vs. toothless Shīn (ش))	Differentiated and transcribed based on linguistic context.	A necessary interpretive judgment when single-toothed ث and ش are visually identical.
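
Rows 3.4 and 3.5 say that most transcriptions follow the modern Persian keyboard for Hāʾ, which implies that a few may not. One way to locate such cases is to scan a transcription for the Hāʾ code points typically produced by Urdu keyboards. This is a hedged sketch: treating U+06C1 and U+06BE as the "Urdu keyboard" variants is an assumption, and `flag_urdu_heh` is a hypothetical helper.

```python
import unicodedata

# Assumption: these two code points are the Urdu-keyboard Hāʾ variants of interest.
URDU_HEH_VARIANTS = {
    "\u06C1",  # ARABIC LETTER HEH GOAL
    "\u06BE",  # ARABIC LETTER HEH DOACHASHMEE
}


def flag_urdu_heh(text: str):
    """Return (index, Unicode name) for each Urdu-style Hāʾ found in the text."""
    return [
        (index, unicodedata.name(char))
        for index, char in enumerate(text)
        if char in URDU_HEH_VARIANTS
    ]


# Heh goal + alef is flagged; Persian/Arabic heh (U+0647) + alef is not.
print(flag_urdu_heh("\u06C1\u0627"))
print(flag_urdu_heh("\u0647\u0627"))
```

Flagging rather than silently replacing leaves the decision about each occurrence to a human reviewer, which matches the case-by-case judgment described in the tables.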

DOI: https://doi.org/10.5334/johd.465 | Journal eISSN: 2059-481X

Language: English

Page range: 69 - 69

Submitted on: Nov 9, 2025

Accepted on: May 5, 2026

Published on: May 26, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: HTR, OCR, Arabic, Persian, Ottoman Turkish, Urdu

© 2026 Jonathan Parkes Allen, John Mullan, Lorenz Nigst, Mathew Barber, Taimoor Shahid-Khan, Masoumeh Seydi, Danlu Chen, Yufei Weng, Nikolai Vogler, Jacob Murel, Osama Eshera, Taylor Berg-Kirkpatrick, David Smith, Sarah Bowen Savant, Matthew Thomas Miller, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.

Volume 12 (2026): Issue 1
