OpenITI MAKHZAN: An Open Annotated Dataset of Arabic, Persian, Ottoman Turkish, and Urdu Print and Manuscript Data

Open Access
|May 2026

Figures & Tables

Table 1

Distribution of documents by language and type.

TYPE | ARABIC | PERSIAN | URDU | OTTOMAN TURKISH | MALAY | MULTI-LANGUAGE | SUM
Manuscript | 121 | 44 | 0 | 26 | 1 | 8 | 200
Publication | 0 | 0 | 8 | 0 | 0 | 0 | 8
Sum | 121 | 44 | 8 | 26 | 1 | 8 | 208

Note: The multi-language category includes documents tagged as Arabic-Persian, Ottoman Turkish-Arabic-Persian, Ottoman Turkish-Baleybelen, Ottoman Turkish-Persian, Arabic-Javanese, and Malay-Arabic. These eight documents contain pages in multiple languages.

Table 2

Canonical works in the ACDC training set.

TEXT | LANGUAGE | MANUSCRIPTS | PAGES | DATE RANGE | GENRE | SCRIPTS REPRESENTED
Dīvān of Ḥāfiẓ | Persian | 7 | 37 | 1420–1587 | Poetry | Nastaliq, Ta’liq
Gulistān of Saʿdī | Persian | 17 | 157 | 1500–1841 | Poetry | Nastaliq, Ta’liq, Eurabic, Naskh
al-Qāmūs al-Muḥīṭ of al-Fīrūzābādī | Arabic | 12 | 43 | 1350–1700 | Dictionaries | Naskh, Sudani, Maghribi
Dalāʾil al-Khayrāt of al-Jazūlī | Arabic | 25 | 128 | 1650–1873 | Devotional works | Maghribi, Naskh, Naskh-Nastaliq interlinear, Maghribi-Sudani, Naskh-Nastaliq mixed
Sharḥ al-ʿAqāʾid al-Nasafīya of al-Taftāzānī | Arabic | 9 | 35 | 1441–1750 | Commentary/Theology | Naskh, Ta’liq, Nastaliq
Total | 2 | 70 | 400 | 1350–1873 | 4 genres | 9 scripts or combinations
Table 3

Distribution of page images by language and document type.

DOCUMENT TYPE | ARABIC | PERSIAN | URDU | OTTOMAN TURKISH | MALAY | MULTI-LANGUAGE | SUM
Manuscript | 822 | 343 | 0 | 241 | 2 | 15 | 1,423
Publication | 0 | 0 | 74 | 0 | 0 | 0 | 74
Sum | 822 | 343 | 74 | 241 | 2 | 15 | 1,497
Table 4

Distribution of page images by script and language.

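SCRIPT | ARABIC | PERSIAN | URDU | OTTOMAN TURKISH | MALAY | MULTI-LANGUAGE | SUM
Naskh | 732 | 19 | 1 | 229 | 0 | 8 | 989
Nastaliq | 10 | 200 | 73 | 0 | 0 | 3 | 286
Eurabic | 0 | 100 | 0 | 0 | 0 | 0 | 100
Maghribi | 45 | 0 | 0 | 0 | 0 | 0 | 45
Ta’liq | 12 | 9 | 0 | 5 | 0 | 0 | 26
Naskh/Nastaliq mixed | 13 | 0 | 0 | 0 | 0 | 0 | 13
Shikaste | 0 | 11 | 0 | 0 | 0 | 0 | 11
Divani | 0 | 0 | 0 | 5 | 0 | 0 | 5
Jawi Naskh | 0 | 0 | 0 | 0 | 2 | 3 | 5
Yemeni Naskh | 4 | 0 | 0 | 0 | 0 | 0 | 4
Maghribi/Sudani | 3 | 0 | 0 | 0 | 0 | 0 | 3
Other | 3 | 4 | 0 | 2 | 0 | 1 | 10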
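
The script-by-language counts in Table 4 can be cross-checked against the per-language page totals in Table 3. The following is a minimal sketch of that arithmetic. The cell values are a reading of the Table 4 rows (each row split so that it adds up to its Sum column), and the names `LANGUAGES`, `TABLE_4`, and `language_totals` are illustrative only.

```python
# Column order used in Tables 3 and 4.
LANGUAGES = ["Arabic", "Persian", "Urdu", "Ottoman Turkish", "Malay", "Multi-language"]

# Page images per script, as read from Table 4 (assumed cell split).
TABLE_4 = {
    "Naskh": [732, 19, 1, 229, 0, 8],
    "Nastaliq": [10, 200, 73, 0, 0, 3],
    "Eurabic": [0, 100, 0, 0, 0, 0],
    "Maghribi": [45, 0, 0, 0, 0, 0],
    "Ta'liq": [12, 9, 0, 5, 0, 0],
    "Naskh/Nastaliq mixed": [13, 0, 0, 0, 0, 0],
    "Shikaste": [0, 11, 0, 0, 0, 0],
    "Divani": [0, 0, 0, 5, 0, 0],
    "Jawi Naskh": [0, 0, 0, 0, 2, 3],
    "Yemeni Naskh": [4, 0, 0, 0, 0, 0],
    "Maghribi/Sudani": [3, 0, 0, 0, 0, 0],
    "Other": [3, 4, 0, 2, 0, 1],
}


def language_totals(table):
    """Sum each language column across all script rows."""
    return {
        lang: sum(row[i] for row in table.values())
        for i, lang in enumerate(LANGUAGES)
    }


totals = language_totals(TABLE_4)
grand_total = sum(totals.values())
print(totals)
print(grand_total)
```

Under this reading, the column totals agree with the Sum row of Table 3 (822, 343, 74, 241, 2, 15) and the grand total is 1,497.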
Figure 1

A randomly selected line from one of the transcribed pages (line 39, Doc ID: 4060, Part ID: 913829). The line illustrates several normalization decisions: each manuscript feature and the transcription decision beneath it are highlighted in the same color. Reading from right to left, it shows examples of the following features, numbered as in the transcription-decision tables: 1.2/1.3 (pink), 3.4 (orange), 3.4 (purple), 2.6 (green), 2.1 (blue), 1.2/1.3 (pink), 2.7 (green), 3.4 (orange), and 3.6 (blue).

NO. | FEATURE IN MANUSCRIPT | TRANSCRIPTION DECISION | IMPLICATION/JUSTIFICATION
1.1 | Alif Maqṣūra (ى) | Transcribed as standard Yāʾ (ي). | Because the difference is primarily one of pronunciation, the transcription does not differentiate between ي and ى. This is consistent with the treatment of inconsistently dotted Yāʾ in Persian and with transcribing the elongated Nastaliq Persian Yāʾ (ے) as dotless ى. See the two entries that follow.
1.2 | Persian Yāʾ (ی) (used in Nastaliq for both sounds) | Often transcribed as the standard Persian Yāʾ (ی), even when the manuscript uses the form of Urdu Barī Yāʾ (ے). | Normalization to the standard Persian keyboard/encoding for the Yāʾ sound.
1.3 | Inconsistently dotted Yāʾ (in manuscripts) | Almost always transcribed without dots (ی), except in some cases involving interlinear Arabic text, where the dotted Yāʾ (ي) may be used. | Prioritizes the common Persian usage over scribal inconsistency.
1.4 | Kāf-like Gāf | Inconsistently normalized; sometimes transcribed as Gāf (گ) even when written like Kāf (ک). | Normalization depends on the transcriber’s judgment based on linguistic context.
1.5 | Hamza on a Yāʾ seat (e.g., ئ) | Inconsistently normalized; transcribed as either a plain Yāʾ (ی/ي) for Persian or the full Hamza-on-Yāʾ (ئ) for Arabic. | Cross-linguistic variation: the inconsistency arises from competing orthographic norms. Transcribers following Arabic conventions often retain the Hamza (سائر), while those following modern Persian standards often simplify it to a plain Yāʾ (سایر). This reflects the hybrid nature of the dataset’s linguistic tradition.
1.6 | Alif Madda (آ) used as a long vowel (e.g., غآيب) | Transcribed as a standard Alif (غايب). The Alif Madda (آ) is ignored where it writes /ā/ rather than /ʾā/. This usage is particularly common in words with the sequence /ā/ + hamza + /i/, such as غآيب, which is transcribed as غايب. | Phonetic normalization: in these instances, the madda marks the long vowel /ā/ rather than a glottal stop. Normalizing to a standard Alif aligns with modern Persian/Arabic conventions, ensures that the HTR model associates the visual stroke with the correct word, and avoids a character combination that may be non-standard or difficult to render.
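
The Yāʾ decisions in entries 1.1 to 1.3 amount to folding several code points into one, with the target depending on the language of the text. The sketch below illustrates that idea only. The dataset's decisions were made by human transcribers, and the function name `normalize_yeh`, the language labels, and the choice to fold every dotted Yāʾ in Persian are assumptions made for this example. The context-dependent entries (1.4 to 1.6) cannot be captured by a fixed mapping and are left out.

```python
ALEF_MAKSURA = "\u0649"  # ARABIC LETTER ALEF MAKSURA
ARABIC_YEH = "\u064A"    # ARABIC LETTER YEH (dotted)
FARSI_YEH = "\u06CC"     # ARABIC LETTER FARSI YEH
YEH_BARREE = "\u06D2"    # ARABIC LETTER YEH BARREE


def normalize_yeh(text: str, language: str) -> str:
    """Fold Yeh variants to one code point (illustrative, not the project's tool)."""
    if language == "arabic":
        # Entry 1.1: Alif Maqsura is not distinguished from standard Yeh.
        mapping = {ALEF_MAKSURA: ARABIC_YEH}
    elif language == "persian":
        # Entries 1.2 and 1.3: Bari Yeh and dotted Yeh become Persian Yeh.
        # Assumption: the interlinear-Arabic exception in 1.3 is ignored here.
        mapping = {YEH_BARREE: FARSI_YEH, ARABIC_YEH: FARSI_YEH}
    else:
        raise ValueError(f"unsupported language: {language}")
    return text.translate(str.maketrans(mapping))


arabic_out = normalize_yeh("\u0639\u0644\u0649", "arabic")
persian_out = normalize_yeh("\u0645\u06D2", "persian")
print(arabic_out, persian_out)
```

Keeping the mapping per language avoids applying the Arabic rule to Persian text, where the same glyphs are normalized differently.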
NO. | FEATURE IN MANUSCRIPT | TRANSCRIPTION DECISION | IMPLICATION/JUSTIFICATION
2.1 | Diacritics (Tashdīd and Vowels) | Not transcribed (omitted). | A typical practice for the dataset, prioritizing text content over full vocalization/diacritization. Some transcriptions may nonetheless contain diacritics, such as the tashdīd (ّ).
2.2 | Hamza on the line (ء) | Omitted where absent in the manuscript; missing hamzas on the line at the end of words are not added (e.g., transcribed as إبقا instead of إبقاء). | A diplomatic choice not to normalize missing hamzas.
2.3 | Catchwords | Not transcribed (omitted). | Catchwords are generally excluded from the main text transcription, though some pages may have them transcribed.
2.4 | Hard-to-distinguish letters (e.g., Rāʾ (ر) vs. Dāl (د), Wāw (و) vs. Dāl (د)) | Differentiated and transcribed based on linguistic context. | A necessary interpretive judgment due to the cursive and varied nature of handwriting.
2.5 | Letters without dots (Iʿjām) | Dots are added contextually (e.g., ٮلال as بلال or حسٮن as حسين). | Normalization to make the text readable; characters without dots are transcribed based on linguistic context.
2.6 | Rubrication marks (overlines, underlines, color) | Not represented; transcribed as regular text. | These codicological features are not captured in the transcription layer.
2.7 | Nastaliq shorthands | Cannot be represented in Unicode; transcribed as the full, contextually correct word/letters. | Unicode limitations necessitate normalization.
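
Entry 2.1 notes that diacritics are normally omitted but that some transcriptions still contain them. Anyone who needs a uniformly undiacritized text could strip them after the fact. The following is a minimal sketch under stated assumptions: the range U+064B to U+0652 (fathatan through sukun, including shadda) is this example's definition of "diacritics", and `strip_diacritics` is a hypothetical helper, not part of the dataset's tooling.

```python
import re

# Assumed set of marks to drop: U+064B (fathatan) through U+0652 (sukun).
# This includes the short vowels and the shadda (tashdid), and leaves
# precomposed letters such as Alif Madda untouched.
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic vowel marks and shadda from a transcription line."""
    return HARAKAT.sub("", text)


vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
plain = strip_diacritics(vocalized)
print(plain)
```

An explicit range is used here instead of dropping every combining mark, so that marks the transcribers kept deliberately for other reasons are not removed by accident.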
NO. | FEATURE IN MANUSCRIPT | TRANSCRIPTION DECISION | IMPLICATION/JUSTIFICATION
3.1 | Sīn with teeth vs. toothless Sīn (in Nastaliq) | The standard Sīn (س) is used, as Unicode offers only one rendition. | Unicode limitation results in normalization. This applies to variations in other glyphs as well.
3.2 | Sīn with an inverted ‘ب’ or dots underneath (scribal marks) | Transcribed as the standard Sīn (س) without the extra marks. | Scribal marks used only to indicate Sīn are ignored in the final transcription.
3.3 | Confused three-dot clusters (e.g., Yāʾ (ي) and Bāʾ (ب) written together with three dots to resemble Pay (پی)) | Transcribed according to linguistic context (e.g., Bī (بي written as پی) as بی instead of the visual پی). | Interpretation based on known words/grammar overrides a strictly diplomatic visual representation.
3.4 | Do Chashmi Hāʾ (ها) vs. Nastaliq Hāʾ (ہا) | The Nastaliq Hāʾ (used in Persian) is transcribed using the modern Persian Hāʾ (ها). | Modern Persian keyboards lack the Nastaliq Hāʾ, forcing a normalization to the most functionally equivalent character. This also applies to the Tā Marbūṭa variants (e.g., johd-12-465-g2.png vs. johd-12-465-g3.png).
3.5 | Initial/Middle/Final Hāʾ | Most cases follow the rendering from the modern Persian keyboard. | Avoids switching between Persian and Urdu keyboard Hāʾ variants to maintain consistency.
3.6 | Unusual glyphs, such as the triangular three-dot misra marker | Transcribed as closely as possible; for example, three consecutive dots (an ellipsis) stand in for the triangular three-dot misra marker. | No Unicode representation exists for the specific marker.
3.7 | Unauthorized connections | Letters that should be terminal but are connected by the scribe are transcribed as disconnected glyphs. | Unicode cannot represent the resulting non-standard connected glyphs.
3.8 | Confused three-dot letters (e.g., Thāʾ (ث) vs. toothless Shīn (ش)) | Differentiated and transcribed based on linguistic context. | A necessary interpretive judgment when single-toothed ث and ش are visually identical.
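
Entries 3.4 and 3.5 describe a preference for the Hāʾ produced by a modern Persian keyboard. If text from another source uses the Urdu-keyboard variant, it could be folded to the same code point before comparison with these transcriptions. This is a sketch only: it assumes the Nastaliq Hāʾ shown in entry 3.4 corresponds to U+06C1 (ARABIC LETTER HEH GOAL) and the modern Persian Hāʾ to U+0647 (ARABIC LETTER HEH), and `fold_heh` is a hypothetical name.

```python
HEH = "\u0647"       # ARABIC LETTER HEH, as typed on a Persian keyboard
HEH_GOAL = "\u06C1"  # ARABIC LETTER HEH GOAL, as typed on an Urdu keyboard


def fold_heh(text: str) -> str:
    """Replace the Urdu-keyboard Heh with the Persian-keyboard Heh."""
    return text.replace(HEH_GOAL, HEH)


folded = fold_heh("\u06C1\u0627")
print(folded == "\u0647\u0627")
```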
DOI: https://doi.org/10.5334/johd.465 | Journal eISSN: 2059-481X
Language: English
Page range: 69 - 69
Submitted on: Nov 9, 2025
Accepted on: May 5, 2026
Published on: May 26, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jonathan Parkes Allen, John Mullan, Lorenz Nigst, Mathew Barber, Taimoor Shahid-Khan, Masoumeh Seydi, Danlu Chen, Yufei Weng, Nikolai Vogler, Jacob Murel, Osama Eshera, Taylor Berg-Kirkpatrick, David Smith, Sarah Bowen Savant, Matthew Thomas Miller, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.