Table 1
Distribution of documents by language and type.
| | ARABIC | PERSIAN | URDU | OTTOMAN TURKISH | MALAY | MULTI-LANGUAGE | SUM |
|---|---|---|---|---|---|---|---|
| Manuscript | 121 | 44 | 0 | 26 | 1 | 8 | 200 |
| Publication | 0 | 0 | 8 | 0 | 0 | 0 | 8 |
| Sum | 121 | 44 | 8 | 26 | 1 | 8 | 208 |
[i] Note: The multi-language category includes documents tagged as Arabic-Persian, Ottoman Turkish-Arabic-Persian, Ottoman Turkish-Baleybelen, Ottoman Turkish-Persian, Arabic-Javanese, and Malay-Arabic. The eight documents contain pages in multiple languages.
Table 2
Canonical works in the ACDC training set.
| TEXT | LANGUAGE | MANUSCRIPTS | PAGES | DATE RANGE | GENRE | SCRIPTS REPRESENTED |
|---|---|---|---|---|---|---|
| Dīvān of Ḥāfiẓ | Persian | 7 | 37 | 1420–1587 | Poetry | Nastaliq, Ta’liq |
| Gulistān of Saʿdī | Persian | 17 | 157 | 1500–1841 | Poetry | Nastaliq, Ta’liq, Eurabic, Naskh |
| al-Qāmūs al-Muḥīṭ of al-Fīrūzābādī | Arabic | 12 | 43 | 1350–1700 | Dictionaries | Naskh, Sudani, Maghribi |
| Dalāʾil al-Khayrāt of al-Jazūlī | Arabic | 25 | 128 | 1650–1873 | Devotional works | Maghribi, Naskh, Naskh-Nastaliq interlinear, Maghribi-Sudani, Naskh-Nastaliq mixed |
| Sharḥ al-ʿAqāʾid al-Nasafīya of al-Taftāzānī | Arabic | 9 | 35 | 1441–1750 | Commentary/Theology | Naskh, Ta’liq, Nastaliq |
| Total | 2 languages | 70 | 400 | 1350–1873 | 4 genres | 9 scripts or combinations |
Table 3
Distribution of page images by language and document type.
| | ARABIC | PERSIAN | URDU | OTTOMAN TURKISH | MALAY | MULTI-LANGUAGE | SUM |
|---|---|---|---|---|---|---|---|
| Manuscript | 822 | 343 | 0 | 241 | 2 | 15 | 1,423 |
| Publication | 0 | 0 | 74 | 0 | 0 | 0 | 74 |
| Sum | 822 | 343 | 74 | 241 | 2 | 15 | 1,497 |
Table 4
Distribution of page images by script and language.[3]
| | ARABIC | PERSIAN | URDU | OTTOMAN TURKISH | MALAY | MULTI-LANGUAGE | SUM |
|---|---|---|---|---|---|---|---|
| Naskh | 732 | 19 | 1 | 229 | 0 | 8 | 989 |
| Nastaliq | 10 | 200 | 73 | 0 | 0 | 3 | 286 |
| Eurabic | 0 | 100 | 0 | 0 | 0 | 0 | 100 |
| Maghribi | 45 | 0 | 0 | 0 | 0 | 0 | 45 |
| Ta’liq | 12 | 9 | 0 | 5 | 0 | 0 | 26 |
| Naskh/Nastaliq mixed | 13 | 0 | 0 | 0 | 0 | 0 | 13 |
| Shikaste | 0 | 11 | 0 | 0 | 0 | 0 | 11 |
| Divani | 0 | 0 | 0 | 5 | 0 | 0 | 5 |
| Jawi Naskh | 0 | 0 | 0 | 0 | 2 | 3 | 5 |
| Yemeni Naskh | 4 | 0 | 0 | 0 | 0 | 0 | 4 |
| Maghribi/Sudani | 3 | 0 | 0 | 0 | 0 | 0 | 3 |
| Other | 3 | 4 | 0 | 2 | 0 | 1 | 10 |

Figure 1
A randomly selected line from one of the transcribed pages (line 39, Doc ID: 4060 Part ID: 913829). The line illustrates various normalization decisions by highlighting each manuscript feature and the transcription decision beneath it in the same color. Reading from right to left, it shows examples of the following features described in the tables above: 1.2/1.3 (pink), 3.4 (orange), 3.4 (purple), 2.6 (green), 2.1 (blue), 1.2/1.3 (pink), 2.7 (green), 3.4 (orange), and 3.6 (blue).
| NO. | FEATURE IN MANUSCRIPT | TRANSCRIPTION DECISION | IMPLICATION/JUSTIFICATION |
|---|---|---|---|
| 1.1 | Alif Maqṣūra (ى) | Transcribed as standard Yāʾ (ي). | Since the difference is primarily in pronunciation, the transcription has been normalized to not differentiate between ي and ى. This is also consistent with transcribing inconsistently dotted Yāʾ in Persian, as well as transcribing the elongated nastaliq Persian Yāʾ (ے) as dotless ى. See the two entries that follow. |
| 1.2 | Persian Yāʾ (ی) (used in Nastaliq for both sounds) | Often transcribed as the standard Persian Yāʾ (ی), even when the manuscript uses the form of the Urdu Barī Yāʾ (ے). | Normalization to the standard Persian keyboard/encoding for the Yāʾ sound. |
| 1.3 | Inconsistently dotted Yāʾ (in manuscripts) | Almost always transcribed without dots (ی), except in some cases involving interlinear Arabic text, where the dotted Yāʾ (ي) may be used. | Prioritizes the common Persian usage over scribal inconsistency. |
| 1.4 | Kāf-like Gāf | Inconsistently normalized; sometimes transcribed as Gāf (گ) even when written like Kāf (ک). | Normalization depends on the transcriber’s judgment based on linguistic context. |
| 1.5 | Hamza on a Yāʾ seat (e.g., ئ) | Inconsistently normalized; transcribed as either a plain Yāʾ (ی/ي) for Persian or the full Hamza-on-Yāʾ (ئ) for Arabic. | Cross-Linguistic Variation: Inconsistency arises from competing orthographic norms. Transcribers following Arabic conventions often retain the Hamza (سائر), while those following modern Persian standards often simplify it to a plain Yāʾ (سایر). This reflects the hybrid nature of the dataset’s linguistic tradition. |
| 1.6 | Alif Madda (آ) used as a long vowel (e.g., غآيب) | Transcribed as a standard Alif (e.g., غآيب is transcribed as غايب). The madda is ignored where Alif Madda writes /ā/ rather than /ʾā/, a usage that is particularly common in words with the sequence /ā/ + hamza + /i/. | Phonetic Normalization: In these instances, the madda marks the long vowel /ā/ rather than a glottal stop. Normalizing to a standard Alif aligns with modern Persian/Arabic conventions, helps the HTR model associate the visual stroke with the correct word, and avoids a character combination that may be non-standard or difficult to render. |
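
For readers who want to compare their own text against these transcriptions, rows 1.1 and 1.2 can be read as code-point substitutions. The following is a minimal sketch, not the dataset's own tooling: the names `YEH_RULES`, `apply_yeh_rules`, and `fold_yeh` are hypothetical, and folding every Yāʾ variant to the Persian Yāʾ (U+06CC) is an assumption made only for comparison purposes, since rows 1.3 and 1.5 note that the transcriptions themselves are not fully consistent.

```python
# Hypothetical helpers illustrating rows 1.1 and 1.2; not part of the dataset tooling.

# Substitutions stated in the table:
#   row 1.1: Alif Maqsura (U+0649) -> Arabic Yeh (U+064A)
#   row 1.2: Yeh Barree (U+06D2)   -> Farsi Yeh (U+06CC)
YEH_RULES = {
    "\u0649": "\u064A",
    "\u06D2": "\u06CC",
}

# Assumption: these four code points are treated as "the same letter"
# when two strings are compared loosely.
YEH_VARIANTS = "\u064A\u0649\u06CC\u06D2"


def apply_yeh_rules(text: str) -> str:
    """Apply only the substitutions listed in rows 1.1 and 1.2."""
    return text.translate(str.maketrans(YEH_RULES))


def fold_yeh(text: str, target: str = "\u06CC") -> str:
    """Collapse every Yeh variant to one code point (default: Farsi Yeh)."""
    return text.translate({ord(ch): target for ch in YEH_VARIANTS})


if __name__ == "__main__":
    # The same word typed with an Arabic Yeh and with a Farsi Yeh
    # compares equal only after folding.
    arabic_keyboard = "\u0628\u064A"
    persian_keyboard = "\u0628\u06CC"
    print(arabic_keyboard == persian_keyboard)
    print(fold_yeh(arabic_keyboard) == fold_yeh(persian_keyboard))
```

Keeping the table-derived substitutions separate from the looser folding makes it clear which behavior comes from the guidelines and which is a convenience for evaluation.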
| NO. | FEATURE IN MANUSCRIPT | TRANSCRIPTION DECISION | IMPLICATION/JUSTIFICATION |
|---|---|---|---|
| 2.1 | Diacritics (Tashdīd and Vowels) | Not transcribed (omitted). | A typical practice for the dataset, prioritizing text content over full vocalization/diacritization. Some transcriptions may contain diacritics, such as the tashdīd (ّ). |
| 2.2 | Hamza on the line (ء) | Omitted where absent in the manuscript; missing hamzas on the line at the end of words are not added (e.g., transcribed as إبقا instead of إبقاء). | A diplomatic choice to not normalize missing hamzas. |
| 2.3 | Catchwords | Not transcribed (omitted). | Catchwords are often excluded from the main text transcription, but some pages may have them transcribed. |
| 2.4 | Hard-to-distinguish Letters (e.g., Rāʾ (ر) vs. Dāl (د), Wāw (و) vs. Dāl (د)) | Differentiated and transcribed based on linguistic context. | A necessary interpretive judgment due to the cursive and varied nature of handwriting. |
| 2.5 | Letters without dots (Iʿjām) | Dots are added contextually (e.g., ٮلال as بلال or حسٮن as حسين). | Normalization to make the text readable; characters without dots are transcribed based on linguistic context. |
| 2.6 | Rubrication Marks (overlines, underlines, color) | Not represented; transcribed as regular text. | These codicological features are not captured in the transcription layer. |
| 2.7 | Nastaliq Shorthands | Cannot be represented in Unicode; transcribed as the full, contextually correct word/letters. | Unicode limitations necessitate normalization. |
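
Row 2.1 implies that text compared against these transcriptions should usually have its vowel marks removed first. A minimal sketch of that step follows, assuming the relevant marks are the Arabic harakat in the range U+064B to U+0652; the function name `strip_diacritics` and the `keep_shadda` option are hypothetical, the latter reflecting the note that some transcriptions retain the tashdīd.

```python
import re

# Hypothetical helper illustrating row 2.1; not part of the dataset tooling.
# Assumption: "diacritics" means the harakat U+064B..U+0652
# (tanwin, fatha, damma, kasra, shadda, sukun).
HARAKAT = re.compile("[\u064B-\u0652]")
# Same range with the shadda/tashdid (U+0651) left out.
HARAKAT_EXCEPT_SHADDA = re.compile("[\u064B-\u0650\u0652]")


def strip_diacritics(text: str, keep_shadda: bool = False) -> str:
    """Remove vowel marks; optionally keep the tashdid."""
    pattern = HARAKAT_EXCEPT_SHADDA if keep_shadda else HARAKAT
    return pattern.sub("", text)


if __name__ == "__main__":
    vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
    print(strip_diacritics(vocalized))
    print(strip_diacritics(vocalized, keep_shadda=True))
```

The explicit range leaves precomposed letters such as آ and ئ untouched, which matters because rows 1.5 and 1.6 treat those separately.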
| NO. | FEATURE IN MANUSCRIPT | TRANSCRIPTION DECISION | IMPLICATION/JUSTIFICATION |
|---|---|---|---|
| 3.1 | Sīn with teeth vs. toothless Sīn (in Nastaliq) | The standard Sīn (س) is used, as Unicode offers only one rendition. | Unicode limitation results in normalization. This applies to variations in other glyphs as well. |
| 3.2 | Sīn with an inverted ‘ب’ or dots underneath (scribal marks) | Transcribed as the standard Sīn (س) without the extra marks. | Scribal marks used only to indicate Sīn are ignored in the final transcription. |
| 3.3 | Confused three-dot clusters (e.g., Yāʾ (ي) and Bāʾ (ب) written together with three dots to resemble Pay (پی)) | Transcribed according to linguistic context (e.g., Bī (بي) written to look like پی is transcribed as بی rather than the visual پی). | Interpretation based on known words/grammar overrides a strictly diplomatic visual representation. |
| 3.4 | Do Chashmi Hāʾ (ها) vs. Nastaliq Hāʾ (ہا) | The Nastaliq Hāʾ (used in Persian) is transcribed using the modern Persian Hāʾ (ها). | Modern Persian keyboards lack the Nastaliq Hāʾ, forcing a normalization to the most functionally equivalent character. This also applies to the Tā Marbūṭa variants. |
| 3.5 | Initial/Middle/Final Hāʾ | Most cases follow the rendering from the modern Persian keyboard. | Avoids switching between Persian and Urdu keyboard Hāʾ variants to maintain consistency. |
| 3.6 | Unusual glyphs, such as the triangular three-dot misra marker | Transcribed as closely as possible; for example, the triangular three-dot misra marker is transcribed as three consecutive dots (an ellipsis). | No Unicode representation exists for the specific marker. |
| 3.7 | Unauthorized Connections | Letters that should be terminal but are connected by the scribe are transcribed as disconnected glyphs. | Unicode cannot represent the resulting non-standard connected glyphs. |
| 3.8 | Confused three-dot letters (e.g., Thāʾ (ث) vs. toothless Shīn (ش)) | Differentiated and transcribed based on linguistic context. | A necessary interpretive judgment when single-toothed ث and ش are visually identical. |
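
Because rows 3.4 and 3.5 say that most, but not all, Hāʾ forms follow the modern Persian keyboard, it can be useful to locate the exceptions in a transcription rather than silently rewrite them. The sketch below flags such code points for review. It is an illustration under stated assumptions: the `FLAGGED` table and `find_flagged` function are hypothetical, and the only entry listed is the Urdu-keyboard Heh Goal (U+06C1), which corresponds to the Nastaliq Hāʾ shown in row 3.4.

```python
import unicodedata

# Hypothetical review aid illustrating rows 3.4 and 3.5; not part of the dataset tooling.
# Assumption: a transcription typed on a Persian keyboard would use U+0647
# where an Urdu keyboard produces U+06C1.
FLAGGED = {
    "\u06C1": "rows 3.4/3.5: Persian-keyboard Heh (U+0647) expected",
}


def find_flagged(text: str) -> list[tuple[int, str, str, str]]:
    """Return (index, code point, Unicode name, note) for each flagged character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch), FLAGGED[ch])
        for i, ch in enumerate(text)
        if ch in FLAGGED
    ]


if __name__ == "__main__":
    # Heh Goal + Alef, as an Urdu keyboard would type the sequence in row 3.4.
    for hit in find_flagged("\u06C1\u0627"):
        print(hit)
```

Reporting positions instead of replacing characters leaves the decision to the transcriber, in keeping with the guideline's wording that only most cases follow the Persian keyboard.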
