OpenITI MAKHZAN: An Open Annotated Dataset of Arabic, Persian, Ottoman Turkish, and Urdu Print and Manuscript Data

Jonathan Parkes Allen; John Mullan; Lorenz Nigst; Mathew Barber; Taimoor Shahid-Khan; Masoumeh Seydi; Danlu Chen; Yufei Weng; Nikolai Vogler; Jacob Murel; Osama Eshera; Taylor Berg-Kirkpatrick; David Smith; Sarah Bowen Savant; Matthew Thomas Miller

doi:10.5334/johd.465

Volume 12 (2026): Issue 1

Open Access

May 2026

Abstract

The OpenITI MAKHZAN dataset is a large aggregation of Arabic-script ground truth and evaluation data drawn from a wide variety of Persian, Arabic, Ottoman Turkish, and Urdu scribal print and handwritten (manuscript) documents. Comprising nearly 1,500 page images across 208 documents sourced from 30 repositories worldwide, the dataset spans seven languages, around 20 unique and mixed script types, and a chronological range from the 10th to the 20th century. The dataset is openly available on Zenodo. This article describes the types of data it contains, explains how the data were compiled and verified, and suggests potential use cases, such as training and evaluating new print and handwritten transcription models.


DOI: https://doi.org/10.5334/johd.465 | Journal eISSN: 2059-481X

Language: English

Page range: 69 - 69

Submitted on: Nov 9, 2025

Accepted on: May 5, 2026

Published on: May 26, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: HTR, OCR, Arabic, Persian, Ottoman Turkish, Urdu

© 2026 Jonathan Parkes Allen, John Mullan, Lorenz Nigst, Mathew Barber, Taimoor Shahid-Khan, Masoumeh Seydi, Danlu Chen, Yufei Weng, Nikolai Vogler, Jacob Murel, Osama Eshera, Taylor Berg-Kirkpatrick, David Smith, Sarah Bowen Savant, Matthew Thomas Miller, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.