OpenITI MAKHZAN: An Open Annotated Dataset of Arabic, Persian, Ottoman Turkish, and Urdu Print and Manuscript Data

Open Access | May 2026

Abstract

The OpenITI MAKHZAN dataset is a large aggregation of Arabic-script ground truth and evaluation data drawn from a wide variety of Persian, Arabic, Ottoman Turkish, and Urdu scribal print and handwritten (manuscript) documents. Comprising nearly 1,500 page images across 208 documents sourced from 30 repositories worldwide, the dataset spans seven languages, around 20 unique and mixed script types, and a chronological range from the 10th to the 20th century. The dataset is openly available on Zenodo. This article describes the types of data in the dataset, explains how the data were compiled and verified, and suggests potential use cases, such as the training and evaluation of new transcription models for print and handwritten text.

DOI: https://doi.org/10.5334/johd.465 | Journal eISSN: 2059-481X
Language: English
Page range: 69 - 69
Submitted on: Nov 9, 2025
Accepted on: May 5, 2026
Published on: May 26, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jonathan Parkes Allen, John Mullan, Lorenz Nigst, Mathew Barber, Taimoor Shahid-Khan, Masoumeh Seydi, Danlu Chen, Yufei Weng, Nikolai Vogler, Jacob Murel, Osama Eshera, Taylor Berg-Kirkpatrick, David Smith, Sarah Bowen Savant, Matthew Thomas Miller, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.