ARletta. Open-Source Handwritten Text Recognition Models for Historic Dutch

Lith Lefranc; Ilja Van Damme; Thibault Clérice; Mike Kestemont

doi:10.5334/johd.225

Abstract

We release ARletta, a series of open-source models for the automated transcription of historic Dutch-language handwritten sources. Such models have remained a desideratum in the scholarly community until now. All models presented were trained on publicly available data using the open-source kraken engine. Our work focuses on the digitization of a large-scale collection of local police reports (1876–1945). Additionally, we include a supermodel trained on the union of other Dutch-language datasets (extending back to the 17th century), which we hope will be useful as a foundational model for future projects. Our results demonstrate performance that is competitive with proprietary software solutions.

