(1) Overview
Repository location
Context
Our need for a Dutch-language model for handwritten text recognition (HTR) arose in the context of a data-driven research project based on the historic incident books (‘gebeurtenisboeken’) of Antwerp, Belgium.1 In these books, local policemen reported meticulously on their neighborhood patrols and documented many aspects of urban street life (see example in Figure 1 and Table 1).

Figure 1
Example of an incident (FelixArchief, 731#1604, p. 211).
Table 1
Transcription of Figure 1 and translation to English (FelixArchief, 731#1604, p. 211).
| MARGINAL TEXT | MAIN TEXT |
|---|---|
| 1 Mei 1885 Mr Sergoynne à verbaliser D | Om 9 1/2 uren ‘s avonds heb ik gezien dat de meid uit het huis n° 57 der Leopoldst matten uitklopte voor hare woning tegen den muer van het gasthuis toen ik er naartoe ging liep zy binnen en is niet meer buiten gekomen (zij veroorzaakte eenen over vloed van stof) Broes, Dymphna, 16 jaar, geb. te Minderhout, meid Wonende Leopold str. 57. Comme cette fille est ci jeune et qu’elle savait pas mieux nous n’y avons pas donné suite [onleesbare handtekening] |
| 1 May 1885 Mr Sergoynne draw up an official report D | At 9 1/2 hours in the evening I saw that the maid from the house N° 57 in the Leopold street was beating rugs against the wall of the guesthouse when I came over she ran away inside the house and didn’t come outside any more (she caused a flood of dust). Broes, Dymphna, 16 years, born in Minderhout, maid residing in Leopoldstr 57. Because she was so young and she didn’t know better, we didn’t follow up on this. [unreadable signature-mark] |
For our research project, we scanned 271 incident books (87,594 pages containing text) preserved by the Antwerp City Archives, covering the period from 1876 up to 1945. A ground truth training set of 15,000 words is generally considered adequate for obtaining satisfying HTR results (Muehlberger et al., 2019). This proved insufficient in the case of the incident books, as the pages contain many different handwriting styles and display irregularities in layout. Additionally, the incidents feature French passages as a consequence of Belgian language policies at the time. Because existing solutions fell short, we developed new models for layout analysis, line segmentation, and transcription with the open-source kraken engine, using only publicly available data.
(2) Method
Steps
Assets
Our project depended on the following assets:
Manu: At the end of 2023, no models were publicly available in the kraken model repository (Kiessling, 2020) for the automated transcription of handwritten sources in historic Dutch, except for the multilingual medieval CATMuS model, which was partially trained on Middle Dutch (Clérice et al., 2023). Of all the models on HTR-United, Manu McFrench came closest to our Antwerp source material because it was trained on French materials with a similar temporal scope (Chagué et al., 2023).
IJsberg: Two publicly available training sets created by the Dutch National Archives seemed valuable to us because they are close to our source material in language, period, and administrative character (Keijser, 2020):
VOC: 4,735 pages from the 17th- and 18th-century archive of the Dutch East India Company (VOC)
Notarial: 1,615 pages of 19th-century Dutch notarial deeds
Antwerp incident books: We prepared a ground truth dataset from the Antwerp incident books. One expert (a PhD student) annotated an initial subset of the data, consisting of 271 randomly selected pages.2 This subset was later supplemented with 3,444 pages annotated by students.3 The student-contributed materials are much larger in scope than those annotated by the expert, but their quality is unfortunately less consistent.
Layout analysis and line segmentation
In terms of layout, the Antwerp material contains two region types of interest: marginal text (marginalia) and blocks of main text (paragraph) (see Figure 3).4 Using the YALTAi library (Clérice, 2023), we trained a YOLOv8 model to automatically extract these regions from a page and to limit the line segmentation to these regions. For the line segmentation, kraken’s default BLLA model produced acceptable results, but we lightly finetuned it on the Antwerp data to alleviate some ad hoc issues in the data. We kept all baselines in the original datasets but repolygonized them to guarantee their compatibility with the kraken framework. In HTR, baselines are annotated as straightforward geometric lines and are therefore fairly portable across different platforms. They are used to determine a suitable polygon around each line, maximizing the inclusion of ascenders and descenders of characters while minimizing the overlap with glyphs from neighbouring lines. The resulting polygons nevertheless tend to be quite irregular in shape and are much less portable across transcription engines. The repolygonization step for kraken was therefore essential (cf. Pinche, 2023; see Figures 2 and 3).
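By way of illustration, the sketch below shows how such a region detector could be trained and applied with the ultralytics YOLOv8 package, on which YALTAi builds; the dataset configuration file, class names and hyperparameters are illustrative assumptions rather than the settings of our released model.

```python
# Illustrative sketch: training a YOLOv8 detector for the two region types
# (marginalia and main text) with the ultralytics package, which YALTAi
# builds on. 'regions.yaml', the class names and the hyperparameters are
# assumptions for demonstration purposes only.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # start from a pretrained checkpoint

# 'regions.yaml' would list the image folders and the two classes
# (e.g. the SegmOnto zones MainZone and MarginTextZone).
model.train(data="regions.yaml", epochs=100, imgsz=640, batch=8)

# Detect regions on a new page scan; the boxes can then be used to
# restrict kraken's line segmentation to the detected regions.
results = model.predict("page_scan.jpg")
for box in results[0].boxes:
    print(int(box.cls), box.xyxy.tolist())
```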

Figure 2
Visualization of the training process.

Figure 3
An example of a page with annotated text regions (green) and baselines (pink) (FelixArchief, MA#17612, p. 85).
All transcriptions were encoded in UTF-8 and we deliberately chose NFKD normalization, meaning that complex glyphs, such as those involving diacritics, were decomposed into their constituent glyphs. The glyph <é>, for instance, can be represented in Unicode as a single precomposed glyph (U+00E9) or as the combination of a plain <e> (U+0065) and a combining acute accent (U+0301). We wanted to capture the intuition that the prediction <e> (U+0065) for <é> would still be a relatively better prediction than a wholly unrelated character.
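This behaviour can be reproduced with Python’s standard unicodedata module:

```python
import unicodedata

composed = "\u00e9"                                   # <é> as a single code point
decomposed = unicodedata.normalize("NFKD", composed)  # decomposed form

print([hex(ord(c)) for c in composed])    # ['0xe9']
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']: <e> + combining acute
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: recomposes to <é>
```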
Handwritten text recognition
Our project focused on HTR for the Antwerp incident books. Given the considerable diachronic spread of the available training material, we experimented with the following strategies:
– Finetuning: we sequentially finetuned a series of models, starting with the oldest materials and gradually proceeding to the more recent periods. We indicate such a pretraining workflow with arrows, e.g. “Manu: VOC → Notarial”
– Union: we experimented with joining datasets, by training on the union of the available materials. We indicate such a union approach with + signs, e.g. “Manu: VOC + Notarial”
– Scratch: we trained on a subset of the available calibration assets without making use of a foundation model, relying instead on randomly initialized model weights.5
Models were trained on a single GPU (GeForce RTX 2080 Ti) using a learning rate of 1e-4, a batch size of 12 and early stopping in combination with learning rate reduction (for a maximum of 50 epochs and a patience of 3 epochs). When finetuning models, the backbone was frozen during the initial 5K steps and we applied a warmup of 10K steps. We evaluated our models with the character accuracy rate (CAR). Test scores were calculated with the CERberus tool (see Table 3), which offers options for relaxed evaluations (Haverals & Kestemont, 2023; Haverals, 2024). For all datasets, we split off a random, unstratified validation set (10% of the available data) at the level of individual pages; the remaining 90% of the pages were used for training. For the Antwerp dataset, a dedicated test set of 101 pages (annotated by the expert only) was created, on which all models were evaluated (see Table 2).
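As a minimal sketch, one step of such a finetuning chain (e.g. “Manu: VOC → Notarial”) could be launched through kraken’s ketos command-line tool as shown below; all paths are placeholders and option names may differ slightly between kraken versions, so the call should be checked against `ketos train --help`.

```python
# Indicative sketch of one step in the sequential finetuning chain,
# launched through kraken's ketos CLI. Paths are placeholders and option
# names may differ between kraken versions.
import glob
import subprocess

cmd = [
    "ketos", "train",
    "-f", "xml",                      # ground truth as ALTO/PAGE XML
    "-i", "manu_voc_best.mlmodel",    # start from the model finetuned on VOC
    "--resize", "union",              # grow the output layer to cover new characters
    "-u", "NFKD",                     # Unicode normalization of the transcriptions
    "-r", "1e-4",                     # learning rate
    "-B", "12",                       # batch size
    "--lag", "3",                     # early-stopping patience (epochs)
    "--warmup", "10000",              # learning-rate warmup (steps)
    "--freeze-backbone", "5000",      # keep the backbone frozen initially (steps)
    "-o", "manu_voc_notarial",        # output model prefix
]
cmd += sorted(glob.glob("notarial/train/*.xml"))  # placeholder training pages

subprocess.run(cmd, check=True)
```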
Table 2
Train-validation splits for each dataset.
| DATASET | SUBSET | PAGES | REGIONS | LINES | LINE LENGTH* | CHARACTERS | VOCABULARY** |
|---|---|---|---|---|---|---|---|
| VOC | train | 4261 | 6079 | 132611 | 36.37 (±17.84) | 4823321 | 120 |
| | valid | 474 | 655 | 15154 | 35.79 (±17.57) | 542321 | 101 |
| Notarial | train | 1453 | 3624 | 92003 | 35.03 (±20.03) | 3222690 | 107 |
| | valid | 162 | 377 | 9554 | 35.78 (±18.67) | 341841 | 96 |
| Antw-expert | train | 243 | 1828 | 10766 | 30.63 (±17.10) | 329806 | 89 |
| | valid | 28 | 209 | 1260 | 29.59 (±16.77) | 37288 | 83 |
| Antw-students | train | 3099 | 27496 | 145387 | 29.00 (±17.76) | 4216129 | 118 |
| | valid | 345 | 3089 | 16196 | 28.92 (±17.85) | 468359 | 106 |
| Antw-test | test | 101 | 715 | 4628 | 30.18 (±17.09) | 139658 | 90 |
* Line length expressed in characters (including whitespace). Mean and standard deviation are reported.
** Number of unique characters in the character vocabulary.
Table 3
HTR training results (CAR).
| MODEL | VALIDATION* | ANTW-TEST | ANTW-TEST (RELAXED)** |
|---|---|---|---|
| Manu (base model) | NA | 70.51% | 72.24% |
| Manu: VOC | 94.36% | 76.57% | 78.95% |
| Manu: VOC → Notarial | 95.68% | 83.04% | 85.22% |
| Manu: VOC → Notarial → Antw-expert | 91.47% | 90.01% | 91.57% |
| Manu: VOC → Notarial → Antw-expert + Antw-students | 89.97% | 92.58% | 93.97% |
| Antw-expert + Antw-students (scratch) | 89.05% | 91.54% | 92.88% |
| Manu: VOC + Notarial + Antw-expert + Antw-students (super model) | 92.78% | 92.31% | 93.69% |
*Note that the validation scores cannot be directly compared across rows, since each configuration uses a different validation set; they are nevertheless useful because they give the reader a sense of in-domain model fit.
**Relaxed evaluation: whitespace, punctuation and capitalization are excluded.
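For clarity, the sketch below illustrates how a character accuracy rate and its relaxed variant can be computed from an edit distance; it mirrors the idea behind the CERberus evaluation but is not its actual implementation.

```python
# Illustration of CAR (character accuracy rate) and a relaxed variant that
# ignores whitespace, punctuation and capitalization. Not CERberus itself.
import string

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def car(reference: str, hypothesis: str) -> float:
    """Character accuracy rate = 1 - character error rate."""
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)

def relaxed(text: str) -> str:
    """Drop whitespace and punctuation, ignore capitalization."""
    drop = set(string.whitespace) | set(string.punctuation)
    return "".join(c.lower() for c in text if c not in drop)

ref = "Om 9 1/2 uren 's avonds"
hyp = "om 9 1/2 uren s avonds,"
print(f"CAR:         {car(ref, hyp):.2%}")
print(f"CAR relaxed: {car(relaxed(ref), relaxed(hyp)):.2%}")
```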
In the finetuned model stack, we observe a gradual improvement in CAR as the materials come closer in time to the Antwerp dataset (see Table 3). Combining the expert and student annotations in the Antwerp materials is beneficial, despite the lower transcription quality of the latter; this benefit presumably stems from the sheer size of the student annotations. The finetuning approach, finally, outperforms both training from scratch and the super model on the Antwerp test set, although not by a large margin. The super model nevertheless performed acceptably, suggesting its potential usefulness as a foundation model for other applications in this domain in the future.
(3) Dataset Description
Repository name
Zenodo
Object name
ARletta
Format names and versions
data: jpg, xml, txt; models: pt, mlmodel; software: py
Creation dates
2021.10.01–2024.04.15
Dataset creators
Training data: Lith Lefranc, Oliver Bogaerts, Louise Ledegen; Models: Mike Kestemont
Language
English
License
CC-BY-NC-SA
Publication date
2024.05.16
(4) Reuse Potential
The layout and HTR models are applicable to images of serial historical sources containing (Dutch-language) handwritten text. They are particularly tailored to administrative reports from the nineteenth and twentieth centuries whose formatting breaks down into marginal and main body text. The baseline segmentation and HTR models can be applied to images of these sources through the kraken engine or through the eScriptorium platform (Kiessling et al., 2019). The text region model can only be used within YOLO/YALTAi. All of the annotated data can be used for training new models. We released the filenames of our training and validation splits to ensure reproducibility.
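As an indication of how the released models could be applied programmatically, the outline below uses kraken’s Python API; the file names are placeholders and module paths and signatures vary between kraken versions, so it should be read as a sketch rather than a drop-in script.

```python
# Outline of applying the released segmentation and HTR models with
# kraken's Python API. File names are placeholders; the exact API
# (module paths, return types) differs between kraken versions.
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models, vgsl

im = Image.open("incident_book_page.jpg")

# Load the finetuned baseline segmentation model and the HTR model.
seg_model = vgsl.TorchVGSLModel.load_model("arletta_segmentation.mlmodel")
rec_model = models.load_any("arletta_htr.mlmodel")

# Baseline segmentation, followed by recognition of each detected line.
segmentation = blla.segment(im, model=seg_model)
for record in rpred.rpred(rec_model, im, segmentation):
    print(record.prediction)
```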
Notes
[1] PhD research project on nocturnal street life in Antwerp (1876–1940) through data-driven methods by Lith Lefranc under the supervision of Ilja Van Damme and Mike Kestemont.
[2] This random selection was stratified, in that each incident book is represented by at least a single page.
[3] These were not randomly selected; instead, consecutive pages from incident books were chosen to suit the students’ research assignments: FelixArchief, 450#58, 731#1592, 731#1596, 731#1598, 731#1612, 731#1613, 731#1614, 731#1626, 731#1639, MA#17871, MA#28935, MA#28937, MA#30824, MA#30825, MA#30827, MA#30828, MA#30844. The students annotated the sources as part of an undergraduate module at the University of Antwerp on learning how to transcribe and use historical sources.
[4] We applied the standardized region nomenclature proposed in the SegmOnto vocabulary (Gabay et al., 2023).
Acknowledgements
Thanks to Wouter Haverals, Jef Peeraer, FelixArchief and the students of the course Historische Oefeningen 2 (2021–2023, University of Antwerp) supervised by Ilja Van Damme, Wout Saelens, Egon Bauwelink, and Lith Lefranc.
Funding Information
This dataset results from a research project funded by the University of Antwerp (GOA FFB200403).
Competing Interests
The authors have no competing interests to declare.
Author Information
– Lith Lefranc: Conceptualization; Data Curation; Formal Analysis; Project Administration; Resources; Validation; Writing – original draft
– Ilja Van Damme: Conceptualization; Funding Acquisition; Supervision; Writing – review & editing
– Thibault Clérice: Software
– Mike Kestemont: Conceptualization; Data Curation; Formal Analysis; Funding Acquisition; Methodology; Software; Supervision; Validation; Writing – original draft
