Table 1
Overview of benchmark datasets and tasks. A companion data paper provides a more detailed table of the available datasets (Hindermann, Marti, Kasper, & Bosse, 2026).
| Bibliographic Data: Metadata extraction from bibliographies | |
| Data | 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each. |
| Content | Pages from Bibliography of Works in the Philosophy of History, 1945–1957. |
| Design | The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Blacklist Cards: NER extraction from index cards | |
| Data | 33 images, JPG, approx. 1788 × 1305 px, 590 KB each. |
| Content | Swiss federal index cards for companies on a British black list for trade (1940s). |
| Design | The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Book Advert XMLs: Data correction of LLM-generated XML files | |
| Data | 50 JSON files containing XML structures. |
| Content | Faulty LLM-generated XML structures from historical sources. |
| Design | This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Business Letters: NER extraction from correspondence | |
| Data | 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each. |
| Content | Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft. |
| Design | This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Company Lists: Company data extraction from list-like materials | |
| Data | 15 images, JPG, approx. 1868 × 2931 px, 360 KB each. |
| Content | Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland. |
| Design | This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Fraktur Adverts: Page segmentation and Fraktur text transcription | |
| Data | 5 images, JPG, approx. 5000 × 8267 px, 10 MB each. |
| Content | Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland. |
| Design | This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image down-scaling has a limited effect on model performance. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Library Cards: Metadata extraction from multilingual sources | |
| Data | 263 images, JPG, approx. 976 × 579 px, 10 KB each. |
| Content | Library cards with dissertation thesis information. |
| Design | This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Medieval Manuscripts: Handwritten text recognition | |
| Data | 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each. |
| Content | Pages from Pilgerreisen nach Jerusalem 1440 und 1453. |
| Design | This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
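Several of the benchmarks above replace or supplement a natural-language prompt with a predefined Pydantic output schema (see the Bibliographic Data design notes). A minimal sketch of what such a schema could look like is given below; the class and field names are illustrative assumptions, not the benchmark's actual schema.

```python
from typing import List, Optional
from pydantic import BaseModel

# Hypothetical output schema for extracting structured bibliographic
# entries from a scanned page; field names are illustrative only.
class BibliographicEntry(BaseModel):
    author: str
    title: str
    year: Optional[int] = None  # year may be missing on the page

class BibliographyPage(BaseModel):
    entries: List[BibliographicEntry]

# Constructing an instance validates the data against the schema.
page = BibliographyPage(entries=[
    BibliographicEntry(author="Doe, J.", title="On History", year=1950)
])
```

Passing a schema of this kind as the model's structured-output format is what allows a benchmark to test structural inference while omitting an explicit prompt.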

Figure 1
Images in images/ are sent to an LLM with a prompt from prompts/ and compared against a ground truth file in ground_truths/. Each result follows the structure defined by dataclass.py and is scored using the implementation in benchmark.py.
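The evaluation flow in Figure 1 can be sketched as a simple loop. The directory names follow the caption; `call_llm` and the toy `score` function are placeholders for illustration, not the framework's actual API in benchmark.py.

```python
import json
from pathlib import Path

def score(result: dict, truth: dict) -> float:
    # Toy field-level accuracy, illustrative only: fraction of keys
    # on which the model output and the ground truth agree.
    keys = set(result) | set(truth)
    return sum(result.get(k) == truth.get(k) for k in keys) / max(len(keys), 1)

def run_benchmark(root: Path, call_llm) -> dict:
    # Sketch of the Figure 1 pipeline: each image in images/ is sent
    # to the LLM together with a prompt from prompts/, and the result
    # is compared against the matching file in ground_truths/.
    scores = {}
    prompt = (root / "prompts" / "prompt.txt").read_text()
    for image in sorted((root / "images").glob("*.jpg")):
        result = call_llm(image, prompt)  # structured output per dataclass.py
        truth = json.loads((root / "ground_truths" / f"{image.stem}.json").read_text())
        scores[image.stem] = score(result, truth)
    return scores
```

In the actual framework the result object follows the structure defined by dataclass.py and the scoring logic lives in benchmark.py; this sketch only mirrors the data flow.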

Listing 1
The benchmark class (without actual implementation) from the Library Cards benchmark v0.4.1.

Listing 2
The Pydantic data class (without docstrings) from the Library Cards benchmark v0.4.1.

Listing 3
The ground truth for card 00624122 (see Figure 2) from the Library Cards benchmark v0.4.1.

Figure 2
The card 00624122 from the Library Cards benchmark v0.4.1.

Figure 3
Two letterheads and a valediction from the Business Letters benchmark v0.4.1. The strings “Max Oettinger” (Figure 3a), “Artur Oettinger-Meili” (Figure 3b), and “A. Oettinger” (Figure 3c) are name variants that all refer to the same individual, Artur Oettinger-Meili. He served as managing director (Geschäftsführer) of the Basler Personenschifffahrtsgesellschaft from 1925 to 1938; the variant “Max” results from an error on the part of the senders.

Figure 4
Screenshot from the front-end.7 The graph shows cumulative test results of the Business Letters benchmark v0.4.1 over time. Connecting lines show test runs of the same models on different dates, with colors indicating providers. The graph is interactive; only one part is shown here.

Figure 5
Comparison of average model scores of Google and OpenAI across all benchmarks. The results show substantial task-specific variance: Google achieves higher scores on Fraktur Adverts (77.1 versus OpenAI 20.6), Medieval Manuscripts (63.3 versus OpenAI 51.4), and Library Cards (82.5 versus OpenAI 77.2), while OpenAI performs better on Bibliographic Data (57.7 versus Google 36.1) and Book Advert XMLs (85.8 versus Google 80.5). Differences are less pronounced (under five points) for Blacklist Cards (Google 89.9, OpenAI 85.5), Business Letters (OpenAI 53.2, Google 49.7), and Company Lists (Google 42.6, OpenAI 39.6). This comparison underscores the need for systematic evaluation of humanities-relevant LLM tasks.
