
Figure 1
Execution flow of a benchmark test run.

Figure 2
Overview of the Bibliographic Data benchmark in the visualization frontend.

Figure 3
Best performance per model and benchmark dataset, taking the maximum score over all recorded configurations. Null values (N/A) indicate tests that have not yet been run or can no longer be run because the corresponding model has been deprecated.
Table 1
Overview of benchmark datasets, tasks, and scoring schemes.
| BIBLIOGRAPHIC DATA: METADATA EXTRACTION FROM BIBLIOGRAPHIES | |
|---|---|
| Data | 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each. |
| Content | Pages from Bibliography of Works in the Philosophy of History, 1945–1957. |
| Source | http://www.jstor.org/stable/2504495 |
| Language | English |
| Content Type | Printed text. Directory-like. |
| Task | Transcription; metadata extraction. |
| Expected output | JSON list of structured items; output defined via Pydantic class (see the schema sketch after Table 1). |
| Scoring | Fuzzy matching on each key in the JSON items (see the scoring sketch after Table 1). |
| Test runs | 77 runs. |
| Best result | 71.43%, OpenAI, gpt-4o. |
| Design | The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| BLACKLIST CARDS: NER EXTRACTION FROM INDEX CARDS | |
| Data | 33 images, JPG, approx. 1788 × 1305 px, 590 KB each. |
| Content | Swiss federal index cards for companies on a British black list for trade (1940s). |
| Source | https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 (last accessed 2026-01-12) |
| Language | Mostly German, some French |
| Content Type | Typed and handwritten text. Stamped dates. |
| Task | Transcription; metadata extraction. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | Fuzzy matching on each key in the JSON object. |
| Test runs | 38 runs. |
| Best result | 95.65%, OpenAI, gpt-4.1-mini. |
| Design | The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| BOOK ADVERT XMLS: DATA CORRECTION OF LLM-GENERATED XML FILES | |
| Data | 50 JSON files containing XML structures. |
| Content | Faulty LLM-generated XML structures from historical sources. |
| Source | Book advertisements from the digitized Basler Avisblatt, extracted as XML. |
| Language | Early Modern German |
| Content Type | Digital data. Plain text in JSON files. |
| Task | Correct XML structure (add closing tags, remove faulty tags). |
| Expected output | JSON object with correct XML as string; output defined via Pydantic class. |
| Scoring | Fuzzy matching on the whole XML string after removing whitespace and lowercasing all characters. |
| Test runs | 40 runs. |
| Best result | 97.47%, Anthropic, Claude Sonnet 4.5. |
| Design | This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| BUSINESS LETTERS: NER EXTRACTION FROM CORRESPONDENCE | |
| Data | 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each. |
| Content | Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft. |
| Source | http://dx.doi.org/10.7891/e-manuscripta-54917 |
| Language | German |
| Content Type | Typed, printed and handwritten text. Signatures. |
| Task | Metadata extraction, person matching, signature recognition. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | F1 score (see the F1/CER sketch after Table 1). |
| Test runs | 212 runs. |
| Best result | 77.00%, OpenAI, gpt-5. |
| Design | This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| COMPANY LISTS: COMPANY DATA EXTRACTION FROM LIST-LIKE MATERIALS | |
| Data | 15 images, JPG, approx. 1868 × 2931 px, 360 KB each. |
| Content | Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland. |
| Source | https://doi.org/10.7891/e-manuscripta-174832 |
| Language | English and German |
| Content Type | Printed lists with strongly varying layout. |
| Task | Metadata extraction with varying layouts. |
| Expected output | JSON list of structured items; output defined via Pydantic class. |
| Scoring | F1 score. |
| Test runs | 76 runs. |
| Best result | 58.40%, OpenAI, gpt-5. |
| Design | This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| FRAKTUR ADVERTS: PAGE SEGMENTATION AND FRAKTUR TEXT TRANSCRIPTION | |
| Data | 5 images, JPG, approx. 5000 × 8267 px, 10 MB each. |
| Content | Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland. |
| Source | https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12) |
| Language | Early Modern German |
| Content Type | Printed text in Fraktur typeface. |
| Task | Segmentation of adverts and text recognition. |
| Expected output | JSON list of structured items; output defined via Pydantic class. |
| Scoring | F1 score & Character Error Rate (CER). |
| Test runs | 91 runs. |
| Best result | 95.70%, Google, gemini-2.0-pro-exp-02-05. |
| Design | This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image downscaling has a limited effect on model performance. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| LIBRARY CARDS: METADATA EXTRACTION FROM MULTILINGUAL SOURCES | |
| Data | 263 images, JPG, approx. 976 × 579 px, 10 KB each. |
| Content | Library cards with dissertation thesis information. |
| Source | https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 (last accessed 2026-01-12) |
| Language | German, French, English, Latin, Greek, and other European languages |
| Content Type | Typed and handwritten multilingual text. |
| Task | Metadata extraction. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | F1 score. |
| Test runs | 61 runs. |
| Best result | 89.51%, OpenAI, gpt-5. |
| Design | This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| MEDIEVAL MANUSCRIPTS: HANDWRITTEN TEXT RECOGNITION | |
| Data | 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each. |
| Content | Pages from Pilgerreisen nach Jerusalem 1440 und 1453. |
| Source | https://www.e-codices.ch/de/description/ubb/H-V-0015/HAN (last accessed 2026-01-12) |
| Language | Medieval German |
| Content Type | Handwritten text. |
| Task | Segmentation & text recognition. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | Fuzzy matching & Character Error Rate (CER). |
| Test runs | 38 runs. |
| Best result | 76.90%, OpenAI, gpt-4.1-mini. |
| Design | This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
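Throughout Table 1, the expected output of each benchmark is defined via a Pydantic class rather than an explicit prompt. As a minimal sketch of what such a schema definition might look like (the class and field names here are illustrative placeholders, not the benchmarks' actual schemas, which live in the respective GitHub repositories):

```python
from pydantic import BaseModel


class BibliographicItem(BaseModel):
    # Hypothetical fields for one bibliography entry; each benchmark
    # defines its own schema in its repository.
    author: str
    title: str
    year: str  # string rather than int: sources may contain ranges


class PageOutput(BaseModel):
    # A page transcribes to a JSON list of structured items.
    items: list[BibliographicItem]
```

A JSON Schema derived from such a class can be passed to the providers' structured-output interfaces, constraining the model to return parseable JSON that the scoring routines can then traverse key by key.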
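Several datasets score by fuzzy matching, either per JSON key (Bibliographic Data, Blacklist Cards) or on one normalized string (Book Advert XMLs). A minimal sketch using the standard library's `difflib`; the framework's actual similarity implementation may differ:

```python
from difflib import SequenceMatcher


def fuzzy_score(expected: str, predicted: str) -> float:
    # Similarity ratio in [0, 1]; 1.0 means an exact match.
    return SequenceMatcher(None, expected, predicted).ratio()


def score_item(expected: dict[str, str], predicted: dict[str, str]) -> float:
    # Average per-key fuzzy score over the expected keys, as in the
    # metadata extraction benchmarks; missing keys score against "".
    if not expected:
        return 0.0
    scores = [fuzzy_score(v, predicted.get(k, "")) for k, v in expected.items()]
    return sum(scores) / len(scores)


def normalize_xml(xml: str) -> str:
    # Normalization as described for Book Advert XMLs: remove all
    # whitespace and lowercase before comparing whole strings.
    return "".join(xml.split()).lower()
```

For Book Advert XMLs, `fuzzy_score(normalize_xml(gold), normalize_xml(output))` then corresponds to the whole-string comparison described in its Scoring row.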
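The remaining scoring schemes are the F1 score over extracted elements and the Character Error Rate (CER) for transcription quality. A self-contained sketch using the usual textbook definitions; the framework's exact matching and normalization rules are not reproduced here:

```python
def f1_score(expected: set[str], predicted: set[str]) -> float:
    # Harmonic mean of precision and recall over extracted elements.
    if not expected or not predicted:
        return 0.0
    tp = len(expected & predicted)
    precision = tp / len(predicted)
    recall = tp / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: Levenshtein edit distance divided by the
    # reference length (lower is better).
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    # An empty reference makes CER undefined; treat it as 1.0 if the
    # hypothesis is non-empty and 0.0 otherwise.
    return prev[n] / m if m else float(n > 0)
```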
Table 2
Contributors to the project. Contributor roles follow the CRediT taxonomy.
| NAME | PRIMARY ROLES |
|---|---|
| Anthea Alberto | Data Curation; Validation |
| Sven Burkhardt | Validation |
| Eric Decker | Data Curation; Validation |
| Pema Frick | Data Curation; Validation; Formal Analysis; Software |
| Maximilian Hindermann | Conceptualization; Methodology; Software; Formal Analysis |
| Lea Katharina Kasper | Data Curation; Validation; Formal Analysis |
| José Luis Losada Palenzuela | Data Curation; Validation |
| Sorin Marti | Conceptualization; Software; Formal Analysis; Visualization |
| Gabriel Müller | Data Curation; Validation |
| Ina Serif | Data Curation; Validation; Formal Analysis |
| Elena Spadini | Data Curation; Validation |
