1 Overview
Purpose and Scope
The RISE1 Humanities Data Benchmark is both a suite of humanities-oriented datasets and a provider-agnostic framework for executing and comparing model performance on document-centred research tasks. It was conceived to systematically assess the integration and performance of current LLMs in workflows encountered in concrete digital humanities research projects, such as transcription, layout segmentation, and structured metadata extraction from scanned historical sources. By combining focused tasks with explicitly structured schemas and ground truths, the suite operationalises evaluation and makes the underlying assumptions, performance criteria, and interpretative decisions explicit and comparable. As a result, the RISE benchmark prioritises realism and interpretability over scale, which helps mitigate the risk of data contamination and reflects the high cost of expert annotation of complex humanities material.
Concepts and Terminology
In this paper, we use the following terminology:
Benchmark dataset: A complete and executable evaluation unit consisting of curated source materials, a task definition, prompts, structured output schemas, expert-verified ground truths, and task-specific scoring functions.
Benchmark tool: A system for executing benchmark datasets with specified models and parameters, storing outputs and scores persistently.
Configured test: The execution of a benchmark dataset with a specific provider, model, and runtime configuration (e.g. temperature). Figure 1 shows how configured and ad hoc tests are executed.
Presentation interface: A searchable website for browsing and comparing benchmark results across datasets, models, and time. Figure 2 shows an example result page of a single benchmark.

Figure 1
Execution flow of a benchmark test run.

Figure 2
Overview of the Bibliographic Data benchmark in the visualization frontend.
This distinction between benchmark datasets and configured tests is central to the framework. It allows the same dataset to be reused across multiple models and providers while ensuring that each individual test run remains fully specified and repeatable.
Model Modalities and Evaluation Coverage
While the framework itself is model-agnostic, the current benchmark datasets focus on document understanding tasks involving image inputs, text inputs, or a combination of both, paired with structured outputs. Accordingly, the benchmark evaluates both vision-language models (here understood as large language models augmented with visual input capabilities and often referred to as multimodal LLMs) and text-based LLMs, depending on the modality required by the task.
Repository Location
The benchmark datasets and the benchmarking tool are hosted in the same public GitHub repository and archived on Zenodo for long-term access. Aggregated benchmark results are made accessible through a custom web interface.
GitHub: Public repository. Open for contribution. https://github.com/RISE-UNIBAS/humanities_data_benchmark (last accessed 2026-01-12)
Zenodo: Stable archive. With each release on GitHub, the new version is automatically stored in a Zenodo repository. Concept DOI: https://doi.org/10.5281/zenodo.16941752
Presentation website: Custom front-end to browse and search the benchmark results. https://rise-services.rise.unibas.ch/benchmarks/ (last accessed 2026-01-12)
Context
Recent works have offered methodological frameworks for integrating LLMs into humanities and social science research workflows (Karjus 2025; Abdurahman et al. 2025) or surveyed domain applications without addressing evaluation infrastructure (Simons et al. 2025). Most current benchmarks either prioritise scale (Kang et al. 2025; Hauser et al. 2024) or focus on specific domains (e.g. Ziems et al. 2024; Kraus et al. 2025; Spinaci et al. 2025; Greif et al. 2025). The RISE benchmark offers small, expert-verified datasets from real consulting projects featuring materially challenging sources. Ground truth is treated not as absolute correctness but as operationalised scholarly interpretation, making evaluative choices explicit and contestable. The framework is therefore not intended to produce generic leaderboard-style rankings or statistically representative performance estimates for the humanities as a whole, but to enable evidence-based comparison of model behaviour on concrete research tasks and changes in cost and performance over time.
2 Method
Steps
The creation of a new benchmark dataset follows a standardized workflow. First, representative source materials are selected from real humanities research projects. Next, a task definition (e.g. transcription, segmentation, or metadata extraction) is formulated, specifying the expected behaviour of the model.
Prompts and sampling parameters are then developed in consultation with domain experts to guide the model toward producing structured outputs that conform to a predefined schema. For example, a prompt may instruct the model to extract sender, recipient, date, and place from a historical business letter and to return the result as a structured JSON object.
For each benchmark dataset, an explicit output schema is defined, typically implemented as a Pydantic dataclass. This schema constrains model outputs to a fixed structure and enables automated validation and scoring. Ground truths are subsequently created or verified by domain experts and represent the expected output for each source item.
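To make this concrete, the following sketch shows what such a schema could look like for the business-letter example above; the class and field names are illustrative and do not reproduce the schema actually used in the benchmark.

```python
# Illustrative output schema for the business-letter example above.
# Class and field names are assumptions, not the benchmark's actual schema.
from typing import Optional

from pydantic import BaseModel


class LetterMetadata(BaseModel):
    """Structured output expected from the model for a single letter."""

    sender: Optional[str] = None     # person or company sending the letter
    recipient: Optional[str] = None  # addressee
    date: Optional[str] = None       # date as written on the document
    place: Optional[str] = None      # place of writing


# LetterMetadata.model_json_schema() yields a JSON Schema that can be passed
# to providers with structured-output support (Pydantic v2).
```

Constraining outputs in this way means that a malformed response is caught at validation time rather than silently degrading the scores.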
Finally, the task-specific scoring logic is implemented. Depending on the nature of the task, this may involve fuzzy string matching, F1 scores, or character error rates. All components (source materials, prompts, schemas, ground truths, and scoring functions) are stored together as part of the benchmark dataset.
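As an illustration of the scoring logic involved, the sketch below implements key-wise fuzzy matching and a character error rate using only the Python standard library; the scoring functions shipped with the individual benchmark datasets may differ in detail.

```python
# Sketch of two scoring primitives of the kind used across the benchmark
# datasets: key-wise fuzzy matching and character error rate (CER).
# Standard library only; the repository's own scoring functions may differ.
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity of two strings in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def score_item(prediction: dict, ground_truth: dict) -> float:
    """Average fuzzy similarity over all keys defined in the ground truth."""
    if not ground_truth:
        return 0.0
    ratios = [
        fuzzy_ratio(str(prediction.get(key, "")), str(value))
        for key, value in ground_truth.items()
    ]
    return sum(ratios) / len(ratios)


def cer(prediction: str, reference: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    m, n = len(prediction), len(reference)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(
                row[j] + 1,                                      # deletion
                row[j - 1] + 1,                                  # insertion
                prev + (prediction[i - 1] != reference[j - 1]),  # substitution
            )
            prev = cur
    return row[n] / max(n, 1)
```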
Configured Tests and Execution
Running a benchmark involves defining one or more configured tests. A configured test pairs a benchmark dataset with a specific provider, model, and runtime parameters such as temperature or maximum token limits. Once defined and assigned a unique ID, a test can be executed individually or as part of a batch.
During execution, the benchmarking tool submits the benchmark dataset to the selected model using the defined prompt and schema, captures the model output, validates it against the schema, and applies the dataset-specific scoring function. Outputs, scores, and metadata (including model version identifiers and execution timestamps) are stored persistently, enabling full reproducibility of each test run.
Configured tests are versioned and assigned unique identifiers (e.g. T0234), allowing previous results to be re-executed or compared across time as new models become available.
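Conceptually, a configured test is a small, fully specified record that drives the execution loop described above. The following sketch uses hypothetical field names and a placeholder provider client; it is not the repository's actual data model.

```python
# Sketch of a configured test and its execution loop. The field names, the
# client.generate call, and the dataset/scorer objects are assumptions made
# for illustration; the benchmarking tool's actual interfaces may differ.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfiguredTest:
    test_id: str        # unique identifier, e.g. "T0234"
    dataset: str        # name of the benchmark dataset
    provider: str       # e.g. "openai", "anthropic", "google"
    model: str          # provider-specific model identifier
    temperature: float  # sampling temperature
    max_tokens: int     # maximum output tokens


def run_test(test: ConfiguredTest, dataset, client, scorer) -> list[dict]:
    """Submit each source item, validate the structured output, score, persist."""
    results = []
    for item in dataset.items:
        raw = client.generate(             # hypothetical provider call
            model=test.model,
            prompt=dataset.prompt,
            schema=dataset.schema,         # Pydantic model defining the output
            temperature=test.temperature,
            max_tokens=test.max_tokens,
        )
        parsed = dataset.schema.model_validate_json(raw)  # schema validation
        results.append({
            "test_id": test.test_id,
            "item": item.id,
            "output": parsed.model_dump(),
            "score": scorer(parsed.model_dump(), item.ground_truth),
        })
    return results  # stored with model version identifiers and timestamps
```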
Sampling Strategy
Our selection strategy at the level of the benchmark datasets is grounded in concrete humanities research projects supported by RISE. The source materials were selected to reflect the typical challenges we encountered in practice, including variable document layouts, multilingual content, historical typefaces, and mixed handwritten and printed text.
Within each benchmark dataset, sampling was performed by domain experts to ensure representative coverage of both common cases and relevant edge cases. The benchmark datasets are intentionally small, reflecting the high cost of expert annotation and the diagnostic aim of the benchmark rather than statistical representativeness.
Quality Control and Ground Truth Validation
All ground truths were either manually produced or carefully verified by domain experts. Validation focused on correctness with respect to the defined task and output schema rather than on stylistic normalisation beyond what is required for scoring.
Because evaluation criteria are task-specific, each benchmark dataset implements its own scoring routine, tailored to the task at hand. Performance scores are directly comparable across models within a benchmark dataset. However, cross-benchmark comparisons are possible only after normalisation and should therefore be interpreted with caution.2
Aggregated Results
The test configurations are stored in a dedicated table containing the full execution parameters for each configured test. At the time of writing, the benchmark tool records 633 test runs across the benchmark datasets, covering 487 distinct test configurations.
For each benchmark dataset and model, the presentation interface reports the best achieved score across all executed configurations. These aggregated results permit a comparison of model behaviour within a task and provide a longitudinal view of performance as the capabilities of models evolve over time, as shown in Figure 3.
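This aggregation amounts to taking the maximum recorded score per model and dataset. A minimal sketch, assuming the stored results are available as flat records with dataset, model, and score fields:

```python
# Minimal sketch of the best-score aggregation behind Figure 3, assuming
# results are available as flat records; the presentation interface's
# actual query logic may differ.
def best_scores(runs: list[dict]) -> dict[tuple[str, str], float]:
    """Best score per (dataset, model) over all recorded configurations."""
    best: dict[tuple[str, str], float] = {}
    for run in runs:
        key = (run["dataset"], run["model"])
        best[key] = max(best.get(key, 0.0), run["score"])
    return best
```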

Figure 3
Best performance per model and benchmark dataset, taking the maximum score over all recorded configurations per model and dataset. Null values (N/A) indicate tests that have not yet been run or can no longer be run because the corresponding model has been deprecated.
3 Dataset Description
Name
RISE Humanities Data Benchmark
Objects
8 benchmark datasets (detailed in Table 1) and 2 test datasets.3
Table 1
Overview of benchmark datasets, tasks, and scoring schemes.
| BIBLIOGRAPHIC DATA: METADATA EXTRACTION FROM BIBLIOGRAPHIES | |
|---|---|
| Data | 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each. |
| Content | Pages from Bibliography of Works in the Philosophy of History, 1945–1957. |
| Source | http://www.jstor.org/stable/2504495 |
| Language | English |
| Content Type | Printed text. Directory-like. |
| Task | Transcription; metadata extraction. |
| Expected output | JSON list of structured items; output defined via Pydantic class. |
| Scoring | Fuzzy matching on each key in the JSON items. |
| Test runs | 77 runs. |
| Best result | 71.43%, OpenAI, gpt-4o. |
| Design | The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| BLACKLIST CARDS: NER EXTRACTION FROM INDEX CARDS | |
| Data | 33 images, JPG, approx. 1788 × 1305 px, 590 KB each. |
| Content | Swiss federal index cards for companies on a British black list for trade (1940s). |
| Source | https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 (last accessed 2026-01-12) |
| Language | Mostly German, some French |
| Content Type | Typed and handwritten text. Stamped dates. |
| Task | Transcription; metadata extraction. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | Fuzzy matching on each key in the JSON object. |
| Test runs | 38 runs. |
| Best result | 95.65%, OpenAI, gpt-4.1-mini. |
| Design | The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| BOOK ADVERT XMLS: DATA CORRECTION OF LLM-GENERATED XML FILES | |
| Data | 50 JSON files containing XML structures. |
| Content | Faulty LLM-generated XML structures from historical sources. |
| Source | Data from the digitized Basler Avisblatt, namely book advertisements, extracted as XML. |
| Language | Early Modern German |
| Content Type | Digital data. Plain text in JSON files. |
| Task | Correct XML structure (add closing tags, remove faulty tags). |
| Expected output | JSON object with correct XML as string; output defined via Pydantic class. |
| Scoring | Fuzzy matching on the whole XML string after removing whitespace and converting all characters to lowercase. |
| Test runs | 40 runs. |
| Best result | 97.47%, Anthropic, Claude Sonnet 4.5. |
| Design | This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| BUSINESS LETTERS: NER EXTRACTION FROM CORRESPONDENCE | |
| Data | 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each. |
| Content | Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft. |
| Source | http://dx.doi.org/10.7891/e-manuscripta-54917 |
| Language | German |
| Content Type | Typed, printed and handwritten text. Signatures. |
| Task | Metadata extraction, person matching, signature recognition. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | F1 score. |
| Test runs | 212 runs. |
| Best result | 77.00%, OpenAI, gpt-5. |
| Design | This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| COMPANY LISTS: COMPANY DATA EXTRACTION FROM LIST-LIKE MATERIALS | |
| Data | 15 images, JPG, approx. 1868 × 2931 px, 360 KB each. |
| Content | Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland. |
| Source | https://doi.org/10.7891/e-manuscripta-174832 |
| Language | English and German |
| Content Type | Printed lists with strongly varying layout. |
| Task | Metadata extraction with varying layouts. |
| Expected output | JSON list of structured items; output defined via Pydantic class. |
| Scoring | F1 score. |
| Test runs | 76 runs. |
| Best result | 58.40%, OpenAI, gpt-5. |
| Design | This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| FRAKTUR ADVERTS: PAGE SEGMENTATION AND FRAKTUR TEXT TRANSCRIPTION | |
| Data | 5 images, JPG, approx. 5000 × 8267 px, 10 MB each. |
| Content | Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland. |
| Source | https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12) |
| Language | Early modern German |
| Content Type | Printed text in Fraktur typeface. |
| Task | Segmentation of adverts and text recognition. |
| Expected output | JSON list of structured items; output defined via Pydantic class. |
| Scoring | F1 score & Character Error Rate (CER). |
| Test runs | 91 runs. |
| Best result | 95.70%, Google, gemini-2.0-pro-exp-02-05. |
| Design | This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image downscaling has a limited effect on model performance. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| LIBRARY CARDS: METADATA EXTRACTION FROM MULTILINGUAL SOURCES | |
| Data | 263 images, JPG, approx. 976 × 579 px, 10 KB each. |
| Content | Library cards with dissertation thesis information. |
| Source | https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 (last accessed 2026-01-12) |
| Language | German, French, English, Latin, Greek, and other European languages |
| Content Type | Typed and handwritten multilingual text. |
| Task | Metadata extraction. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | F1 score. |
| Test runs | 61 runs. |
| Best result | 89.51%, OpenAI, gpt-5. |
| Design | This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
| MEDIEVAL MANUSCRIPTS: HANDWRITTEN TEXT RECOGNITION | |
| Data | 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each. |
| Content | Pages from Pilgerreisen nach Jerusalem 1440 und 1453. |
| Source | https://www.e-codices.ch/de/description/ubb/H-V-0015/HAN (last accessed 2026-01-12) |
| Language | Medieval German |
| Content Type | Handwritten text. |
| Task | Segmentation & Text recognition. |
| Expected output | JSON object with predefined keys; output defined via Pydantic class. |
| Scoring | Fuzzy & Character Error Rate (CER). |
| Test runs | 38 runs. |
| Best result | 76.90%, OpenAI, gpt-4.1-mini. |
| Design | This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-12) |
Formats
JSON, CSV, JPEG, ASCII
Creation
August 2025 – ongoing
Dataset Creators
The benchmark dataset contributors are listed in Table 2.
Table 2
Contributors to the project. Contributor roles follow the CRediT taxonomy.
| NAME | PRIMARY ROLES |
|---|---|
| Anthea Alberto | Data Curation; Validation |
| Sven Burkhardt | Validation |
| Eric Decker | Data Curation; Validation |
| Pema Frick | Data Curation; Validation; Formal Analysis; Software |
| Maximilian Hindermann | Conceptualization; Methodology; Software; Formal Analysis |
| Lea Katharina Kasper | Data Curation; Validation; Formal Analysis |
| José Luis Losada Palenzuela | Data Curation; Validation |
| Sorin Marti | Conceptualization; Software; Formal Analysis; Visualization |
| Gabriel Müller | Data Curation; Validation |
| Ina Serif | Data Curation; Validation; Formal Analysis |
| Elena Spadini | Data Curation; Validation |
Languages
English; medieval and early modern German; multiple European languages
License
GPL-3.0 (benchmark tool), CC-BY-4.0 (benchmark datasets)4
Publication Dates
v0.1 – 2025-08-25; current version v0.4.0 – 2025-12-09
4 Reuse Potential
The RISE Humanities Data Benchmark offers multiple levels of reuse. Beyond the aggregated leaderboard results, the repository provides ready-to-use benchmark datasets that can be applied to a wide range of evaluations by redefining prompts, sampling parameters, output schemas, or scoring procedures without modifying the underlying data. This enables systematic comparisons of models, task formulations, and evaluation metrics on identical humanities sources. In addition, the benchmark framework itself can be reused as an evaluation infrastructure to create new datasets following a standardised structure. Newly created benchmarks can be evaluated across multiple models and inference providers using the same execution and scoring pipeline, supporting reproducibility, comparability, and community-driven extension.
As a next architectural refinement, the benchmark execution logic and visualization pipeline will be fully separated from the benchmark dataset definitions into a dedicated system layer, allowing benchmark versions to remain stable, citable, and reproducible while the evaluation infrastructure continues to evolve independently. Within this reconfiguration, standardised BenchmarkCards (see Sokol et al. 2025) will be introduced as an additional documentation layer to further enhance comparability and interpretability across different benchmarks and evaluation contexts, thereby supporting informed benchmark selection and cross-system evaluation. In parallel, the regular extension of the benchmark corpus will proceed unchanged, ensuring that methodological improvements do not constrain the addition of new datasets and tasks. Beyond this forward-looking expansion, the continuously stored results represent a valuable resource in their own right: they allow researchers to trace the development and performance of LLMs on specific humanities tasks over time, while the front-end enables the export of search results for further analysis, comparison, and integration into external research workflows.
Notes
[1] RISE (Research & Infrastructure Support) supports humanities and social science researchers at the University of Basel in designing, implementing, and sustaining computer-based research through expert guidance on digital methods, data management, analysis, and open, FAIR data dissemination.
[2] All benchmark results are normalised to a common 0–100% scale to facilitate presentation, search, and visualization across datasets. This supports exploratory analysis but does not establish direct comparability between different benchmark tasks or evaluation criteria.
[3] Three additional benchmark datasets are currently in preparation: data extraction from 1950s handwritten letters, Arabic–German bibliographic data from the Bibliotheca Afghanica (see https://ub.unibas.ch/de/sammlungen/bibliotheca-afghanica/, last accessed 2026-01-12), and reasoning-focused tasks for the project Das Judenthum in der Musik (see https://data.snf.ch/grants/grant/212806, last accessed 2026-01-12). These additions will broaden the suite beyond its current emphasis on image-based content.
[4] The repository currently combines the benchmarking framework and the benchmark datasets. A planned reorganization will separate these components into distinct repositories, enabling independent licensing of framework (GPL-3.0) and datasets (CC-BY-4.0).
[5] https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12).
Acknowledgements
The RISE Humanities Data Benchmark uses data from Printed Markets – The Basler Avisblatt,5 the forthcoming dissertation project of Lea Katharina Kasper, and curated collections of the University Library Basel.
We thank the contributors to the RISE Humanities Data Benchmark, Anthea Alberto, Sven Burkhardt, Eric Decker, Pema Frick, José Luis Losada Palenzuela, Gabriel Müller, Ina Serif, and Elena Spadini, for their contributions to dataset creation, annotation, software development, and evaluation design.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Maximilian Hindermann: Conceptualization; Methodology; Software; Data curation; Writing–original draft; Writing–review & editing; Supervision.
Sorin Marti: Conceptualization; Methodology; Software; Data curation; Visualization; Writing–review & editing.
Lea Katharina Kasper: Conceptualization; Data curation; Writing–original draft.
Arno Bosse: Writing–review & editing.
