
The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks


1 Overview

Purpose and Scope

The RISE [1] Humanities Data Benchmark is both a suite of humanities-oriented datasets and a provider-agnostic framework for executing and comparing model performance on document-centred research tasks. It was conceived to systematically assess the integration and performance of current LLMs on workflows encountered in concrete digital humanities research projects, such as transcription, layout segmentation, and structured metadata extraction from scanned historical sources. By combining focused tasks with explicitly structured schemas and ground truths, the suite operationalises evaluation and makes assumptions, performance criteria, and interpretative decisions explicit and comparable. As a result, the RISE benchmark prioritises realism and interpretability over scale. This helps mitigate the risk of data contamination and reflects the high cost of expert annotation of complex humanities material.

Concepts and Terminology

In this paper, we use the following terminology:

  • Benchmark dataset: A complete and executable evaluation unit consisting of curated source materials, a task definition, prompts, structured output schemas, expert-verified ground truths, and task-specific scoring functions.

  • Benchmark tool: A system for executing benchmark datasets with specified models and parameters, storing outputs and scores persistently.

  • Configured test: The execution of a benchmark dataset with a specific provider, model, and runtime configuration (e.g. temperature). Figure 1 shows how configured and ad hoc tests are executed.

  • Presentation interface: A searchable website for browsing and comparing benchmark results across datasets, models, and time. Figure 2 shows an example result page of a single benchmark.

Figure 1: Execution flow of a benchmark test run.

Figure 2: Overview of the Bibliographic Data benchmark in the visualization frontend.

This distinction between benchmark datasets and configured tests is central to the framework. It allows the same dataset to be reused across multiple models and providers while ensuring that each individual test run remains fully specified and repeatable.

Model Modalities and Evaluation Coverage

While the framework itself is model-agnostic, the current benchmark datasets focus on document understanding tasks involving image inputs, text inputs, or a combination of both, paired with structured outputs. Accordingly, the benchmark evaluates both vision-language models (here understood as large language models augmented with visual input capabilities and often referred to as multimodal LLMs) and text-based LLMs, depending on the modality required by the task.

Repository Location

The benchmark datasets and the benchmarking tool are hosted in the same public GitHub repository and archived on Zenodo for long-term access. Aggregated benchmark results are made accessible through a custom web interface.

Context

Recent works have offered methodological frameworks for integrating LLMs into humanities and social science research workflows (Karjus 2025; Abdurahman et al. 2025) or surveyed domain applications without addressing evaluation infrastructure (Simons et al. 2025). Most current benchmarks either prioritise scale (Kang et al. 2025; Hauser et al. 2024) or focus on specific domains (e.g. Ziems et al. 2024; Kraus et al. 2025; Spinaci et al. 2025; Greif et al. 2025). The RISE benchmark offers small, expert-verified datasets from real consulting projects featuring materially challenging sources. Ground truth is treated not as absolute correctness but as operationalised scholarly interpretation, making evaluative choices explicit and contestable. The framework is therefore not intended to produce generic leaderboard-style rankings or statistically representative performance estimates for the humanities as a whole, but to enable evidence-based comparison of model behaviour on concrete research tasks and changes in cost and performance over time.

2 Method

Steps

The creation of a new benchmark dataset follows a standardized workflow. First, representative source materials are selected from real humanities research projects. Next, a task definition (e.g. transcription, segmentation, or metadata extraction) is formulated specifying the expected behaviour of the model.

Prompts and sampling parameters are then developed in consultation with domain experts to guide the model toward producing structured outputs that conform to a predefined schema. For example, a prompt may instruct the model to extract sender, recipient, date, and place from a historical business letter and to return the result as a structured JSON object.

For each benchmark dataset, an explicit output schema is defined, typically implemented as a Pydantic dataclass. This schema constrains model outputs to a fixed structure and enables automated validation and scoring. Ground truths are subsequently created or verified by domain experts and represent the expected output for each source item.
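
As a minimal sketch of what such a schema might look like for the business-letter example above (field names and types are illustrative assumptions, shown here as a Pydantic BaseModel rather than the benchmark's actual class):

```python
from pydantic import BaseModel

class LetterMetadata(BaseModel):
    """Illustrative output schema for the business-letter example.

    Field names are assumptions for illustration only; the benchmark's
    actual Pydantic classes may differ.
    """
    sender: str
    recipient: str
    date: str   # kept as a string to tolerate partial or uncertain dates
    place: str
```

Constraining the output to such a class means a model response can be parsed and validated mechanically before any scoring takes place.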

Finally, the task-specific scoring logic is implemented. Depending on the nature of the task, this may involve fuzzy string matching, F1 scores, or character error rates. All components (source materials, prompts, schemas, ground truths, and scoring functions) are stored together as part of the benchmark dataset.
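
A fuzzy per-key scorer of the kind described above could look roughly like the following sketch; it uses the standard library's difflib as a stand-in, since the benchmark's actual fuzzy-matching backend is not specified here:

```python
from difflib import SequenceMatcher

def fuzzy_key_score(prediction: dict, ground_truth: dict) -> float:
    """Average per-key fuzzy similarity between prediction and ground truth,
    scaled to 0-100. A stand-in sketch, not the benchmark's actual scorer."""
    ratios = [
        SequenceMatcher(None, str(prediction.get(key, "")), str(value)).ratio()
        for key, value in ground_truth.items()
    ]
    return 100.0 * sum(ratios) / len(ratios) if ratios else 0.0
```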

Configured Tests and Execution

Running a benchmark involves defining one or more configured tests. A configured test pairs a benchmark dataset with a specific provider, model, and runtime parameters such as temperature or maximum token limits. Once defined and assigned a unique ID, a test can be executed individually or as part of a batch.
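
In code, a configured test can be thought of as a small, fully specified record along the following lines (field names and the example dataset identifier are illustrative assumptions; the tool's actual data model may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConfiguredTest:
    """Illustrative record of a configured test; field names are assumptions."""
    test_id: str              # e.g. "T0234"
    dataset: str              # benchmark dataset identifier
    provider: str             # e.g. "openai", "anthropic", "google"
    model: str                # provider-specific model name
    temperature: float = 0.0
    max_tokens: Optional[int] = None

# Hypothetical dataset identifier used purely for illustration.
test = ConfiguredTest(test_id="T0234", dataset="business_letters",
                      provider="openai", model="gpt-5")
```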

During execution, the benchmarking tool submits the benchmark dataset to the selected model using the defined prompt and schema, captures the model output, validates it against the schema, and applies the dataset-specific scoring function. Outputs, scores, and metadata (including model version identifiers and execution timestamps) are stored persistently, enabling full reproducibility of each test run.
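
Under these assumptions, the execution step amounts to a loop of submit, validate, score, and persist. The following sketch uses injected callables (call_model, score, store) and a Pydantic v2 BaseModel schema; none of these names are the tool's actual API:

```python
from datetime import datetime, timezone
from pydantic import ValidationError

def run_configured_test(test, dataset, call_model, score, store):
    """Sketch of one test run: submit each item, validate against the schema,
    score against the ground truth, and persist the result.
    All callables and attribute names are illustrative assumptions."""
    for item in dataset.items:
        raw = call_model(provider=test.provider, model=test.model,
                         prompt=dataset.prompt, source=item.source,
                         temperature=test.temperature)
        try:
            parsed = dataset.schema.model_validate_json(raw)   # Pydantic v2 API
            item_score = score(parsed.model_dump(), item.ground_truth)
        except ValidationError:
            item_score = 0.0   # schema violations score zero in this sketch
        store({
            "test_id": test.test_id,
            "item_id": item.item_id,
            "model": test.model,
            "output": raw,
            "score": item_score,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
```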

Configured tests are versioned and assigned unique identifiers (e.g. T0234), allowing previous results to be re-executed or compared across time as new models become available.

Sampling Strategy

Our selection strategy at the level of the benchmark datasets is grounded in concrete humanities research projects supported by RISE. The source materials were selected to reflect the typical challenges we encountered in practice, including variable document layouts, multilingual content, historical typefaces, and mixed handwritten and printed text.

Within each benchmark dataset, sampling was performed by domain experts to ensure representative coverage of both common cases and relevant edge cases. The benchmark datasets are intentionally small, reflecting the high cost of expert annotation and the diagnostic aim of the benchmark rather than statistical representativeness.

Quality Control and Ground Truth Validation

All ground truths were either manually produced or carefully verified by domain experts. Validation focused on correctness with respect to the defined task and output schema rather than on stylistic normalisation beyond what is required for scoring.

As a consequence, each benchmark dataset implements its own scoring routine tailored to the task at hand. Performance scores are directly comparable across models within a benchmark dataset, whereas cross-benchmark comparisons are possible only after normalisation and should therefore be interpreted with caution [2].

Aggregated Results

The test configurations are stored in a dedicated table containing the full execution parameters for each configured test. At the time of writing, the benchmark tool holds 633 recorded test runs across 487 test configurations, spanning multiple benchmark datasets.

For each benchmark dataset and model, the presentation interface reports the best achieved score across all executed configurations. These aggregated results permit a comparison of model behaviour within a task and provide a longitudinal view of performance as model capabilities evolve over time, as shown in Figure 3.
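
Conceptually, this aggregation is a group-by-and-max over the stored runs; a minimal sketch with pandas, using made-up column names and placeholder scores:

```python
import pandas as pd

# Illustrative rows; column names and score values are placeholders, not real results.
runs = pd.DataFrame([
    {"dataset": "business_letters", "model": "gpt-5", "score": 74.2},
    {"dataset": "business_letters", "model": "gpt-5", "score": 77.0},
    {"dataset": "library_cards",    "model": "gpt-5", "score": 89.5},
])

# Best achieved score per benchmark dataset and model.
best = runs.groupby(["dataset", "model"], as_index=False)["score"].max()
print(best)
```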

Figure 3: Best performance per model and benchmark dataset, taking the maximum score over all recorded configurations per model and dataset. Null values (N/A) indicate tests that have not yet been run or can no longer be run because the corresponding model has been deprecated.

3 Dataset Description

Name

RISE Humanities Data Benchmark

Objects

8 benchmark datasets (detailed in Table 1) and 2 test datasets [3].

Table 1

Overview of benchmark datasets, tasks, and scoring schemes.

BIBLIOGRAPHIC DATA: METADATA EXTRACTION FROM BIBLIOGRAPHIES
Data: 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each.
Content: Pages from Bibliography of Works in the Philosophy of History, 1945–1957.
Source: http://www.jstor.org/stable/2504495
Language: English
Content Type: Printed text. Directory-like.
Task: Transcription; metadata extraction.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: Fuzzy matching on each key in the JSON items.
Test runs: 77.
Best result: 71.43%, OpenAI, gpt-4o.
Design: The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BLACKLIST CARDS: NER EXTRACTION FROM INDEX CARDS
Data: 33 images, JPG, approx. 1788 × 1305 px, 590 KB each.
Content: Swiss federal index cards for companies on a British black list for trade (1940s).
Source: https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 (last accessed 2026-01-12)
Language: Mostly German, some French
Content Type: Typed and handwritten text. Stamped dates.
Task: Transcription; metadata extraction.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: Fuzzy matching on each key in the JSON object.
Test runs: 38.
Best result: 95.65%, OpenAI, gpt-4.1-mini.
Design: The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BOOK ADVERT XMLS: DATA CORRECTION OF LLM-GENERATED XML FILES
Data: 50 JSON files containing XML structures.
Content: Faulty LLM-generated XML structures from historical sources.
Source: Data from the digitized Basler Avisblatt, namely book advertisements, extracted as XML.
Language: Early Modern German
Content Type: Digital data. Plain text in JSON files.
Task: Correct XML structure (add closing tags, remove faulty tags).
Expected output: JSON object with correct XML as string; output defined via Pydantic class.
Scoring: Fuzzy matching on the whole XML string after removing white spaces and setting all characters to lowercase.
Test runs: 40.
Best result: 97.47%, Anthropic, Claude Sonnet 4.5.
Design: This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BUSINESS LETTERS: NER EXTRACTION FROM CORRESPONDENCE
Data: 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each.
Content: Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft.
Source: http://dx.doi.org/10.7891/e-manuscripta-54917
Language: German
Content Type: Typed, printed and handwritten text. Signatures.
Task: Metadata extraction, person matching, signature recognition.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 212.
Best result: 77.00%, OpenAI, gpt-5.
Design: This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
COMPANY LISTS: COMPANY DATA EXTRACTION FROM LIST-LIKE MATERIALS
Data: 15 images, JPG, approx. 1868 × 2931 px, 360 KB each.
Content: Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland.
Source: https://doi.org/10.7891/e-manuscripta-174832
Language: English and German
Content Type: Printed lists with strongly varying layout.
Task: Metadata extraction with varying layouts.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 76.
Best result: 58.40%, OpenAI, gpt-5.
Design: This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
FRAKTUR ADVERTS: PAGE SEGMENTATION AND FRAKTUR TEXT TRANSCRIPTION
Data: 5 images, JPG, approx. 5000 × 8267 px, 10 MB each.
Content: Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland.
Source: https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12)
Language: Early modern German
Content Type: Printed text in Fraktur typeface.
Task: Segmentation of adverts and text recognition.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: F1 score & Character Error Rate (CER).
Test runs: 91.
Best result: 95.70%, Google, gemini-2.0-pro-exp-02-05.
Design: This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image downscaling has a limited effect on model performance.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
LIBRARY CARDS: METADATA EXTRACTION FROM MULTILINGUAL SOURCES
Data: 263 images, JPG, approx. 976 × 579 px, 10 KB each.
Content: Library cards with dissertation thesis information.
Source: https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 (last accessed 2026-01-12)
Language: German, French, English, Latin, Greek, and other European languages
Content Type: Typed and handwritten multilingual text.
Task: Metadata extraction.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 61.
Best result: 89.51%, OpenAI, gpt-5.
Design: This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
MEDIEVAL MANUSCRIPTS: HANDWRITTEN TEXT RECOGNITION
Data: 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each.
Content: Pages from Pilgerreisen nach Jerusalem 1440 und 1453.
Source: https://www.e-codices.ch/de/description/ubb/H-V-0015/HAN (last accessed 2026-01-12)
Language: Medieval German
Content Type: Handwritten text.
Task: Segmentation & text recognition.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: Fuzzy matching & Character Error Rate (CER).
Test runs: 38.
Best result: 76.90%, OpenAI, gpt-4.1-mini.
Design: This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)

Formats

JSON, CSV, JPEG, ASCII

Creation

August 2025 – ongoing

Dataset Creators

The benchmark dataset contributors are listed in Table 2.

Table 2

Contributors to the project. Contributor roles follow the CRediT taxonomy.

NAME | PRIMARY ROLES
Anthea Alberto | Data Curation; Validation
Sven Burkhardt | Validation
Eric Decker | Data Curation; Validation
Pema Frick | Data Curation; Validation; Formal Analysis; Software
Maximilian Hindermann | Conceptualization; Methodology; Software; Formal Analysis
Lea Katharina Kasper | Data Curation; Validation; Formal Analysis
José Luis Losada Palenzuela | Data Curation; Validation
Sorin Marti | Conceptualization; Software; Formal Analysis; Visualization
Gabriel Müller | Data Curation; Validation
Ina Serif | Data Curation; Validation; Formal Analysis
Elena Spadini | Data Curation; Validation

Languages

English; medieval and early modern German; multiple European languages

License

GPL-3.0 (benchmark tool), CC-BY-4.0 (benchmark datasets) [4]

Publication Dates

v0.1 – 2025-08-25; current version v0.4.0 – 2025-12-09

4 Reuse Potential

The RISE Humanities Data Benchmark offers multiple levels of reuse. Beyond the aggregated leaderboard results, the repository provides ready-to-use benchmark datasets that can be applied to a wide range of evaluations by redefining prompts, sampling parameters, output schemas, or scoring procedures without modifying the underlying data. This enables systematic comparisons of models, task formulations, and evaluation metrics on identical humanities sources. In addition, the benchmark framework itself can be reused as an evaluation infrastructure to create new datasets following a standardised structure. Newly created benchmarks can be evaluated across multiple models and inference providers using the same execution and scoring pipeline, supporting reproducibility, comparability, and community-driven extension.
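
As an illustration of this reuse, pairing an existing dataset with an alternative scoring function requires no change to the underlying data; the sketch below swaps in a hypothetical exact-match scorer and assumes the run_configured_test sketch from the Method section (all names are illustrative, not the framework's actual API):

```python
def exact_match_score(prediction: dict, ground_truth: dict) -> float:
    """Hypothetical alternative scorer: share of keys matched exactly
    (case- and whitespace-insensitive), scaled to 0-100."""
    if not ground_truth:
        return 0.0
    hits = sum(
        str(prediction.get(key, "")).strip().lower() == str(value).strip().lower()
        for key, value in ground_truth.items()
    )
    return 100.0 * hits / len(ground_truth)

# Same dataset and configured test, different evaluation criterion:
# run_configured_test(test, dataset, call_model, score=exact_match_score, store=store)
```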

As a next architectural refinement, the benchmark execution logic and visualization pipeline will be separated from the benchmark dataset definitions into a dedicated system layer. This allows benchmark versions to remain stable, citable, and reproducible while the evaluation infrastructure evolves independently. Within this reconfiguration, standardised BenchmarkCards (see Sokol et al. 2025) will be introduced as an additional documentation layer to enhance comparability and interpretability across benchmarks and evaluation contexts, supporting informed benchmark selection and cross-system evaluation. In parallel, the benchmark corpus will continue to be extended regularly, ensuring that methodological improvements do not constrain the addition of new datasets and tasks. Beyond this forward-looking expansion, the continuously stored results are a valuable resource in their own right: they allow researchers to trace the development and performance of LLMs on specific humanities tasks over time, while the front-end supports exporting search results for further analysis, comparison, and integration into external research workflows.

Notes

[1] RISE (Research & Infrastructure Support) supports humanities and social science researchers at the University of Basel in designing, implementing, and sustaining computer-based research through expert guidance on digital methods, data management, analysis, and open, FAIR data dissemination.

[2] All benchmark results are normalised to a common 0–100% scale to facilitate presentation, search, and visualization across datasets. This supports exploratory analysis but does not establish direct comparability between different benchmark tasks or evaluation criteria.

[3] Three additional benchmark datasets are currently in preparation: data extraction from 1950s handwritten letters, Arabic–German bibliographic data from the Bibliotheca Afghanica (see https://ub.unibas.ch/de/sammlungen/bibliotheca-afghanica/, last accessed 2026-01-12), and reasoning-focused tasks for the project Das Judenthum in der Musik (see https://data.snf.ch/grants/grant/212806, last accessed 2026-01-12). These additions will broaden the suite beyond its current emphasis on image-based tasks.

[4] The repository currently combines the benchmarking framework and the benchmark datasets. A planned reorganization will separate these components into distinct repositories, enabling independent licensing of framework (GPL-3.0) and datasets (CC-BY-4.0).

[5] https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12).

Acknowledgements

The RISE Humanities Data Benchmark uses data from Printed Markets – The Basler Avisblatt [5], the forthcoming dissertation project of Lea Katharina Kasper, and curated collections of the University Library Basel.

We thank the contributors to the RISE Humanities Data Benchmark, Anthea Alberto, Sven Burkhardt, Eric Decker, Pema Frick, José Luis Losada Palenzuela, Gabriel Müller, Ina Serif, and Elena Spadini, for their work on dataset creation, annotation, software development, and evaluation design.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Maximilian Hindermann: Conceptualization; Methodology; Software; Data curation; Writing–original draft; Writing–review & editing; Supervision.

Sorin Marti: Conceptualization; Methodology; Software; Data curation; Visualization; Writing–review & editing.

Lea Katharina Kasper: Conceptualization; Data curation; Writing–original draft.

Arno Bosse: Writing–review & editing.

DOI: https://doi.org/10.5334/johd.481 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 15, 2025 | Accepted on: Jan 7, 2026 | Published on: Feb 4, 2026
Published by: Ubiquity Press

© 2026 Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper, Arno Bosse, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.