
The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks


1 Overview

Purpose and Scope

The RISE [1] Humanities Data Benchmark is both a suite of humanities-oriented datasets and a provider-agnostic framework for executing and comparing model performance on document-centred research tasks. It was conceived to systematically assess the integration and performance of current LLMs on workflows encountered in concrete digital humanities research projects, such as transcription, layout segmentation, and structured metadata extraction from scanned historical sources. By combining focused tasks with explicitly structured schemas and ground truths, the suite operationalises evaluation and makes assumptions, performance criteria, and interpretative decisions explicit and comparable. As a result, the RISE benchmark prioritises realism and interpretability over scale. This helps mitigate the risk of data contamination and reflects the high cost of expert annotation of complex humanities material.

Concepts and Terminology

In this paper, we use the following terminology:

  • Benchmark dataset: A complete and executable evaluation unit consisting of curated source materials, a task definition, prompts, structured output schemas, expert-verified ground truths, and task-specific scoring functions.

  • Benchmark tool: A system for executing benchmark datasets with specified models and parameters, storing outputs and scores persistently.

  • Configured test: The execution of a benchmark dataset with a specific provider, model, and runtime configuration (e.g. temperature). Figure 1 shows how configured and ad hoc tests are executed.

  • Presentation interface: A searchable website for browsing and comparing benchmark results across datasets, models, and time. Figure 2 shows an example result page of a single benchmark.

Figure 1: Execution flow of a benchmark test run.

Figure 2: Overview of the Bibliographic Data benchmark in the visualization frontend.

This distinction between benchmark datasets and configured tests is central to the framework. It allows the same dataset to be reused across multiple models and providers while ensuring that each individual test run remains fully specified and repeatable.

Model Modalities and Evaluation Coverage

While the framework itself is model-agnostic, the current benchmark datasets focus on document understanding tasks involving image inputs, text inputs, or a combination of both, paired with structured outputs. Accordingly, the benchmark evaluates both vision-language models (here understood as large language models augmented with visual input capabilities and often referred to as multimodal LLMs) and text-based LLMs, depending on the modality required by the task.

Repository Location

The benchmark datasets and the benchmarking tool are hosted in the same public GitHub repository and archived on Zenodo for long-term access. Aggregated benchmark results are made accessible through a custom web interface.

Context

Recent works have offered methodological frameworks for integrating LLMs into humanities and social science research workflows (Karjus 2025; Abdurahman et al. 2025) or surveyed domain applications without addressing evaluation infrastructure (Simons et al. 2025). Most current benchmarks either prioritise scale (Kang et al. 2025; Hauser et al. 2024) or focus on specific domains (e.g. Ziems et al. 2024; Kraus et al. 2025; Spinaci et al. 2025; Greif et al. 2025). The RISE benchmark offers small, expert-verified datasets from real consulting projects featuring materially challenging sources. Ground truth is treated not as absolute correctness but as operationalised scholarly interpretation, making evaluative choices explicit and contestable. The framework is therefore not intended to produce generic leaderboard-style rankings or statistically representative performance estimates for the humanities as a whole, but to enable evidence-based comparison of model behaviour on concrete research tasks and changes in cost and performance over time.

2 Method

Steps

The creation of a new benchmark dataset follows a standardized workflow. First, representative source materials are selected from real humanities research projects. Next, a task definition (e.g. transcription, segmentation, or metadata extraction) is formulated specifying the expected behaviour of the model.

Prompts and sampling parameters are then developed in consultation with domain experts to guide the model toward producing structured outputs that conform to a predefined schema. For example, a prompt may instruct the model to extract sender, recipient, date, and place from a historical business letter and to return the result as a structured JSON object.

For each benchmark dataset, an explicit output schema is defined, typically implemented as a Pydantic dataclass. This schema constrains model outputs to a fixed structure and enables automated validation and scoring. Ground truths are subsequently created or verified by domain experts and represent the expected output for each source item.
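
As a minimal sketch of what such a schema might look like for the business-letter example above (field names and types are illustrative assumptions, shown here as a Pydantic BaseModel rather than the benchmark's actual class):

```python
from pydantic import BaseModel

class LetterMetadata(BaseModel):
    """Illustrative output schema for the business-letter example.

    Field names are assumptions for illustration only; the benchmark's
    actual Pydantic classes may differ.
    """
    sender: str
    recipient: str
    date: str   # kept as a string to tolerate partial or uncertain dates
    place: str
```

Constraining the output to such a class means a model response can be parsed and validated mechanically before any scoring takes place.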

Finally, the task-specific scoring logic is implemented. Depending on the nature of the task, this may involve fuzzy string matching, F1 scores, or character error rates. All components (source materials, prompts, schemas, ground truths, and scoring functions) are stored together as part of the benchmark dataset.
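
A fuzzy per-key scorer of the kind described above could look roughly like the following sketch; it uses the standard library's difflib as a stand-in, since the benchmark's actual fuzzy-matching backend is not specified here:

```python
from difflib import SequenceMatcher

def fuzzy_key_score(prediction: dict, ground_truth: dict) -> float:
    """Average per-key fuzzy similarity between prediction and ground truth,
    scaled to 0-100. A stand-in sketch, not the benchmark's actual scorer."""
    ratios = [
        SequenceMatcher(None, str(prediction.get(key, "")), str(value)).ratio()
        for key, value in ground_truth.items()
    ]
    return 100.0 * sum(ratios) / len(ratios) if ratios else 0.0
```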

Configured Tests and Execution

Running a benchmark involves defining one or more configured tests. A configured test pairs a benchmark dataset with a specific provider, model, and runtime parameters such as temperature or maximum token limits. Once defined and assigned a unique ID, a test can be executed individually or as part of a batch.
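
In code, a configured test can be thought of as a small, fully specified record along the following lines (field names and the example dataset identifier are illustrative assumptions; the tool's actual data model may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConfiguredTest:
    """Illustrative record of a configured test; field names are assumptions."""
    test_id: str              # e.g. "T0234"
    dataset: str              # benchmark dataset identifier
    provider: str             # e.g. "openai", "anthropic", "google"
    model: str                # provider-specific model name
    temperature: float = 0.0
    max_tokens: Optional[int] = None

# Hypothetical dataset identifier used purely for illustration.
test = ConfiguredTest(test_id="T0234", dataset="business_letters",
                      provider="openai", model="gpt-5")
```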

During execution, the benchmarking tool submits the benchmark dataset to the selected model using the defined prompt and schema, captures the model output, validates it against the schema, and applies the dataset-specific scoring function. Outputs, scores, and metadata (including model version identifiers and execution timestamps) are stored persistently, enabling full reproducibility of each test run.
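
Under these assumptions, the execution step amounts to a loop of submit, validate, score, and persist. The following sketch uses injected callables (call_model, score, store) and a Pydantic v2 BaseModel schema; none of these names are the tool's actual API:

```python
from datetime import datetime, timezone
from pydantic import ValidationError

def run_configured_test(test, dataset, call_model, score, store):
    """Sketch of one test run: submit each item, validate against the schema,
    score against the ground truth, and persist the result.
    All callables and attribute names are illustrative assumptions."""
    for item in dataset.items:
        raw = call_model(provider=test.provider, model=test.model,
                         prompt=dataset.prompt, source=item.source,
                         temperature=test.temperature)
        try:
            parsed = dataset.schema.model_validate_json(raw)   # Pydantic v2 API
            item_score = score(parsed.model_dump(), item.ground_truth)
        except ValidationError:
            item_score = 0.0   # schema violations score zero in this sketch
        store({
            "test_id": test.test_id,
            "item_id": item.item_id,
            "model": test.model,
            "output": raw,
            "score": item_score,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
```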

Configured tests are versioned and assigned unique identifiers (e.g. T0234), allowing previous results to be re-executed or compared across time as new models become available.

Sampling Strategy

Our selection strategy at the level of the benchmark datasets is grounded in concrete humanities research projects supported by RISE. The source materials were selected to reflect the typical challenges we encountered in practice, including variable document layouts, multilingual content, historical typefaces, and mixed handwritten and printed text.

Within each benchmark dataset, sampling was performed by domain experts to ensure representative coverage of both common cases and relevant edge cases. The benchmark datasets are intentionally small, reflecting the high cost of expert annotation and the diagnostic aim of the benchmark rather than statistical representativeness.

Quality Control and Ground Truth Validation

All ground truths were either manually produced or carefully verified by domain experts. Validation focused on correctness with respect to the defined task and output schema rather than on stylistic normalisation beyond what is required for scoring.

As a consequence, each benchmark dataset implements its own scoring routine tailored to the task at hand. Performance scores are directly comparable across models within a benchmark dataset, whereas cross-benchmark comparisons are possible only after normalisation and should therefore be interpreted with caution [2].

Aggregated Results

The test configurations are stored in a dedicated table containing the full execution parameters for each configured test. At the time of writing, the benchmark tool holds 633 recorded test runs across 487 test configurations, spanning multiple benchmark datasets.

For each benchmark dataset and model, the presentation interface reports the best achieved score across all executed configurations. These aggregated results permit a comparison of model behaviour within a task and provide a longitudinal view of performance as model capabilities evolve over time, as shown in Figure 3.
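
Conceptually, this aggregation is a group-by-and-max over the stored runs; a minimal sketch with pandas, using made-up column names and placeholder scores:

```python
import pandas as pd

# Illustrative rows; column names and score values are placeholders, not real results.
runs = pd.DataFrame([
    {"dataset": "business_letters", "model": "gpt-5", "score": 74.2},
    {"dataset": "business_letters", "model": "gpt-5", "score": 77.0},
    {"dataset": "library_cards",    "model": "gpt-5", "score": 89.5},
])

# Best achieved score per benchmark dataset and model.
best = runs.groupby(["dataset", "model"], as_index=False)["score"].max()
print(best)
```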

Figure 3: Best performance per model and benchmark dataset, taking the maximum score over all recorded configurations per model and dataset. Null values (N/A) indicate tests that have not yet been run or can no longer be run because the corresponding model has been deprecated.

3 Dataset Description

Name

RISE Humanities Data Benchmark

Objects

8 benchmark datasets (detailed in Table 1) and 2 test datasets [3].

Table 1

Overview of benchmark datasets, tasks, and scoring schemes.

BIBLIOGRAPHIC DATA: METADATA EXTRACTION FROM BIBLIOGRAPHIES
Data: 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each.
Content: Pages from Bibliography of Works in the Philosophy of History, 1945–1957.
Source: http://www.jstor.org/stable/2504495
Language: English
Content Type: Printed text. Directory-like.
Task: Transcription; metadata extraction.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: Fuzzy matching on each key in the JSON items.
Test runs: 77.
Best result: 71.43%, OpenAI, gpt-4o.
Design: The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BLACKLIST CARDS: NER EXTRACTION FROM INDEX CARDS
Data: 33 images, JPG, approx. 1788 × 1305 px, 590 KB each.
Content: Swiss federal index cards for companies on a British black list for trade (1940s).
Source: https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 (last accessed 2026-01-12)
Language: Mostly German, some French
Content Type: Typed and handwritten text. Stamped dates.
Task: Transcription; metadata extraction.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: Fuzzy matching on each key in the JSON object.
Test runs: 38.
Best result: 95.65%, OpenAI, gpt-4.1-mini.
Design: The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BOOK ADVERT XMLS: DATA CORRECTION OF LLM-GENERATED XML FILES
Data: 50 JSON files containing XML structures.
Content: Faulty LLM-generated XML structures from historical sources.
Source: Data from the digitized Basler Avisblatt, namely book advertisements, extracted as XML.
Language: Early Modern German
Content Type: Digital data. Plain text in JSON files.
Task: Correct XML structure (add closing tags, remove faulty tags).
Expected output: JSON object with correct XML as string; output defined via Pydantic class.
Scoring: Fuzzy matching on the whole XML string after removing white spaces and setting all characters to lowercase.
Test runs: 40.
Best result: 97.47%, Anthropic, Claude Sonnet 4.5.
Design: This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BUSINESS LETTERS: NER EXTRACTION FROM CORRESPONDENCE
Data: 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each.
Content: Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft.
Source: http://dx.doi.org/10.7891/e-manuscripta-54917
Language: German
Content Type: Typed, printed and handwritten text. Signatures.
Task: Metadata extraction, person matching, signature recognition.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 212.
Best result: 77.00%, OpenAI, gpt-5.
Design: This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
COMPANY LISTS: COMPANY DATA EXTRACTION FROM LIST-LIKE MATERIALS
Data: 15 images, JPG, approx. 1868 × 2931 px, 360 KB each.
Content: Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland.
Source: https://doi.org/10.7891/e-manuscripta-174832
Language: English and German
Content Type: Printed lists with strongly varying layout.
Task: Metadata extraction with varying layouts.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 76.
Best result: 58.40%, OpenAI, gpt-5.
Design: This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
FRAKTUR ADVERTS: PAGE SEGMENTATION AND FRAKTUR TEXT TRANSCRIPTION
Data: 5 images, JPG, approx. 5000 × 8267 px, 10 MB each.
Content: Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland.
Source: https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12)
Language: Early modern German
Content Type: Printed text in Fraktur typeface.
Task: Segmentation of adverts and text recognition.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: F1 score & Character Error Rate (CER).
Test runs: 91.
Best result: 95.70%, Google, gemini-2.0-pro-exp-02-05.
Design: This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image downscaling has a limited effect on model performance.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
LIBRARY CARDS: METADATA EXTRACTION FROM MULTILINGUAL SOURCES
Data: 263 images, JPG, approx. 976 × 579 px, 10 KB each.
Content: Library cards with dissertation thesis information.
Source: https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 (last accessed 2026-01-12)
Language: German, French, English, Latin, Greek, and other European languages
Content Type: Typed and handwritten multilingual text.
Task: Metadata extraction.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 61.
Best result: 89.51%, OpenAI, gpt-5.
Design: This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
MEDIEVAL MANUSCRIPTS: HANDWRITTEN TEXT RECOGNITION
Data: 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each.
Content: Pages from Pilgerreisen nach Jerusalem 1440 und 1453.
Source: https://www.e-codices.ch/de/description/ubb/H-V-0015/HAN (last accessed 2026-01-12)
Language: Medieval German
Content Type: Handwritten text.
Task: Segmentation & text recognition.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: Fuzzy matching & Character Error Rate (CER).
Test runs: 38.
Best result: 76.90%, OpenAI, gpt-4.1-mini.
Design: This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)

Formats

JSON, CSV, JPEG, ASCII

Creation

August 2025 – ongoing

Dataset Creators

The benchmark dataset contributors are listed in Table 2.

Table 2

Contributors to the project. Contributor roles follow the CRediT taxonomy.

NAME | PRIMARY ROLES
Anthea Alberto | Data Curation; Validation
Sven Burkhardt | Validation
Eric Decker | Data Curation; Validation
Pema Frick | Data Curation; Validation; Formal Analysis; Software
Maximilian Hindermann | Conceptualization; Methodology; Software; Formal Analysis
Lea Katharina Kasper | Data Curation; Validation; Formal Analysis
José Luis Losada Palenzuela | Data Curation; Validation
Sorin Marti | Conceptualization; Software; Formal Analysis; Visualization
Gabriel Müller | Data Curation; Validation
Ina Serif | Data Curation; Validation; Formal Analysis
Elena Spadini | Data Curation; Validation

Languages

English; medieval and early modern German; multiple European languages

License

GPL-3.0 (benchmark tool), CC-BY-4.0 (benchmark datasets) [4]

Publication Dates

v0.1 – 2025-08-25; current version v0.4.0 – 2025-12-09

4 Reuse Potential

The RISE Humanities Data Benchmark offers multiple levels of reuse. Beyond the aggregated leaderboard results, the repository provides ready-to-use benchmark datasets that can be applied to a wide range of evaluations by redefining prompts, sampling parameters, output schemas, or scoring procedures without modifying the underlying data. This enables systematic comparisons of models, task formulations, and evaluation metrics on identical humanities sources. In addition, the benchmark framework itself can be reused as an evaluation infrastructure to create new datasets following a standardised structure. Newly created benchmarks can be evaluated across multiple models and inference providers using the same execution and scoring pipeline, supporting reproducibility, comparability, and community-driven extension.
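
As an illustration of this reuse, pairing an existing dataset with an alternative scoring function requires no change to the underlying data; the sketch below swaps in a hypothetical exact-match scorer and assumes the run_configured_test sketch from the Method section (all names are illustrative, not the framework's actual API):

```python
def exact_match_score(prediction: dict, ground_truth: dict) -> float:
    """Hypothetical alternative scorer: share of keys matched exactly
    (case- and whitespace-insensitive), scaled to 0-100."""
    if not ground_truth:
        return 0.0
    hits = sum(
        str(prediction.get(key, "")).strip().lower() == str(value).strip().lower()
        for key, value in ground_truth.items()
    )
    return 100.0 * hits / len(ground_truth)

# Same dataset and configured test, different evaluation criterion:
# run_configured_test(test, dataset, call_model, score=exact_match_score, store=store)
```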

As a next architectural refinement, the benchmark execution logic and visualization pipeline will be separated from the benchmark dataset definitions into a dedicated system layer. This allows benchmark versions to remain stable, citable, and reproducible while the evaluation infrastructure evolves independently. Within this reconfiguration, standardised BenchmarkCards (see Sokol et al. 2025) will be introduced as an additional documentation layer to enhance comparability and interpretability across benchmarks and evaluation contexts, supporting informed benchmark selection and cross-system evaluation. In parallel, the benchmark corpus will continue to be extended regularly, ensuring that methodological improvements do not constrain the addition of new datasets and tasks. Beyond this forward-looking expansion, the continuously stored results are a valuable resource in their own right: they allow researchers to trace the development and performance of LLMs on specific humanities tasks over time, while the front-end supports exporting search results for further analysis, comparison, and integration into external research workflows.

Notes

[1] RISE (Research & Infrastructure Support) supports humanities and social science researchers at the University of Basel in designing, implementing, and sustaining computer-based research through expert guidance on digital methods, data management, analysis, and open, FAIR data dissemination.

[2] All benchmark results are normalised to a common 0–100% scale to facilitate presentation, search, and visualization across datasets. This supports exploratory analysis but does not establish direct comparability between different benchmark tasks or evaluation criteria.

[3] Three additional benchmark datasets are currently in preparation: data extraction from 1950s handwritten letters, Arabic–German bibliographic data from the Bibliotheca Afghanica (see https://ub.unibas.ch/de/sammlungen/bibliotheca-afghanica/, last accessed 2026-01-12), and reasoning-focused tasks for the project Das Judenthum in der Musik (see https://data.snf.ch/grants/grant/212806, last accessed 2026-01-12). These additions will broaden the suite beyond its current emphasis on image-based tasks.

[4] The repository currently combines the benchmarking framework and the benchmark datasets. A planned reorganization will separate these components into distinct repositories, enabling independent licensing of framework (GPL-3.0) and datasets (CC-BY-4.0).

[5] https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12).

Acknowledgements

The RISE Humanities Data Benchmark uses data from Printed Markets – The Basler Avisblatt [5], the forthcoming dissertation project of Lea Katharina Kasper, and curated collections of the University Library Basel.

We thank the contributors to the RISE Humanities Data Benchmark, Anthea Alberto, Sven Burkhardt, Eric Decker, Pema Frick, José Luis Losada Palenzuela, Gabriel Müller, Ina Serif, and Elena Spadini, for their work on dataset creation, annotation, software development, and evaluation design.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Maximilian Hindermann: Conceptualization; Methodology; Software; Data curation; Writing–original draft; Writing–review & editing; Supervision.

Sorin Marti: Conceptualization; Methodology; Software; Data curation; Visualization; Writing–review & editing.

Lea Katharina Kasper: Conceptualization; Data curation; Writing–original draft.

Arno Bosse: Writing–review & editing.

DOI: https://doi.org/10.5334/johd.481 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 15, 2025 | Accepted on: Jan 7, 2026 | Published on: Feb 4, 2026
Published by: Ubiquity Press

© 2026 Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper, Arno Bosse, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.