Abstract
The RISE Humanities Benchmark suite emerged from concrete research support practices undertaken by the Research and Infrastructure Support (RISE) unit for humanities and social science researchers at the University of Basel. At RISE, our digital-humanities consulting and infrastructure work frequently involves evaluating computational methods on historical and multilingual text and image data. Over time, these evaluations produced a body of tacit methodological answers to a series of recurring questions: which large language models (LLMs) handle historical handwriting reliably, which configurations balance cost and accuracy, and which types of visual layouts lead to systematic failures? The RISE Humanities Benchmark suite transforms this accumulated experience into a structured framework that can be used to reference, verify, and extend such observations. More broadly, the goal of the suite is to enable the wider humanities community to perform informed assessments of LLMs against their own data without specialized technical expertise. By publishing procedures, datasets, and metrics in a consistent, open format, the suite lowers the threshold for evidence-based decision-making in computational humanities projects, making the grounds for such decisions explicit and contestable.
