References
- Abdurahman, S., Salkhordeh Ziabari, A., Moore, A. K., Bartels, D. M., & Dehghani, M. (2025). A primer for evaluating large language models in social science research. Advances in Methods and Practices in Psychological Science, 8(2). 10.1177/25152459251325174
- Bamman, D., Chang, K. K., Lucy, L., & Zhou, N. (2024, October). On classification with large language models in cultural analytics. arXiv.
- Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A.-L., Martinez-Ortiz, C., Psomopoulos, F., … Honeyman, T. (2022, October). Introducing the FAIR Principles for research software. Scientific Data, 9(1), 622. 10.1038/s41597-022-01710-x
- Dobson, J. (2020, June). Interpretable outputs: Criteria for machine learning in the humanities. Digital Humanities Quarterly, 15(2).
- Hamilton, S., Wilkens, M., & Piper, A. (2025, October). NarraBench: A comprehensive framework for narrative benchmarking. arXiv. 10.48550/arXiv.2510.09869
- Hauser, J., Kondor, D., Reddish, J., Benam, M., Cioni, E., Villa, F., … del Rio-Chanona, R. M. (2024, November). Large language models’ expert-level global history knowledge benchmark (hiST-LLM). In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Retrieved from https://openreview.net/forum?id=xlKeMuyoZ5#discussion
- Hindermann, M. (2024, August). FAIR use of GPT-generated data in SSH research: A practical guide.
- Hindermann, M., Marti, S., Alberto, A., Burkhardt, S., Decker, E., Frick, P., … Spadini, E. (2026, January). RISE-UNIBAS/humanities_data_benchmark. Zenodo. 10.5281/zenodo.18293269
- Hindermann, M., Marti, S., Kasper, L., & Bosse, A. (2026). The RISE Humanities Data Benchmark: A framework for evaluating large language models for humanities tasks. Journal of Open Humanities Data, 12(1), 24. 10.5334/johd.481
- Kang, Z., Gong, J., Yan, J., Xia, W., Wang, Y., Wang, Z., … Li, X. (2025, June). HSSBench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv. 10.48550/arXiv.2506.03922
- Karjus, A. (2025, February). Machine-assisted quantitizing designs: Augmenting humanities and social sciences with artificial intelligence. Humanities and Social Sciences Communications, 12(1), 277. 10.1057/s41599-025-04503-w
- Khadangi, A., Sartipi, A., Tchappi, I., & Fridgen, G. (2025, February). CognArtive: Large language models for automating art analysis and decoding aesthetic elements. arXiv. 10.48550/arXiv.2502.04353
- Marti, S. (2024, April). NDR Core. Zenodo. 10.5281/zenodo.10969133
- Marti, S. (2025, October). generic-llm-api-client: A unified, provider-agnostic Python client for multiple LLM APIs. Retrieved 2025-11-14, from https://github.com/RISE-UNIBAS/generic_llm_api_client
- Simons, A., Zichert, M., & Wüthrich, A. (2025, June). Large language models for history, philosophy, and sociology of science: Interpretive uses, methodological challenges, and critical perspectives. arXiv. 10.48550/arXiv.2506.12242
- Sokol, A., Daly, E., Hind, M., Piorkowski, D., Zhang, X., Moniz, N., & Chawla, N. (2024). BenchmarkCards: Standardized documentation for large language model benchmarks. arXiv. 10.48550/arXiv.2410.12974
- Spinaci, G., Klic, L., & Colavizza, G. (2025, September). Benchmarking vision–language and multimodal large language models in zero-shot and few-shot scenarios: A study on Christian iconography. arXiv. 10.63744/oxWtm5MhhwBH
- Treloar, A., Groenewegen, D., & Harboe-Ree, C. (2007, September). The data curation continuum: Managing data objects in institutional repositories. D-Lib Magazine, 13(9/10). 10.1045/september2007-treloar
- Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., & Yang, D. (2024, March). Can large language models transform computational social science? Computational Linguistics, 50(1), 237–291. 10.1162/coli_a_00502
