Abstract
The RISE Humanities Benchmark suite emerged from concrete research support practices undertaken by the Research and Infrastructure Support (RISE) unit for humanities and social science researchers at the University of Basel. At RISE, our digital-humanities consulting and infrastructure work frequently involves evaluating computational methods on historical and multilingual text and image data. Over time, these evaluations produced a body of tacit methodological answers to a series of recurring questions: which large language models (LLMs) handle historical handwriting reliably, which configurations balance cost and accuracy, and which types of visual layouts lead to systematic failures? The RISE Humanities Benchmark suite transforms this accumulated experience into a structured framework that can be used to reference, verify, and extend such observations. More broadly, the goal of the suite is to enable the wider humanities community to perform informed assessments of LLMs against their own data without specialized technical expertise. By publishing procedures, datasets, and metrics in a consistent, open format, the suite lowers the threshold for evidence-based decision-making in computational humanities projects, making the grounds for such decisions explicit and contestable.
