
Benchmarking as Source Criticism: From Recognition to Reasoning in LLM Assessment

Open Access | Mar 2026

Abstract

Large language models (LLMs) are increasingly used in historical research and pedagogy, yet assessments of their capabilities tell conflicting stories. Leading benchmarks suggest LLMs perform at an expert level in historical domains, while other evaluations reveal dramatic failures on core forms of historical reasoning. This paper examines five benchmarks assessing the historical competencies of LLMs to identify methodological approaches for devising assessments that reveal rather than obscure model limitations. Comparative analysis shows a collapse in LLM performance as assessments move from contaminated to decontaminated datasets, from Western to global knowledge domains, and from multiple-choice questions to open-ended responses. These patterns expose a fundamental gap between pattern recognition, where models excel, and historical reasoning, which remains beyond the current capabilities of “frontier” models. In making these claims, the paper introduces four dimensions of assessment for developing future benchmarks. As LLMs become further integrated into our digital lives, rigorous benchmarking is essential for matching their actual strengths to appropriate tasks while reserving human judgment for the interpretative work at the heart of humanistic scholarship.

DOI: https://doi.org/10.5334/johd.489 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 24, 2025 | Accepted on: Jan 29, 2026 | Published on: Mar 13, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Daniel Hutchinson, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.