Abstract
Large language models (LLMs) are increasingly used in historical research and pedagogy, yet assessments of their capabilities tell conflicting stories. Leading benchmarks suggest LLMs achieve expert-level performance in historical domains, while other evaluations reveal dramatic failures on core forms of historical reasoning. This paper examines five benchmarks assessing the historical competencies of LLMs to identify methodological approaches for devising assessments that reveal rather than obscure model limitations. Comparative analysis shows a collapse in LLM performance as assessments move from contaminated to decontaminated datasets, from Western to global knowledge domains, and from multiple-choice questions to open-ended responses. These patterns expose a fundamental gap between pattern recognition, where models excel, and historical reasoning, which remains beyond the current capabilities of “frontier” models. In making these claims, the paper introduces four dimensions of assessment for developing future benchmarks. As LLMs become more deeply integrated into our digital lives, rigorous benchmarking is essential for matching their actual strengths to appropriate tasks while reserving human judgment for the interpretative work at the heart of humanistic scholarship.
