Table 1
Framework for Evaluating LLM Benchmarks for Historical Knowledge.
| FRAMEWORK DIMENSION | KEY QUESTION | WHAT THIS DIMENSION TESTS | ASSESSMENT APPROACH |
|---|---|---|---|
| Contamination Resistance | Does performance reflect model memorization or historical reasoning? | Whether models encountered evaluation questions during training. | Compare performance on contaminated vs. decontaminated datasets. |
| Content Diversity | Does model competence generalize beyond well-represented domains? | How performance varies across geographic, temporal, and linguistic contexts. | Test across global knowledge domains, not just well-represented Western contexts. |
| Format Diversity | Can pattern recognition transfer to reasoning? | Whether LLMs can extend factual recall into analytical examination. | Compare multiple-choice identification (recognition) vs. open-ended synthesis (reasoning). |
| Epistemological Sophistication | Can models distinguish how we know from what we know? | Whether models understand how historical knowledge is constructed. | Examine capacities for source criticism, evidential weighing, and inferential reasoning. |

Figure 1
Accuracy of Selected LLMs on History Questions in the MMLU Benchmark: HELM Subject Leaderboards (High School History), January 2025. GPT-3 data from Hendrycks et al. (2021); all other models from Stanford CRFM (2025).

Figure 2
Comparison of LLM Performance on MMLU Variants: Overall Accuracy vs. History Questions. MMLU-Pro data (left) from Wang et al. (2024); MMLU-CF data (right) from Zhao et al. (2024).

Figure 3
Balanced Accuracy of LLMs on HiST-LLM Global History Benchmark (4-Choice Setting). Data from Hauser et al. (2024).

Figure 4
Balanced Accuracy of Selected LLMs by Region: HiST-LLM Global History Benchmark (4-Choice). Data from Hauser et al. (2024).

Figure 5
Accuracy of LLMs on FoundaBench History Subjects (CircularEval): Chinese Models Highlighted. Data from Li et al. (2024).

Figure 6
Reliability of LLMs on HiBenchLLM French Regional History Benchmark (Share of Questions With 100% Correct Variants). Data from Chartier et al. (2025).

Figure 7
Reliability by Question Type on HiBenchLLM French Regional History Benchmark: Top LLMs and Top-5 Average. Data from Chartier et al. (2025).

Figure 8
Error Categories in ChatGPT Responses on Italian Fascism. Data from De Ninno and Lacriola (2025).

Table 2
Four-Dimensional Framework for Historical LLM Assessment with Design Implications.
| DIMENSION | DEFINITION | KEY FINDINGS | DESIGN IMPLICATIONS |
|---|---|---|---|
| Contamination Resistance | Extent to which benchmark isolates genuine capabilities from memorization | 90%→68% accuracy after decontamination | Curator-reviewed questions; temporal holdout sets; avoid widely distributed datasets |
| Content Diversity | Geographic, temporal, linguistic breadth of evaluation dataset | HiST-LLM Latin America anomaly (49%) reflects Seshat’s pre-Columbian focus; FoundaBench shows training-data composition matters more than computational scale | Include underrepresented regions/periods; document source coverage; test across languages |
| Format Diversity | Range of response types (MC, open-ended, structured) and complexity levels | HiBenchLLM: 38% overall → 16% interpretive synthesis; MC provides scaffolding that inflates scores | Multiple formats per domain; emphasize generation over selection; include multi-step reasoning |
| Epistemological Sophistication | Assessment of source criticism, historiographical judgment, evidential reasoning | Italian Fascism: 56.7% error rate; models cite but cannot apply historiographic arguments; privilege older English-language scholarship and outdated framings | Test interpretation vs. fact; require source evaluation; probe historiographical awareness |
