
Benchmarking as Source Criticism: From Recognition to Reasoning in LLM Assessment

Open Access | Mar 2026

Figures & Tables

Table 1

Framework for Evaluating LLM Benchmarks for Historical Knowledge.

| Framework Dimension | Key Question | What This Dimension Tests | Assessment Approach |
| --- | --- | --- | --- |
| Contamination Resistance | Does performance reflect model memorization or historical reasoning? | Whether models encountered evaluation questions during training. | Compare performance on contaminated vs. decontaminated datasets (see the sketch below). |
| Content Diversity | Does model competence generalize beyond well-represented domains? | How performance varies across geographic, temporal, and linguistic contexts. | Test across global knowledge domains, not only Western and otherwise well-represented ones. |
| Format Diversity | Can pattern recognition transfer to reasoning? | Whether LLMs can extend factual recall into analytical examination. | Compare multiple-choice identification (recognition) vs. open-ended synthesis (reasoning). |
| Epistemological Sophistication | Can models distinguish how we know from what we know? | Whether models understand how historical knowledge is constructed. | Examine capacities for source criticism, evidential weighing, and inferential reasoning. |
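The contamination check in Table 1 can be made concrete with a short sketch. The function and field names below are illustrative assumptions, not the paper's pipeline: each question record carries a `correct` flag and a `contaminated` flag (the latter typically derived from n-gram or embedding overlap with training corpora), and the gap between the two accuracies estimates how much of a score reflects memorization.

```python
# Minimal sketch of the contamination-resistance check in Table 1.
# `results` is a hypothetical list of per-question records; names are
# illustrative only and do not reproduce any benchmark's actual code.

def accuracy(records):
    """Fraction of records answered correctly."""
    return sum(r["correct"] for r in records) / len(records)

def contamination_gap(results):
    """Compare accuracy on contaminated vs. decontaminated questions.

    Each record needs: 'correct' (bool) and 'contaminated' (bool), the
    latter flagged by overlap between the question and training data.
    """
    seen = [r for r in results if r["contaminated"]]
    clean = [r for r in results if not r["contaminated"]]
    return accuracy(seen), accuracy(clean), accuracy(seen) - accuracy(clean)

results = [
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": False},
    {"correct": False, "contaminated": False},
]
seen_acc, clean_acc, gap = contamination_gap(results)
print(f"contaminated: {seen_acc:.0%}  decontaminated: {clean_acc:.0%}  gap: {gap:+.0%}")
```

A large positive gap suggests the benchmark rewards recall of seen material rather than historical reasoning, which is the pattern Table 2 summarizes as the 90% to 68% drop after decontamination.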
Figure 1

Accuracy of Selected LLMs on History Questions in the MMLU Benchmark: HELM Subject Leaderboards (High School History), January 2025. GPT-3 data from Hendrycks et al. (2021); all other models from Stanford CRFM (2025).

Figure 2

Comparison of LLM Performance on MMLU Variants: Overall Accuracy vs. History Questions. MMLU-Pro data (left) from Wang et al. (2024); MMLU-CF data (right) from Zhao et al. (2024).

Figure 3

Balanced Accuracy of LLMs on HiST-LLM Global History Benchmark (4-Choice Setting). Data from Hauser et al. (2024).
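Balanced accuracy, the metric reported in Figures 3 and 4, is the mean of per-class recall: each answer category contributes equally to the score regardless of how many questions fall into it, so an over-represented category cannot dominate. A minimal sketch of the standard formula (not the HiST-LLM implementation):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: every true class counts equally,
    however many questions belong to it."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Class "A" is over-represented: plain accuracy would be 3/4, but
# balanced accuracy weighs the single "B" question equally, giving 0.5.
print(balanced_accuracy(["A", "A", "A", "B"], ["A", "A", "A", "C"]))
```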

Figure 4

Balanced Accuracy of Selected LLMs by Region: HiST-LLM Global History Benchmark (4-Choice). Data from Hauser et al. (2024).

Figure 5

Accuracy of LLMs on FoundaBench History Subjects (CircularEval): Chinese Models Highlighted. Data from Li et al. (2024).
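CircularEval, the scoring scheme in Figure 5, rotates the answer options of each multiple-choice question and counts the question correct only when the model chooses the right answer under every rotation, which suppresses positional guessing. A hedged sketch of the idea (illustrative names, not FoundaBench's implementation):

```python
def rotations(options):
    """All circular shifts of the option list: ABCD, BCDA, CDAB, DABC."""
    return [options[i:] + options[:i] for i in range(len(options))]

def circular_correct(ask_model, question, options, answer):
    """True only if the model picks `answer` under every rotation."""
    return all(ask_model(question, opts) == answer for opts in rotations(options))

# Toy model that always picks the first option: it passes a single
# 4-choice question 25% of the time by luck, but never passes CircularEval.
first_option = lambda q, opts: opts[0]
print(circular_correct(first_option, "Who wrote X?", ["A", "B", "C", "D"], "A"))  # False
```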

Figure 6

Reliability of LLMs on HiBenchLLM French Regional History Benchmark (Share of Questions With 100% Correct Variants). Data from Chartier et al. (2025).
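The reliability measure in Figures 6 and 7 credits a question only when the model answers every rephrased variant of it correctly, so a single paraphrase failure marks the question unreliable. A minimal sketch, assuming per-variant results grouped by question (an illustrative structure, not the HiBenchLLM code):

```python
def reliability(variant_results):
    """Share of questions whose variants were ALL answered correctly.

    `variant_results` maps a question id to a list of booleans,
    one per paraphrased variant of that question.
    """
    fully_correct = sum(all(v) for v in variant_results.values())
    return fully_correct / len(variant_results)

print(reliability({
    "q1": [True, True, True],    # reliable: every variant correct
    "q2": [True, False, True],   # unreliable: one variant missed
}))  # 0.5
```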

Figure 7

Reliability by Question Type on HiBenchLLM French Regional History Benchmark: Top LLMs and Top-5 Average. Data from Chartier et al. (2025).

Figure 8

Error Categories in ChatGPT Responses on Italian Fascism. Data from De Ninno and Lacriola (2025).

Table 2

Four-Dimensional Framework for Historical LLM Assessment with Design Implications.

| Dimension | Definition | Key Findings | Design Implications |
| --- | --- | --- | --- |
| Contamination Resistance | Extent to which a benchmark isolates genuine capabilities from memorization | 90% → 68% accuracy after decontamination | Curator-reviewed questions; temporal holdout sets; avoid widely distributed datasets |
| Content Diversity | Geographic, temporal, and linguistic breadth of the evaluation dataset | HiST-LLM Latin America anomaly (49%) reflects Seshat's pre-Columbian focus; FoundaBench shows training data composition matters more than computational scale | Include underrepresented regions and periods; document source coverage; test across languages |
| Format Diversity | Range of response types (multiple-choice, open-ended, structured) and complexity levels | HiBenchLLM: 38% overall, dropping to 16% on interpretive synthesis; multiple choice provides scaffolding that inflates scores | Multiple formats per domain; emphasize generation over selection; include multi-step reasoning |
| Epistemological Sophistication | Assessment of source criticism, historiographical judgment, and evidential reasoning | Italian Fascism: 56.7% error rate; models cite but cannot apply historiographic arguments; privilege older English-language scholarship and outdated framings | Test interpretation vs. fact; require source evaluation; probe historiographical awareness |
DOI: https://doi.org/10.5334/johd.489 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 24, 2025 | Accepted on: Jan 29, 2026 | Published on: Mar 13, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Daniel Hutchinson, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.