Table 1
Framework for Evaluating LLM Benchmarks for Historical Knowledge.
| FRAMEWORK DIMENSION | KEY QUESTION | WHAT THIS DIMENSION TESTS | ASSESSMENT APPROACH |
|---|---|---|---|
| Contamination Resistance | Does performance reflect model memorization or historical reasoning? | Whether models encountered evaluation questions during training. | Compare performance on contaminated vs. decontaminated datasets. |
| Content Diversity | Does model competence generalize beyond well-represented domains? | How performance varies across geographic, temporal, and linguistic contexts. | Test across global knowledge domains, not just well-represented Western contexts. |
| Format Diversity | Can pattern recognition transfer to reasoning? | Whether LLMs can extend factual recall into analytical examination. | Compare multiple-choice identification (recognition) vs. open-ended synthesis (reasoning). |
| Epistemological Sophistication | Can models distinguish how we know from what we know? | Whether models understand how historical knowledge is constructed. | Examine capacities for source criticism, evidential weighing, and inferential reasoning. |

Figure 1
Accuracy of Selected LLMs on History Questions in the MMLU Benchmark: HELM Subject Leaderboards (High School History), January 2025. GPT-3 data from Hendrycks et al. (2021); all other models from Stanford CRFM (2025).

Figure 2
Comparison of LLM Performance on MMLU Variants: Overall Accuracy vs. History Questions. MMLU-Pro data (left) from Wang et al. (2024); MMLU-CF data (right) from Zhao et al. (2024).

Figure 3
Balanced Accuracy of LLMs on HiST-LLM Global History Benchmark (4-Choice Setting). Data from Hauser et al. (2024).

Figure 4
Balanced Accuracy of Selected LLMs by Region: HiST-LLM Global History Benchmark (4-Choice). Data from Hauser et al. (2024).

Figure 5
Accuracy of LLMs on FoundaBench History Subjects (CircularEval): Chinese Models Highlighted. Data from Li et al. (2024).

Figure 6
Reliability of LLMs on HiBenchLLM French Regional History Benchmark (Share of Questions With 100% Correct Variants). Data from Chartier et al. (2025).

Figure 7
Reliability by Question Type on HiBenchLLM French Regional History Benchmark: Top LLMs and Top-5 Average. Data from Chartier et al. (2025).

Figure 8
Error Categories in ChatGPT Responses on Italian Fascism. Data from De Ninno and Lacriola (2025).

Table 2
Four-Dimensional Framework for Historical LLM Assessment with Design Implications.
| DIMENSION | DEFINITION | KEY FINDINGS | DESIGN IMPLICATIONS |
|---|---|---|---|
| Contamination Resistance | Extent to which benchmark isolates genuine capabilities from memorization | 90%→68% accuracy after decontamination | Curator-reviewed questions; temporal holdout sets; avoid widely distributed datasets |
| Content Diversity | Geographic, temporal, linguistic breadth of evaluation dataset | HiST-LLM Latin America anomaly (49%) reflects Seshat’s pre-Columbian focus; FoundaBench shows training-data composition matters more than computational scale | Include underrepresented regions/periods; document source coverage; test across languages |
| Format Diversity | Range of response types (MC, open-ended, structured) and complexity levels | HiBenchLLM: 38% overall → 16% interpretive synthesis; MC provides scaffolding that inflates scores | Multiple formats per domain; emphasize generation over selection; include multi-step reasoning |
| Epistemological Sophistication | Assessment of source criticism, historiographical judgment, evidential reasoning | Italian Fascism: 56.7% error rate; models cite but cannot apply historiographic arguments; privilege older English-language scholarship and outdated framings | Test interpretation vs. fact; require source evaluation; probe historiographical awareness |
