Updating the German Psycholinguistic Word Toolbox with AI-Generated Estimates of Concreteness, Valence, Arousal, Age of Acquisition, and Familiarity

Open Access | Jan 2026

Figures & Tables

Figure 1

Correlations of the GPT estimates with the various other measures of concreteness and imageability. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs was 900 for correlations between AI-generated estimates and human ratings, and 70,000 for correlations between AI-generated estimates.
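
The layout described in this caption (Spearman above the diagonal, Pearson below, with a different number of data pairs per cell) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' analysis code: the function names, the dataset names and all values in `toy` are made up for the example, and each cell simply uses the words that two datasets share.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mixed_matrix(norms):
    """norms maps dataset name -> {word: value}.

    Returns two dicts keyed by (row, column): the correlation
    (Spearman above the diagonal, Pearson below) and the number
    of shared words on which that cell is based.
    """
    names = list(norms)
    corr, n_pairs = {}, {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i == j:
                continue
            shared = sorted(set(norms[a]) & set(norms[b]))
            x = [norms[a][w] for w in shared]
            y = [norms[b][w] for w in shared]
            n_pairs[(a, b)] = len(shared)
            corr[(a, b)] = spearman(x, y) if i < j else pearson(x, y)
    return corr, n_pairs


# Hypothetical toy data, NOT values from the article.
toy = {
    "human": {"Hund": 4.8, "Idee": 1.5, "Tisch": 4.9, "Mut": 1.9},
    "gpt": {"Hund": 4.6, "Idee": 1.2, "Tisch": 5.0, "Mut": 2.4, "Zeit": 1.8},
    "other": {"Hund": 4.5, "Idee": 2.0, "Mut": 1.6},
}
corr, n_pairs = mixed_matrix(toy)
for cell in sorted(corr):
    print(cell, round(corr[cell], 2), "n =", n_pairs[cell])
```

Because every cell is computed on its own set of shared words, two cells in the same row can rest on very different sample sizes, which is why the caption reports the minimum number of data pairs.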

Figure 2

Hierarchical cluster analysis based on Spearman correlations, showing that the GPT concreteness estimates correlated closely with the human concreteness ratings (based on corr_cluster from the AnalysisLin R package; Lin, 2024).
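
As a rough illustration of how a cluster tree can be derived from correlations, the sketch below runs average-linkage agglomerative clustering on the distance 1 − ρ. It is an assumption-laden stand-in: the internals of `corr_cluster` in AnalysisLin are not described here and may differ (for example in the linkage rule), and the variable names and correlation values are invented for the example.

```python
def average_linkage(names, corr):
    """Agglomerative clustering on distance = 1 - correlation.

    corr maps frozenset({a, b}) -> correlation between a and b.
    Returns the list of merges as (cluster_1, cluster_2, distance).
    """
    clusters = [(name,) for name in names]
    merges = []

    def dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - corr[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], round(d, 3)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical Spearman correlations, NOT values from the article.
names = ["GPT", "Human_A", "Human_B", "Frequency"]
corr = {
    frozenset({"GPT", "Human_A"}): 0.90,
    frozenset({"GPT", "Human_B"}): 0.85,
    frozenset({"Human_A", "Human_B"}): 0.88,
    frozenset({"GPT", "Frequency"}): 0.30,
    frozenset({"Human_A", "Frequency"}): 0.35,
    frozenset({"Human_B", "Frequency"}): 0.25,
}
merges = average_linkage(names, corr)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

In this toy input the GPT variable joins a human rating first and the frequency variable joins last, which is the kind of pattern the dendrogram is used to show.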

Figure 3

Correlations of the GPT estimates with the various other measures of valence. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs is 880 for correlations between AI-generated estimates and human ratings, and 70,000 for correlations between AI-generated estimates.

Figure 4

Hierarchical cluster analysis based on the Spearman correlations, showing that the GPT valence estimates correlate closely with the human valence ratings (based on corr_cluster from the AnalysisLin R package; Lin, 2024).

Figure 5

Correlations of the GPT estimates with the various other measures of arousal. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs is 880 for correlations between AI-generated estimates and human ratings, and 70,000 for correlations between AI-generated estimates.

Figure 6

Hierarchical cluster analysis based on the Spearman correlations, showing that the GPT estimates of arousal tend to be less correlated with the human data than the AI estimates of Köper & Schulte im Walde (2016) (based on corr_cluster from the AnalysisLin R package; Lin, 2024).

Figure 7

Correlations of the GPT estimates with the various other measures of AoA. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs is 497 for correlations between AI-generated estimates and human ratings, and 12,900 for correlations between AI-generated estimates.

Figure 8

The correlations of Figure 7 with GPT_ft added as an extra variable. The fine-tuned GPT estimates correlate more highly with the other values than the original GPT estimates do.

Figure 9

Distribution of the AoA values for human ratings, GPT estimates, and GPT fine-tuned estimates.

Table 1

Spearman correlations between human familiarity ratings, GPT estimates, and word frequency norms. Below the diagonal: number of data pairs on which the correlation is based.

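| | Schröder_12 | Schröter_17 | LinguaPix | Xu_25 | GPT_Ger | GPT_Eng | Subtlex | Multilex | Childlex | Childlex_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| Schröder_12 | | .76 | .45 | .42 | .67 | .64 | .42 | .42 | .43 | .46 |
| Schröter_17 | 116 | | .37 | .46 | .55 | .53 | .65 | .68 | .46 | .53 |
| LinguaPix | 213 | 255 | | .33 | .41 | .40 | .17 | .18 | .15 | .14 |
| Xu_25 | 3138773 (three counts, fused in the source) | | | | .59 | .61 | .48 | .45 | .31 | .29 |
| GPT_Ger | 820 | 1152 | 1248 | 880 | | .95 | .70 | .72 | .60 | .60 |
| GPT_Eng | 820 | 1152 | 1248 | 880 | 3195 | | .65 | .67 | .54 | .53 |
| Subtlex | 636 | 1150 | 1001 | 880 | 2769 | 2769 | | .98 | .72 | .77 |
| Multilex | 767 | 1152 | 1144 | 880 | 3042 | 3042 | 2768 | | .72 | .80 |
| Childlex | 605 | 1148 | 984 | 874 | 2720 | 2720 | 2608 | 2706 | | .88 |
| Childlex_lemma | 605 | 1148 | 984 | 874 | 2720 | 2720 | 2608 | 2706 | 2720 | |
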
Figure 10

Distribution of the GPT estimates for the German and the English prompt. Most familiarity estimates with the German prompt are around seven, whereas the mode with the English prompt is six.

Figure 11

Human ratings as a function of GPT-FAM estimate and Multilex word frequency. Dark red = low human rating, light yellow = high human rating. Black dots indicate the 2010 words rated. Uncolored areas contain no stimuli.

Rating scale: 1 (überhaupt nicht, "not at all") to 7 (sehr gut, "very well").

Table 2

Words with the largest differences between GPT estimates and human ratings. Top: words with lower human ratings than GPT estimates. Bottom: words with higher human ratings than GPT estimates.

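| Word | Rating | Multilex | GPT_FAM | Diff |
|---|---|---|---|---|
| herüben | 2.38 | 1.61 | 5.73 | –3.36 |
| ingeniös | 1.83 | 0.71 | 5.03 | –3.20 |
| wütig | 2.85 | 1.01 | 5.97 | –3.12 |
| dichtbei | 1.97 | 1.01 | 5.00 | –3.03 |
| Matthäuspassion | 2.53 | 1.01 | 5.05 | –2.52 |
| Exonym | 1.26 | 0.30 | 3.71 | –2.45 |
| spitzig | 3.71 | 1.71 | 5.99 | –2.28 |
| darren | 1.17 | 3.45 | 3.45 | –2.28 |
| Sekunda | 1.78 | 0.71 | 4.01 | –2.23 |
| überwach | 2.88 | 1.97 | 5.03 | –2.15 |
| huren | 5.97 | 3.89 | 2.14 | 3.83 |
| Endlösung | 5.62 | 2.35 | 1.39 | 4.23 |
| Schützenkönig | 5.28 | 1.66 | 1.04 | 4.24 |
| Kinderschänder | 6.24 | 2.90 | 1.90 | 4.34 |
| Neger | 5.42 | 3.68 | 1.06 | 4.35 |
| Ecktisch | 5.75 | 2.05 | 1.06 | 4.70 |
| Fotze | 6.02 | 3.73 | 1.17 | 4.85 |
| Hitlergruß | 6.02 | 1.66 | 1.10 | 4.92 |
| Hurensohn | 6.32 | 3.97 | 1.01 | 5.31 |
| Hakenkreuz | 6.35 | 2.71 | 1.02 | 5.33 |
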
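
The Diff column in Table 2 is consistent with the human rating minus the GPT_FAM estimate. The short sketch below recomputes it for four rows copied from the table and sorts them. The tolerance of 0.015 is an assumption: it allows for the published values having been rounded after the difference was taken, which would explain why some recomputed differences are off by 0.01.

```python
# (word, human rating, GPT_FAM estimate, Diff as printed in Table 2)
rows = [
    ("herüben", 2.38, 5.73, -3.36),
    ("ingeniös", 1.83, 5.03, -3.20),
    ("Schützenkönig", 5.28, 1.04, 4.24),
    ("Ecktisch", 5.75, 1.06, 4.70),
]

recomputed = []
for word, rating, gpt_fam, reported in rows:
    diff = rating - gpt_fam  # negative: GPT estimate exceeds the human rating
    matches = abs(diff - reported) < 0.015  # assumed rounding tolerance
    recomputed.append((word, round(diff, 2), reported, matches))

# Most negative first (GPT overestimates), most positive last (GPT underestimates).
recomputed.sort(key=lambda row: row[1])
for word, diff, reported, matches in recomputed:
    print(f"{word:15s} recomputed={diff:+.2f} reported={reported:+.2f} ok={matches}")
```
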
Table 3

Correlations between the newly collected human ratings and existing variables. Between brackets: the 95% confidence interval.

| Study | N_stim in common | Spearman correlation |
|---|---|---|
| Schröder_12 | 331 | .67 [.60–.72] |
| Schröter_17 | 622 | .33 [.26–.40] |
| LinguaPix | 1,489 | .29 [.25–.34] |
| Xu_25 | 355 | .26 [.16–.36] |
| Multilex word frequency | 7,578 | .68 [.67–.69] |
| Untuned GPT-FAM estimates | 10,540 | .85 [.85–.86] |

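
The caption does not say how the 95% confidence intervals were obtained. One common approximation is the Fisher z-transformation, sketched below with the Schröder_12 row (correlation .67, 331 shared stimuli) as input. Treat the method as an assumption: a bootstrap or a Spearman-specific standard error would give slightly different limits, and the printed correlation is itself rounded, so the result need not reproduce the table exactly.

```python
from math import atanh, sqrt, tanh


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r based on n pairs.

    Uses the Fisher z-transformation with standard error 1 / sqrt(n - 3).
    z_crit = 1.96 corresponds to a two-sided 95% interval.
    """
    z = atanh(r)
    se = 1 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


# Values taken from the Schröder_12 row of Table 3.
low, high = fisher_ci(0.67, 331)
print(f"[{low:.2f}, {high:.2f}]")
```

The interval narrows as the number of shared stimuli grows, which matches the pattern in Table 3, where the rows with thousands of stimuli have much tighter brackets than the rows with a few hundred.
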
Table 4

Spearman correlations between accuracy, Multilex, GPT-FAM, GPT-FAM-ft and human ratings for three vocabulary tests.

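| Test | N_stim | Multilex | GPT-FAM | GPT-FAM-ft | Rating |
|---|---|---|---|---|---|
| GAudI | 85 | .375 | .555 | .634 | .669 |
| PPVT | 72 | .527 | .556 | .622 | .663 |
| NOVA | 110 | .763 | .760 | .845 | .866 |
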
Table 5

Spearman correlations for the three lexical decision experiments described in Brysbaert et al. (2011) and the three studies described in Günther et al. (2020): accuracy, reaction time (RT) and gaze duration. -- means there were not enough data because the stimuli were not part of the rating studies.

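| Study | N_stim | Multilex | GPT-FAM | GPT-FAM-ft | Rating |
|---|---|---|---|---|---|
| S1_acc | 460 | .468 | .486 | .499 | .445 |
| S1_RT | 460 | –.700 | –.656 | –.650 | –.539 |
| S2_acc | 451 | .595 | .676 | .740 | -- |
| S2_RT | 451 | –.662 | –.731 | –.766 | -- |
| S3_acc | 2154 | .527 | .594 | .644 | .673 |
| S3_RT | 2154 | –.493 | –.523 | –.553 | –.531 |
| G20_RT1 | 1810 | –.376 | –.397 | –.422 | –.415 |
| G20_RT2 | 1810 | –.429 | –.488 | –.504 | –.490 |
| G20_gaze | 1810 | –.338 | –.336 | –.335 | –.317 |
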
Figure 12

Spearman correlations with Childlex lemma frequency (red), Multilex frequency (purple), GPT-FAM (blue) and GPT-FAM-ft (green) for accuracy (left panel) and RT (right panel) in the lexical decision task of Schröter & Schroeder (2017).

Figure 13

Familiarity estimates for English cognates and control words. Left panel: GPT-FAM; right panel: GPT-FAM-ft.

DOI: https://doi.org/10.5334/joc.482 | Journal eISSN: 2514-4820
Language: English
Submitted on: Dec 9, 2025 | Accepted on: Dec 23, 2025 | Published on: Jan 8, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Javier Conde, Gonzalo Martínez, María Grandury, Carlos Arriaga, Juan Haro, Sascha Schroeder, Florian Hintz, Pedro Reviriego, Marc Brysbaert, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.