
Figure 1
Correlations of the GPT estimates with the various other measures of concreteness and imageability. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs for correlations between AI-generated estimates and human ratings was 900; for correlations among AI-generated estimates it was 70,000.

Figure 2
Hierarchical cluster analysis based on Spearman correlations, showing that the GPT concreteness estimates correlated closely with the human concreteness ratings (based on corr_cluster from the AnalysisLin R package; Lin, 2024).
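
For readers who want to reproduce the clustering idea without the AnalysisLin package, the base-R sketch below illustrates the general approach: compute a Spearman correlation matrix across the measures, convert it to a distance, and cluster hierarchically. The data frame `ratings` and its column names are hypothetical placeholders, and corr_cluster may use different distance and linkage settings.

```r
# Minimal base-R sketch of the clustering idea behind Figures 2, 4 and 6.
# The data frame `ratings` and its column names are hypothetical placeholders.
ratings <- data.frame(
  human = c(4.8, 2.1, 3.9, 1.5, 4.2, 2.8),
  gpt   = c(4.6, 2.3, 4.1, 1.8, 4.0, 3.0),
  other = c(3.9, 2.7, 3.5, 2.2, 4.4, 2.9)
)

# Spearman correlation matrix across measures
r <- cor(ratings, method = "spearman", use = "pairwise.complete.obs")

# Turn similarity (correlation) into distance and cluster hierarchically
d  <- as.dist(1 - r)
hc <- hclust(d, method = "average")
plot(hc, main = "Clustering of measures on 1 - Spearman r")
```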

Figure 3
Correlations of the GPT estimates with the various other measures of valence. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs for correlations between AI-generated estimates and human ratings was 880; for correlations among AI-generated estimates it was 70,000.

Figure 4
Hierarchical cluster analysis based on the Spearman correlations, showing that the GPT valence estimates correlate closely with the human valence ratings (based on corr_cluster from the AnalysisLin R package; Lin, 2024).

Figure 5
Correlations of the GPT estimates with the various other measures of arousal. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs for correlations between AI-generated estimates and human ratings was 880; for correlations among AI-generated estimates it was 70,000.

Figure 6
Hierarchical cluster analysis based on the Spearman correlations, showing that the GPT estimates of arousal tend to be less closely correlated with the human data than are the AI estimates of Köper & Schulte im Walde (2016) (based on corr_cluster from the AnalysisLin R package; Lin, 2024).

Figure 7
Correlations of the GPT estimates with the various other measures of AoA. Above the diagonal: Spearman correlations; below the diagonal: Pearson correlations. Note that the number of data pairs differs per cell, depending on the size of the datasets involved. The minimum number of data pairs for correlations between AI-generated estimates and human ratings was 497; for correlations among AI-generated estimates it was 12,900.

Figure 8
Same as Figure 7, with GPT_ft added as an extra variable. The fine-tuned GPT estimates correlate more strongly with the other measures than the original GPT estimates do.

Figure 9
Distribution of the AoA values for human ratings, GPT estimates, and GPT fine-tuned estimates.
Table 1
Spearman correlations between human familiarity ratings, GPT estimates, and word frequency norms (above the diagonal). Below the diagonal: the number of data pairs on which each correlation is based.
| | SCHRÖDER_12 | SCHRÖTER_17 | LINGUAPIX | XU_25 | GPT_GER | GPT_ENG | SUBTLEX | MULTILEX | CHILDLEX | CHILDLEX_LEM |
|---|---|---|---|---|---|---|---|---|---|---|
| Schröder_12 | | .76 | .45 | .42 | .67 | .64 | .42 | .42 | .43 | .46 |
| Schröter_17 | 116 | | .37 | .46 | .55 | .53 | .65 | .68 | .46 | .53 |
| LinguaPix | 213 | 255 | | .33 | .41 | .40 | .17 | .18 | .15 | .14 |
| Xu_25 | 31 | 387 | 73 | | .59 | .61 | .48 | .45 | .31 | .29 |
| GPT_Ger | 820 | 1152 | 1248 | 880 | | .95 | .70 | .72 | .60 | .60 |
| GPT_Eng | 820 | 1152 | 1248 | 880 | 3195 | | .65 | .67 | .54 | .53 |
| Subtlex | 636 | 1150 | 1001 | 880 | 2769 | 2769 | | .98 | .72 | .77 |
| Multilex | 767 | 1152 | 1144 | 880 | 3042 | 3042 | 2768 | | .72 | .80 |
| Childlex | 605 | 1148 | 984 | 874 | 2720 | 2720 | 2608 | 2706 | | .88 |
| Childlex_lem | 605 | 1148 | 984 | 874 | 2720 | 2720 | 2608 | 2706 | 2720 | |
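
The sketch below illustrates, with a hypothetical data frame `dat` (one row per word, NA where a word is missing from a dataset), how a matrix of this kind can be assembled in R: pairwise-complete Spearman correlations above the diagonal and the corresponding pairwise Ns below it, which is why the number of data pairs differs per cell.

```r
# Sketch of a Table-1-style matrix: Spearman r above the diagonal, pairwise N
# below. `dat` is a hypothetical data frame with one row per word and NA where
# a word is missing from a dataset.
dat <- data.frame(
  Schroeder_12 = c(5.1, NA, 3.2, 4.4, NA),
  GPT_Ger      = c(5.0, 2.1, 3.5, 4.2, 1.8),
  Multilex     = c(4.7, 2.6, NA, 4.0, 2.2)
)

# Correlations use pairwise-complete observations, so each cell can be based
# on a different number of word pairs.
r <- cor(dat, method = "spearman", use = "pairwise.complete.obs")

# Pairwise N: for each pair of columns, count rows where both are observed.
n <- crossprod(!is.na(as.matrix(dat)))

# Combine: correlations above the diagonal, Ns below, blank diagonal.
out <- matrix("", nrow(r), ncol(r), dimnames = dimnames(r))
out[upper.tri(out)] <- sprintf("%.2f", r[upper.tri(r)])
out[lower.tri(out)] <- n[lower.tri(n)]
out
```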

Figure 10
Distribution of the GPT estimates for the German and the English prompt. Most familiarity estimates with the German prompt are around seven, whereas the mode with the English prompt is six.

Figure 11
Human ratings as a function of the GPT-FAM estimates and Multilex word frequency. Dark red = low human rating; light yellow = high human rating. Black dots indicate the 2010 rated words. Uncolored regions contain no stimuli.
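
A Figure-11-style plot can be approximated with the ggplot2 sketch below, which colors each rated word by its human rating rather than drawing an interpolated surface; the data frame `words` and its column names are hypothetical placeholders.

```r
# Hedged ggplot2 sketch of a Figure-11-style plot: each rated word as a point
# positioned by its GPT-FAM estimate and Multilex frequency, coloured by the
# human rating. The data frame `words` is a hypothetical placeholder.
library(ggplot2)

words <- data.frame(
  gpt_fam  = runif(200, 1, 7),
  multilex = runif(200, 0, 6),
  rating   = runif(200, 1, 7)
)

ggplot(words, aes(x = gpt_fam, y = multilex, colour = rating)) +
  geom_point() +
  scale_colour_gradient(low = "darkred", high = "lightyellow") +
  labs(x = "GPT-FAM estimate", y = "Multilex frequency",
       colour = "Human rating")
```
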
[Rating scale illustration: a 7-point scale from 1 = "überhaupt nicht" (not at all) to 7 = "sehr gut" (very well).]
Table 2
Words with the largest differences between GPT familiarity estimates and human ratings (Diff = Rating − GPT_FAM). Top: words with lower human ratings than GPT estimates. Bottom: words with higher human ratings than GPT estimates.
| WORD | RATING | MULTILEX | GPT_FAM | DIFF |
|---|---|---|---|---|
| herüben | 2.38 | 1.61 | 5.73 | –3.36 |
| ingeniös | 1.83 | 0.71 | 5.03 | –3.20 |
| wütig | 2.85 | 1.01 | 5.97 | –3.12 |
| dichtbei | 1.97 | 1.01 | 5.00 | –3.03 |
| Matthäuspassion | 2.53 | 1.01 | 5.05 | –2.52 |
| Exonym | 1.26 | 0.30 | 3.71 | –2.45 |
| spitzig | 3.71 | 1.71 | 5.99 | –2.28 |
| darren | 1.17 | 3.45 | 3.45 | –2.28 |
| Sekunda | 1.78 | 0.71 | 4.01 | –2.23 |
| überwach | 2.88 | 1.97 | 5.03 | –2.15 |
| huren | 5.97 | 3.89 | 2.14 | 3.83 |
| Endlösung | 5.62 | 2.35 | 1.39 | 4.23 |
| Schützenkönig | 5.28 | 1.66 | 1.04 | 4.24 |
| Kinderschänder | 6.24 | 2.90 | 1.90 | 4.34 |
| Neger | 5.42 | 3.68 | 1.06 | 4.35 |
| Ecktisch | 5.75 | 2.05 | 1.06 | 4.70 |
| Fotze | 6.02 | 3.73 | 1.17 | 4.85 |
| Hitlergruß | 6.02 | 1.66 | 1.10 | 4.92 |
| Hurensohn | 6.32 | 3.97 | 1.01 | 5.31 |
| Hakenkreuz | 6.35 | 2.71 | 1.02 | 5.33 |
Table 3
Correlations between the newly collected human ratings and existing variables. In brackets: the 95% confidence interval.
| STUDY | Nstim IN COMMON | SPEARMAN CORRELATION |
|---|---|---|
| Schröder_12 | 331 | .67 [.60–.72] |
| Schröter_17 | 622 | .33 [.26–.40] |
| LinguaPix | 1,489 | .29 [.25–.34] |
| Xu_25 | 355 | .26 [.16–.36] |
| Multilex word frequency | 7,578 | .68 [.67–.69] |
| Untuned GPT-FAM estimates | 10,540 | .85 [.85–.86] |
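
As an illustration, the sketch below shows one common way a 95% confidence interval for a Spearman correlation can be obtained, via a simple bootstrap over items. This is a generic example with hypothetical vectors x and y, not necessarily the method used for Table 3.

```r
# Spearman correlation with a bootstrapped 95% CI (generic sketch; x and y are
# hypothetical vectors, not the actual rating data).
set.seed(1)
x <- rnorm(500)
y <- x + rnorm(500)

rho <- cor(x, y, method = "spearman")

boot_rho <- replicate(2000, {
  i <- sample(length(x), replace = TRUE)
  cor(x[i], y[i], method = "spearman")
})
ci <- quantile(boot_rho, c(0.025, 0.975))

round(c(rho = rho, ci), 2)
```
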
Table 4
Spearman correlations of item accuracy with Multilex word frequency, GPT-FAM, GPT-FAM-ft, and human ratings for three vocabulary tests.
| TEST | Nstim | MULTILEX | GPT-FAM | GPT-FAM-ft | RATING |
|---|---|---|---|---|---|
| GAudI | 85 | .375 | .555 | .634 | .669 |
| PPVT | 72 | .527 | .556 | .622 | .663 |
| NOVA | 110 | .763 | .760 | .845 | .866 |
Table 5
Spearman correlations for the three lexical decision experiments described in Brysbaert et al. (2011) and the three studies described in Günther et al. (2020): accuracy, reaction time (RT), and gaze duration. -- indicates that there were not enough data because the stimuli were not part of the rating studies.
| STUDY | Nstim | MULTILEX | GPT-FAM | GPT-FAM-ft | RATING |
|---|---|---|---|---|---|
| S1_acc | 460 | .468 | .486 | .499 | .445 |
| S1_RT | 460 | –.700 | –.656 | –.650 | –.539 |
| S2_acc | 451 | .595 | .676 | .740 | -- |
| S2_RT | 451 | –.662 | –.731 | –.766 | -- |
| S3_acc | 2154 | .527 | .594 | .644 | .673 |
| S3_RT | 2154 | –.493 | –.523 | –.553 | –.531 |
| G20_RT1 | 1810 | –.376 | –.397 | –.422 | –.415 |
| G20_RT2 | 1810 | –.429 | –.488 | –.504 | –.490 |
| G20_gaze | 1810 | –.338 | –.336 | –.335 | –.317 |

Figure 12
Spearman correlations with Childlex lemma frequency (red), Multilex frequency (purple), GPT-FAM (blue) and GPT-FAM-ft (green) for accuracy (left panel) and RT (right panel) in the lexical decision task of Schröter & Schroeder (2017).

Figure 13
Familiarity estimates for English cognates and control words. Left panel: GPT-FAM; right panel: GPT-FAM-ft.
