
Figure 1
Fiction Dataset. Distribution of books, sentences, and tokens across decades (1780–1930). The y-axis shows counts: the number of books (blue) is given in absolute values, while sentence counts (orange) are divided by 10,000 and token counts (green) by 100,000 for visualization purposes. This scaling applies to all subsequent figures.
Table 1
Descriptive statistics for token counts per text in the fiction corpus. Values indicate minimum, maximum, median, mean, and standard deviation of token counts across texts.4
| STATISTIC | TOKENS PER TEXT |
|---|---|
| Min | 658 |
| Max | 374,856 |
| Median | 48,980 |
| Mean | 58,995 |
| Std. Dev. | 45,769 |
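
As an illustration of how the summary values in a table of this kind can be derived, the following is a minimal sketch using Python's standard library. The function name `describe_token_counts` and the toy counts are hypothetical and are not taken from the corpus. The sketch also assumes the sample standard deviation (n - 1 denominator); the source does not state which variant was used.

```python
import statistics


def describe_token_counts(token_counts):
    """Summarise per-text token counts (hypothetical helper, for illustration only)."""
    if len(token_counts) < 2:
        raise ValueError("need at least two texts to compute a standard deviation")
    return {
        "min": min(token_counts),
        "max": max(token_counts),
        "median": statistics.median(token_counts),
        "mean": statistics.mean(token_counts),
        # Assumption: sample standard deviation (n - 1 denominator).
        # Use statistics.pstdev for the population variant instead.
        "std_dev": statistics.stdev(token_counts),
    }


# Toy token counts, one entry per text; these are NOT values from the corpus.
toy_counts = [1000, 2000, 3000, 4000, 10000]
summary = describe_token_counts(toy_counts)

for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

In the toy data, a single long text pulls the mean above the median, the same pattern visible in the table, where the mean exceeds the median.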

Figure 2
Fiction Dataset. Overview of genres.

Figure 3
Fiction Dataset. Literary sub-genres.6

Figure 4
Non-Fiction Dataset. Distribution of books, sentences, and tokens across decades (1780–1940).
Table 2
Descriptive statistics for token counts per text in the non-fiction corpus.
| STATISTIC | TOKENS PER TEXT |
|---|---|
| Min | 2,583 |
| Max | 978,656 |
| Median | 64,298 |
| Mean | 80,670 |
| Std. Dev. | 75,761 |

Figure 5
Non-Fiction Dataset. Literary sub-genres.8
