de-Corp: A Corpus of German-language Fiction and Non-Fiction (1780–1930)

Katrin Rohrbacher

doi:10.5334/johd.350

Figures & Tables

Fiction Dataset. Distribution of books, sentences, and tokens across decades (1780–1930). The y-axis shows counts: the number of books (blue) is given in absolute values, while sentence counts (orange) are divided by 10,000 and token counts (green) by 100,000 for visualization purposes. This scaling applies to all subsequent figures.
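The scaling described in the caption can be sketched in a few lines. This is only an illustration: the divisors come from the caption, but the function name and the per-decade counts below are hypothetical and are not taken from de-Corp.

```python
# Divisors as stated in the figure caption.
SENTENCE_DIVISOR = 10_000
TOKEN_DIVISOR = 100_000


def scale_for_plot(books, sentences, tokens):
    """Return (books, scaled sentences, scaled tokens) for a shared y-axis.

    Book counts stay absolute; sentence and token counts are divided
    by the caption's divisors.
    """
    return (books, sentences / SENTENCE_DIVISOR, tokens / TOKEN_DIVISOR)


# Hypothetical counts for a single decade (not real corpus figures).
decade_counts = {
    1850: {"books": 40, "sentences": 150_000, "tokens": 2_500_000},
}

scaled = {
    decade: scale_for_plot(**counts)
    for decade, counts in decade_counts.items()
}
```

To read a plotted value back as a raw count, multiply by the same divisor: a sentence bar at height 15 corresponds to 150,000 sentences.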

Table 1

Descriptive statistics for token counts per text in the fiction corpus. Values indicate minimum, maximum, median, mean, and standard deviation of token counts across texts.⁴

Statistic	Tokens per text
Min	658
Max	374,856
Median	48,980
Mean	58,995
Std. Dev.	45,769
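As a rough sketch of how such a summary could be computed from a list of per-text token counts, the snippet below uses Python's standard `statistics` module. The `describe` helper and the example counts are hypothetical. The source does not say whether the reported standard deviation is the sample or the population variant; the sample variant is assumed here.

```python
import statistics


def describe(token_counts):
    """Summarize per-text token counts with the five statistics in the table."""
    return {
        "min": min(token_counts),
        "max": max(token_counts),
        "median": statistics.median(token_counts),
        "mean": statistics.mean(token_counts),
        # Assumption: sample standard deviation (n - 1 denominator).
        # Use statistics.pstdev for the population variant instead.
        "std_dev": statistics.stdev(token_counts),
    }


# Hypothetical token counts for five texts (not real corpus figures).
example = [1_000, 2_000, 3_000, 4_000, 10_000]
summary = describe(example)
```

In this toy example, as in the table, the mean lies above the median, which is what a few very long texts in an otherwise shorter collection would produce.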

Non-Fiction Dataset. Distribution of books, sentences, and tokens across decades (1780–1940).

Table 2

Descriptive statistics for token counts per text in the non-fiction corpus.

Statistic	Tokens per text
Min	2,583
Max	978,656
Median	64,298
Mean	80,670
Std. Dev.	75,761

Non-Fiction Dataset. Literary sub-genres.⁸


