de-Corp: A Corpus of German-language Fiction and Non-Fiction (1780–1930)

Katrin Rohrbacher

doi:10.5334/johd.350

(1) Overview

Repository location

https://doi.org/10.5281/zenodo.15714828

Context

This data was gathered as part of a dissertation project focusing on modeling fictional worlds and narrative space in German-language literature using quantitative methods. It includes both fiction and non-fiction works from the German and U.S. Project Gutenberg libraries, with fictional prose making up the majority. In addition to basic metadata such as “author,” “title,” and “year,” the corpus also provides more fine-grained metadata on genre (for both the fiction and non-fiction subcorpora), as well as information on author gender. In total, the corpus comprises approximately 5,000 books spanning about 150 years, from 1780 to 1930. It is therefore particularly well suited for researchers in Computational Literary Studies (CLS), computational linguistics, and related fields interested in analyzing literary and cultural phenomena on a large scale.

It should also be noted that the “original” version of the dataset includes works published up to 1940 that are available in the American Gutenberg Library (PG-US) but not in the German Gutenberg Library (PG-DE), due to differences in copyright law. In accordance with German copyright regulations, only works that are no longer under copyright in Germany have been made publicly available as full texts.

(2) Method

The majority of the data is drawn from the first digital edition of PG-DE, which presents texts in a browsable HTML format and groups them by genre and sub-genre (Projekt Gutenberg-DE, 2025).¹ The digital edition consists of all books published in the digital library until August 2023. In total, the PG-DE Digital Edition comprises 11,784 texts (fiction and non-fiction, as well as translations into German) by 2,476 unique authors, spanning roughly the years 1600 to 1940.

The corpus presented here includes all fiction and non-fiction works published in the Digital Edition between 1780 and 1930 (or 1940, respectively), with translations into German excluded. For the fiction data, 391 additional works from PG-US that were not part of PG-DE were added, covering the same time span.² Thus, the resulting dataset focuses on modern German-language prose from the late 18th through the early 20th century, supplemented by a comparative non-fiction corpus from PG-DE covering the same temporal scope.

Both PG-DE and PG-US are continuously updated. The collection at hand therefore represents a snapshot: it contains the PG-US data available up to September 2023, when this dataset was created, together with the works collected and listed in the German Gutenberg Edition.

Unlike platforms such as HathiTrust or Google Books, which primarily aggregate digital collections from university libraries, PG-DE operates through a volunteer-driven model similar to PG-US and other related projects, aiming to digitize and preserve literary texts. Most works are selected based on analog bibliographies and literary lexicons spanning the 19th century to the present, although in some cases they are chosen based on reader requests. Texts are acquired from antiquarian bookshops, scanned, and processed using optical character recognition (OCR).³

The texts for PG-DE were scraped from the web version, whereas for PG-US I used the available .txt files. The scraped HTML files were converted to text format. Cleaning them was relatively straightforward, as they contained little “noise” in the form of page numbers or extraneous text unrelated to the book (such as copyright declarations, information about the proofreaders, changes made, etc.). Where such elements occurred in the .txt files drawn from PG-US, they were removed.
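As an illustration of the PG-US cleaning step, the following minimal sketch cuts a plain-text file down to the span between the START and END marker lines that many PG-US files contain. The marker patterns and the function name `strip_pg_us_boilerplate` are assumptions made for this example; the procedure actually used to build the corpus is not documented here and may differ.

```python
import re

# Marker lines that delimit the book text in many PG-US plain-text files.
# The wording varies between files ("THE" vs. "THIS"), so the patterns are
# deliberately loose. They are an assumption, not the corpus's own rule.
START_RE = re.compile(
    r"^\*\*\*\s*START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_pg_us_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    stop = end.start() if end and end.start() >= begin else len(raw)
    return raw[begin:stop].strip()


# Invented sample file, for demonstration only.
sample = (
    "The Project Gutenberg eBook of Beispiel\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK BEISPIEL ***\n"
    "Erstes Kapitel.\nEs war einmal.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK BEISPIEL ***\n"
    "License text follows here.\n"
)
print(strip_pg_us_boilerplate(sample))
```

Notes on proofreaders or changes made that appear inside the marked span would still need separate handling.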

Due to its rigorous proofreading efforts, Project Gutenberg is considered a “gold standard” among digitized corpora (Jiang et al., 2021). However, the various conversions from one format to another (e.g., OCR software, Word, XML) mean that the original page numbering, and with it information on book length, is lost in the books scanned by PG-DE. Although the metadata often lists the original length of the book in pages, this information is far from comprehensive.

In addition, some details, especially the publication year, are spotty. In most cases, these gaps are not the result of human error but reflect the contingency and incompleteness of historical records. PG-DE distinguishes between the publication date of the edition that was digitized and the first publication date. For lesser-known works, first publication dates are often unknown. Instead of an exact date, the publication date is sometimes given as a range, such as between 1923 and 1924.

In cases where a publication date was missing, I manually added the date. For dates where a range was provided, I used the earliest year listed (e.g., 1923). For works where no original date of publication could be found, I used the earliest publication date that I found for the volume in question. In the rare case where no publication date for a work could be found, I used the author’s death year as a proxy for the historical period in which the work is set. This approach aligns with similar methods of estimating publication dates employed by others when constructing large datasets (see Underwood et al., 2020). PG-US provides metadata only for “author” and “title”. As the majority of books were obtained from the digital edition, I manually supplemented the missing metadata for genre and year of publication for books from PG-US.
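The fallback order described above (first publication date, then the earliest date found for the volume, then the author's death year, with ranges resolved to their earliest year) can be sketched as a small function. The dates in the corpus were added manually, so this is only an illustration of the rule. The function name `resolve_year`, its parameters, and the year pattern are hypothetical.

```python
import re
from typing import Optional

# Four-digit years from 1500 to 2099; the bounds are an assumption of this sketch.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def resolve_year(
    first_pub: Optional[str],
    edition_pub: Optional[str],
    death_year: Optional[int],
) -> Optional[int]:
    """Pick one year per work, following the fallback order in the text."""
    for field in (first_pub, edition_pub):
        if field:
            years = [int(y) for y in YEAR_RE.findall(field)]
            if years:
                # A range such as "1923-1924" resolves to its earliest year.
                return min(years)
    # Last resort: the author's death year (may also be unknown).
    return death_year


print(resolve_year("zwischen 1923 und 1924", None, 1950))
print(resolve_year(None, "1891", 1900))
print(resolve_year(None, None, 1875))
```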

Information on author gender was retroactively added and determined using a mixed approach: whenever possible, information on gender was automatically retrieved from Wikidata; in cases where no entry was available, gender was assigned manually. Known pseudonyms were taken into account. However, the information may not be fully comprehensive, and occasional errors or omissions cannot be ruled out.

The final fiction corpus encompasses 4,215 books by 1,060 unique authors, comprising a total of 15,067,032 sentences and 248,665,215 tokens. The volumes span the years from 1780 to 1930, with the majority falling between 1850 and 1930. Refer to Figure 1 for the distribution of the number of books in the corpus over time, along with the distribution of sentences per 10,000 and tokens per 100,000. In terms of author gender, the majority of books (86%) in this corpus were written by male authors (m = 3,628; f = 587). Table 1 presents descriptive statistics of token counts per text, including the minimum, maximum, mean, median, and standard deviation.

Figure 1

Fiction Dataset. Distribution of books, sentences, and tokens across decades (1780–1930). The y-axis shows counts: the number of books (blue) is given in absolute values, while sentence counts (orange) are divided by 10,000 and token counts (green) by 100,000 for visualization purposes. This scaling applies to all subsequent figures.
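The per-decade scaling described in this caption can be reproduced along the following lines. This is a minimal sketch: the function name `decade_profile` and the tuple layout of the input records are assumptions for illustration, and the sample values are invented.

```python
from collections import defaultdict


def decade_profile(records):
    """Aggregate (year, n_sentences, n_tokens) tuples by decade.

    Book counts stay absolute; sentence totals are divided by 10,000 and
    token totals by 100,000, matching the scaling named in the caption.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for year, n_sentences, n_tokens in records:
        decade = (year // 10) * 10
        totals[decade][0] += 1
        totals[decade][1] += n_sentences
        totals[decade][2] += n_tokens
    return {
        decade: {
            "books": books,
            "sentences_per_10k": sentences / 10_000,
            "tokens_per_100k": tokens / 100_000,
        }
        for decade, (books, sentences, tokens) in sorted(totals.items())
    }


# Invented records, for demonstration only.
records = [(1853, 20_000, 300_000), (1857, 10_000, 200_000), (1901, 5_000, 100_000)]
for decade, row in decade_profile(records).items():
    print(decade, row)
```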

Table 1

Descriptive statistics for token counts per text in the fiction corpus. Values indicate minimum, maximum, median, mean, and standard deviation of token counts across texts.⁴

	TOKENS PER TEXT
Min	658
Max	374,856
Median	48,980
Mean	58,995
Std. Dev.	45,769
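Statistics of this kind can be recomputed from the per-text token counts in the metadata file. The sketch below uses Python's standard library and a hypothetical helper named `describe`. Whether the reported standard deviation is the sample (n - 1) or the population variant is not stated, so the use of `statistics.stdev` here is an assumption.

```python
import statistics


def describe(token_counts):
    """Return the five descriptive statistics reported in the table."""
    return {
        "min": min(token_counts),
        "max": max(token_counts),
        "median": statistics.median(token_counts),
        "mean": statistics.mean(token_counts),
        # Sample standard deviation (n - 1); an assumption of this sketch.
        "std_dev": statistics.stdev(token_counts),
    }


# Invented toy values, not corpus data.
print(describe([2, 4, 4, 4, 5, 5, 7, 9]))
```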

The genre designations for the German Gutenberg Edition were assigned by the Gutenberg Team following the Warengruppen-Systematik (WGS) of the Börsenverein des Deutschen Buchhandels, which provides a thematic classification system for books. Each work in the corpus was matched to the closest applicable WGS category. Since many of the texts are historical and do not always map neatly onto today’s book trade categories, some adjustments and interpretive decisions were necessary.⁵

See Figure 2 for an overview comparing the number of sub-genres to the number of “novels, novellas, and stories,” which constitute the majority in the dataset.

An overview of the diverse sub-genres (excluding “novels, novellas, and stories”) is presented in Figure 3. Among these, historical novels constitute the largest portion in the corpus, followed by “crime fiction” and “fairy tales”. Although the count for “horror,” “science fiction,” and “speculative fiction” books is relatively low, I chose not to merge any genres under a single banner (e.g., combining “speculative fiction” and “science fiction”) and opted to adhere to the categorization provided by the digital edition.

In addition to the fiction dataset, I compiled a smaller subset of non-fiction works using the texts available in the PG-DE edition. Figure 4 provides a historical overview of this dataset, showing the number of books per decade, along with the sentence and token counts for each decade. The non-fiction corpus comprises 765 books, 3,026,241 sentences, 61,712,752 tokens, and 412 unique authors, spanning the period from 1780 to 1940. See Table 2 for descriptive statistics of the non-fiction sample. Compared to the fiction corpus, the non-fiction dataset is considerably smaller in size. Data on author gender shows that the discrepancy between male and female authors is even more pronounced in the non-fiction corpus than in the fiction dataset: 706 books (92.5%) are authored by men, and only 57 (7.5%) by women.⁷

Figure 4

Non-Fiction Dataset. Distribution of books, sentences, and tokens across decades (1780–1940).

Table 2

Descriptive statistics for token counts per text in the non-fiction corpus.

	TOKENS PER TEXT
Min	2,583
Max	978,656
Median	64,298
Mean	80,670
Std. Dev.	75,761

Similar to the fiction data, I have also included the more detailed genre metadata provided by the digital edition for the non-fiction corpus. Figure 5 shows the count of books and sentences categorized by genre, highlighting that “travelogues” and “history” constitute the largest segments of the dataset, followed by “philosophy.”

Figure 5

Non-Fiction Dataset. Literary sub-genres.⁸

(3) Dataset Description

Repository name

Zenodo

Object name

de-corp_fiction.csv, de-corp_non_fiction.csv, de-corp_txt.zip

Format names and versions

.CSV, .txt

Creation dates

2023-09-12 to 2023-12-30

Dataset creator

Katrin Rohrbacher

Language

German; English for metadata

License

Public domain

Publication date

2025-06-22

(4) Reuse Potential

A number of German text datasets exist, but publicly available corpora of comparable size and metadata granularity remain relatively scarce. Compared with other large collections of German-language texts, such as the Deutsches Textarchiv (DTA) and TextGrid, the collection presented here improves on existing resources in both size and metadata detail. The DTA comprises approximately 6,500 texts (including around 780 literary works) spanning the years 1600 to 1900 and covering fiction, non-fiction, newspapers, and scientific texts.⁹ TextGrid's digital library includes a wide variety of works such as verse, drama, dictionaries, historical, artistic, and scientific texts, and features roughly 3,500 prose works, ranging from the beginning of printing to the first decades of the 20th century, including translations into German.¹⁰ Like the DTA, its texts are TEI/XML-encoded.

Another related resource is de-prose (Gius et al., 2021), which comprises 2,511 German-language prose texts from 1870 to 1920 with author gender metadata. The present collection builds on and extends de-prose by covering a broader temporal scope, including genre metadata, and adding a comparative non-fiction corpus.¹¹

The corpus presented here is particularly valuable for scholars interested in large-scale studies of literary and cultural phenomena in modern German prose. It has already been used to investigate the role of setting and its historical relevance (Rohrbacher, forthcoming), as well as the concept’s significance for narrative beginnings (Rohrbacher, 2025) using machine learning approaches.

More generally, the dataset’s cleaned, full-length texts make it well suited for a range of both supervised and unsupervised machine learning methods, including classification tasks, topic modeling, and sentiment analysis. Depending on the specific task, the corpus may require additional preprocessing – such as tokenization, part-of-speech tagging, dependency parsing or named entity recognition using existing libraries, such as stanza (Qi et al., 2020) or spaCy (Honnibal et al., 2020). The corpus also supports historical and periodical analyses across a broad timespan – from 1780 to 1930 in the fiction corpus and up to 1940 in the non-fiction corpus – and offers potential for comparative research across literary genres and sub-genres, author gender (with the caveat of a rather unbalanced distribution), and the fiction/non-fiction divide.

Given its detailed metadata, the dataset can also be subdivided for more specific tasks, such as studying literary or cultural phenomena within a particular historical period (e.g., Modernism) or genre (e.g., speculative fiction). Conversely, it can be extended to cover other literary periods (e.g., early modern German literature) or used as a subset for cross-cultural, multilingual studies.
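Such a subdivision amounts to filtering the metadata file by year and genre. The following sketch shows the idea on an inline sample. The column names (`year`, `genre`) and the genre label values are hypothetical and should be checked against the headers and values actually used in de-corp_fiction.csv.

```python
import csv
import io


def select_rows(rows, start_year, end_year, genre=None):
    """Keep rows whose year lies in [start_year, end_year] and, if a genre
    is given, whose genre label matches it exactly."""
    selected = []
    for row in rows:
        year = int(row["year"])
        if not (start_year <= year <= end_year):
            continue
        if genre is not None and row["genre"] != genre:
            continue
        selected.append(row)
    return selected


# Invented sample standing in for the metadata file; column names are assumed.
sample_csv = io.StringIO(
    "author,title,year,genre\n"
    "Autorin A,Titel 1,1895,spec fic\n"
    "Autor B,Titel 2,1912,spec fic\n"
    "Autor C,Titel 3,1912,crime\n"
)
rows = list(csv.DictReader(sample_csv))
subset = select_rows(rows, 1900, 1930, genre="spec fic")
print([row["title"] for row in subset])
```

To work on the real metadata, the `io.StringIO` sample would be replaced by an opened file handle, for example `open("de-corp_fiction.csv", newline="", encoding="utf-8")`; the encoding is likewise an assumption.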

Specifically, given its size and period coverage, it presents a valuable resource for CLS scholars seeking to test hypotheses and examine key narrative phenomena and their socio-historical relevance—for example, the distribution of geographical and fictional space (Grisot & Herrmann, 2023; Wilkens, 2021), domesticity in both fiction and non-fiction (e.g., travelogues) (Guhr et al., 2025), or characterization (Radak et al., 2024; Piper, 2023).

Notes

[1] For the latest (second) edition, see https://www.projekt-gutenberg.org/. Accessed: 2025-06-13.

[2] In the metadata file, works stemming from the German Gutenberg Edition are labeled as PG-DE, and works from the American Gutenberg Library as PG-US.

[3] For a more detailed overview of the process, see https://www.projekt-gutenberg.org/info/texte/Vom-Antiquariat-zum-E-Text.pdf.

[4] The number of tokens and sentences for each text is also provided in the metadata file.

[5] For more information on how the Gutenberg team used WGS in their classification of genres, see https://www.projekt-gutenberg.org/info/texte/lesetips.html (accessed: 2025-08-15); for WGS itself, see Pohl & Umlauf (2007).

[6] The genre labels used in the figures are English translations or abbreviated forms of the original German designations: Historische Kriminalromane und -fälle (“hist crime”), Historische Romane und Erzählungen (“hist novels”), Horror (“horror”), Humor, Satire (“satire”), Jugendliteratur (“young adult (YA)”), Krimis, Thriller, Spionage (“crime”), Märchen, Sagen, Legenden (“fairy tales”), Phantastische Literatur (“spec fic”), Romane, Novellen und Erzählungen (“novels, novellas”), Romanhafte Biographien (“bio fic”), Science Fiction (“sci fi”), and Spannung und Abenteuer (“adventure”).

[7] There are two books in the corpus where the authors are unknown, and their gender could thus not be determined.

[8] The English genre terms used in the figures correspond to the following original German designations: Geschichte (“history”), Natur (“nature”), Philosophie (“philosophy”), Praktisches (“self-help”), Reiseberichte, Reiseerzählungen (“travelogues”), and Religion (“religion”).

[9] Deutsches Textarchiv. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache. Available from http://www.deutschestextarchiv.de/. Accessed: 2025-08-15.

[10] Textgrid Repository. Available from https://textgridrep.org/. Accessed: 2025-08-15.

[11] Other related resources include Horstmann (2024) and Herrmann & Lauer (2017) for a larger but less structured collection of German language texts (which includes translations into German and non-fiction and requires additional cleaning); a corpus of 547 German canonical works compiled by Brottrager et al. (2022); and a multilingual collection of European literary texts that includes a set of 100 German novels compiled by Schöch et al. (2021).

Competing interests

The author has no competing interests to declare.