de-Corp: A Corpus of German-language Fiction and Non-Fiction (1780–1930)

Katrin Rohrbacher

doi:10.5334/johd.350

Abstract

de-Corp is a corpus of approximately 5,000 German-language texts compiled from the German and U.S. Project Gutenberg libraries: fiction published between 1780 and 1930, and non-fiction published between 1780 and 1940. It includes detailed metadata on genre, publication year, and author gender, and comprises over 300 million tokens from more than 1,400 unique authors. The dataset supports large-scale historical and literary analysis and is especially valuable for research in Computational Literary Studies and Computational Linguistics.

