Abstract
de-Corp is a corpus of approximately 5,000 German-language texts compiled from the German and U.S. Project Gutenberg libraries. It covers fiction published between 1780 and 1930 and non-fiction published between 1780 and 1940. The corpus includes detailed metadata on genre, publication year, and author gender, and comprises over 300 million tokens by more than 1,400 unique authors. It supports large-scale historical and literary analysis and is especially valuable for research in Computational Literary Studies and Computational Linguistics.
