Abstract
This paper presents a case study on enhancing literary-corpus metadata by integrating large-scale bibliographic resources with Wikidata. Digital libraries such as Project Gutenberg and HathiTrust often provide only minimal metadata (e.g., author name and title). For large-scale literary analysis, however, additional information such as year of publication, author gender, genre, or publisher is crucial. At the same time, using Wikidata to enrich existing literary-corpus metadata is challenging, as significant gaps in its coverage remain. In this case study, we draw on the metadata of a large literary corpus to address these gaps. We conduct a feasibility analysis to determine how a workflow that integrates metadata from bibliographic catalogues into Wikidata can be established as a step in the digital-humanities pipeline. We explore both procedural approaches and existing software tools and discuss the resulting challenges and limitations. Our methods are documented and open source; the full Python scripts and data-processing workflows are publicly available on GitHub.1 The goal is to develop reproducible methods for sharing and improving metadata availability across open platforms.
