Bridging the Gaps: Integrating Bibliographic Metadata Into Wikidata for Literary Corpora

Open Access | Feb 2026

References

  1. Algee-Hewitt, M., Porter, J. D., & Walser, H. (2020). Representing Race and Ethnicity in American Fiction, 1789–1920. Journal of Cultural Analytics, 5(2). 10.22148/001c.18509
  2. Almeida, P. D., Rocha, J. G., Ballatore, A., & Zipf, A. (2016). Where the streets have known names. In Computational Science and Its Applications – ICCSA 2016. Cham: Springer International Publishing, pp. 1–12. 10.1007/978-3-319-42089-9_1
  3. Bagga, S., & Piper, A. (2022). HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust. Journal of Open Humanities Data, 8, 7. 10.5334/johd.71
  4. Candela, G. (2023). An automatic data quality approach to assess semantic data from cultural heritage institutions. J. Am. Soc. Inf. Sci., 74(7), 866–878. 10.1002/asi.24761
  5. Conroy, M. (2023). Quantifying the Gap: The Gender Gap in French Writers’ Wikidata. Journal of Cultural Analytics, 8(2). 10.22148/001c.74068
  6. de Beyssat, C. D. (2025). Victims of Posterity. Identifying Gaps on 19th-Century French Art History with Wikidata. Journal of Open Humanities Data, 11(1), 59. 10.5334/johd.399
  7. Egloff, M., Picca, D., & Adamou, A. (2019). Extraction of character profiles from the Gutenberg Archive. In Metadata and Semantic Research. Cham: Springer, pp. 367–72. 10.1007/978-3-030-36599-8_32
  8. Egloff, M., & Picca, D. (2020). WeDH – a Friendly Tool for Building Literary Corpora Enriched with Encyclopedic Metadata. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 813–816. Marseille, France. European Language Resources Association. https://aclanthology.org/2020.lrec-1.101/
  9. Erlin, M., Piper, A., Knox, D., Pentecost, S., Drouillard, M., Powell, B., & Townson, C. (2021). Cultural Capitals: Modeling Minor European Literature. Journal of Cultural Analytics, 6(1). 10.22148/001c.21182
  10. Fischer, F., Blakesley, J., Wojcik, P., & Jäschke, R. (2023). Preface: World Literature in an Expanding Digital Space. Journal of Cultural Analytics, 8(2). 10.22148/001c.74598
  11. Fischer, F., Börner, I., & Göbel, M. (2019). Programmable corpora: Introducing DraCor, an infrastructure for the research on European drama. In Digital Humanities 2019: Conference Abstracts. Utrecht: Utrecht University. 10.5281/zenodo.4284001
  12. Gittel, B. (2021). An Institutional Perspective on Genres: Generic Subtitles in German Literature from 1500–2020. Journal of Cultural Analytics, 6(1). 10.22148/001c.22086
  13. Hamilton, S., & Piper, A. (2023). MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library. Journal of Open Humanities Data, 9(1), 3. 10.5334/johd.95
  14. Huynh, D. (2012). OpenRefine 3.5. https://openrefine.org/
  15. Langer, L., Burghardt, M., Borgards, R., Böhning-Gaese, K., Seppelt, R., & Wirth, C. (2021). The rise and fall of biodiversity in literature: A comprehensive quantification of historical changes in the use of vernacular labels for biological taxa in Western creative literature. People and Nature, 3(5), 1093–1109. 10.1002/pan3.10256
  16. Manske, M. (2019). QuickStatements 2.0. https://quickstatements.toolforge.org/
  17. Müller, S., Brunzel, M., Kaun, D., Biswas, R., Koutraki, M., Tietz, T., & Sack, H. (2019). HistorEx: exploring historical text corpora using word and document embeddings. In The Semantic Web: ESWC 2019 Satellite Events. Lecture Notes in Computer Science. Cham: Springer International Publishing, pp. 136–40. 10.1007/978-3-030-32327-1_27
  18. Nešić, M. I., Stanković, R., Schöch, C., & Skoric, M. (2022). From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back). In Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, pp. 7–16, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.ldl-1.2/
  19. Reeve, J. (2020). Corpus-DB: a scriptable textual corpus database for cultural analytics. In Digital Humanities 2020: Conference Abstracts, Ottawa: Carleton University and Université d’Ottawa (University of Ottawa), p. 230.
  20. Rohrbacher, K. (2025). de-Corp: A Corpus of German-language Fiction and Non-Fiction (1780–1930). Journal of Open Humanities Data, 11(1), 51. 10.5334/johd.350
  21. Röttgermann, J. (2024). The Collection of Eighteenth-Century French Novels 1751–1800. Journal of Open Humanities Data, 10(1), 31. 10.5334/johd.201
  22. Schelbert, G., & Müller, M. (2023, March 10). Sammlungsdaten mit Wikidata anreichern und für die Vernetzung öffnen. Konzepte und praktische Erprobungen. DHd 2023 Open Humanities Open Culture. 9. Tagung des Verbands “Digital Humanities im deutschsprachigen Raum” (DHd 2023), Trier, Luxemburg. 10.5281/zenodo.7715472
  23. Soudani, A., Meherzi, Y., Bouhafs, A., Frontini, F., Brando, C., Dupont, Y., & Mélanie-Becquet, F. (2019). Adapting a system for named entity recognition and linking for 19th century French novels. In Digital Humanities 2019: Conference Abstracts, Utrecht: Utrecht University. https://hal.science/hal-02187283v1
  24. Teichmann, L. (2025). The “Mapping German fiction in translation” dataset: Data collection, scope, and data quality. Journal of Cultural Analytics, 10(1). 10.22148/001c.128010
  25. Tharani, K. (2021). Much more than a mere technology: A systematic review of Wikidata in libraries. The Journal of Academic Librarianship, 47(2), 102326. 10.1016/j.acalib.2021.102326
  26. Underwood, T. (2019). Distant Horizons: Digital Evidence and Literary Change. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/D/bo35853783.html
  27. Underwood, T. (2021). “HathiTrust Post45 Fiction (1945–2013).” Edited by Dan Sinykin and Laura McGrath. Post45 Data Collective, April. data.post45.org/hathitrust-post45-fiction/
  28. Underwood, T., Kimutis, P., & Witte, J. (2020). NovelTM Datasets for English-Language Fiction, 1700–2009. Journal of Cultural Analytics, 5(2). 10.22148/001c.13147
  29. Wilkens, M. (2021). Too isolated, too insular: American Literature and the World. Journal of Cultural Analytics, 6(3). 10.22148/001c.25273
  30. Wikidata. (n.d.-a). Wikidata: Main Page. Wikimedia Foundation. Retrieved October 16, 2025, from https://www.wikidata.org/wiki/Wikidata:Main_Page
  31. Wikidata. (n.d.-b). Wikidata: Notability. Wikimedia Foundation. Retrieved December 28, 2025, from https://www.wikidata.org/wiki/Wikidata:Notability
  32. Wikidata. (n.d.-c). Mass-editing policy. Wikimedia Foundation. Retrieved October 20, 2025, from https://www.wikidata.org/wiki/Wikidata_talk:Requests_for_comment/Mass-editing_policy
  33. Wojcik, P., Bunzeck, B., & Zarrieß, S. (2023). The Wikipedia Republic of Literary Characters. Journal of Cultural Analytics, 8(2). 10.22148/001c.70251
  34. Wolfe, E. (2019). Natural Language Processing in the Humanities: A Case Study in Automated Metadata Enhancement. The Code4Lib Journal, 46. https://journal.code4lib.org/articles/14834
  35. Zhao, F. (2023). A systematic review of Wikidata in Digital Humanities projects. Digital Scholarship in the Humanities, 38(2), 852–874. 10.1093/llc/fqac083
DOI: https://doi.org/10.5334/johd.483 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 17, 2025 | Accepted on: Jan 9, 2026 | Published on: Feb 27, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Katrin Rohrbacher, David Schrittesser, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.