Machine Learning Assisted Conversion of Biographical Records Into Wikidata Triples

Daniel Baránek

doi:10.5334/johd.466

Abstract

Wikidata depends substantially on the outputs of research institutions, notably biographical dictionaries. The present paper examines the extent to which information from biographical dictionaries has already been transferred to Wikidata. Specifically, it focuses on the Biographisches Lexikon zur Geschichte der böhmischen Länder (BLGBL), a substantial encyclopaedia on the history of Central Europe. Unlike many digitised academic encyclopaedias, the BLGBL was until recently available in digital form only as scans and lacked Wikidata identifiers, which would have made the flow of information easier to monitor.

Consequently, the initial section of the article is dedicated to the methods and procedures that facilitate the conversion of printed information into structured data, thereby enabling subsequent analysis. The development of machine learning models was driven by two primary objectives: firstly, to segment and read individual biographical entries; and secondly, to convert the text of the entries into structured data. According to various metrics, the latter model is approximately 50% reliable. The article therefore also addresses the question of how much further training could improve accuracy.

The analysis of the transfer of information into Wikidata highlights the largely untapped potential of the BLGBL and, hypothetically, of other undigitised academic encyclopaedias. More than 25% of the BLGBL entries have no corresponding entity on Wikidata, and even existing entities contain significant gaps, missing almost all information about important family members, schools attended, professional careers, organisational memberships, and awards.

