Machine Learning Assisted Conversion of Biographical Records Into Wikidata Triples

By: Daniel Baránek  
Open Access | Jan 2026

Abstract

Wikidata is substantially dependent on the outputs of research institutions, notably biographical dictionaries. The present paper examines the extent to which information from biographical dictionaries has already been transferred to Wikidata. Specifically, it focuses on the Biographisches Lexikon zur Geschichte der böhmischen Länder (BLGBL), a substantial encyclopaedia on the history of Central Europe. In contrast to numerous digitised academic encyclopaedias, the BLGBL was until recently available digitally only as scans and lacked Wikidata identifiers, which would have facilitated the monitoring of the flow of information.

Consequently, the initial section of the article is dedicated to the methods and procedures that facilitate the conversion of printed information into structured data, thereby enabling subsequent analysis. The development of machine learning models was driven by two primary objectives: firstly, to segment and read individual biographical entries; and secondly, to convert the text of the entries into structured data. According to various metrics, the latter model is approximately 50% reliable. The article therefore also addresses the question of how much further training could improve accuracy.

The analysis of the transfer of information into Wikidata highlights the largely untapped potential of the BLGBL and, hypothetically, of other undigitised academic encyclopaedias. More than 25% of the BLGBL entries have no corresponding entity on Wikidata, and even existing entities contain significant gaps, missing almost all information about important family members, schools attended, professional careers, organisational memberships, and awards.
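
To make the target format concrete, the following minimal Python sketch shows how an already structured biographical record might be flattened into Wikidata-style (subject, property, value) triples. It is an illustration under stated assumptions, not the article's actual pipeline: the field names, the `record_to_triples` helper, and the placeholder subject and values are hypothetical. Only the property identifiers (P569, P19, P69, P106, P463, P166) are real Wikidata properties.

```python
from typing import NamedTuple

# Real Wikidata property IDs; the dictionary keys are hypothetical field
# names chosen for this sketch, not the schema used in the article.
PROPERTY_MAP = {
    "date_of_birth": "P569",
    "place_of_birth": "P19",
    "educated_at": "P69",
    "occupation": "P106",
    "member_of": "P463",
    "award_received": "P166",
}


class Triple(NamedTuple):
    subject: str
    prop: str
    value: str


def record_to_triples(subject: str, record: dict) -> list[Triple]:
    """Flatten one structured record into (subject, property, value) triples.

    Fields without a known property mapping are skipped; list-valued
    fields yield one triple per item.
    """
    triples = []
    for field, value in record.items():
        prop = PROPERTY_MAP.get(field)
        if prop is None:
            continue  # unmapped field: ignored in this sketch
        values = value if isinstance(value, list) else [value]
        triples.extend(Triple(subject, prop, v) for v in values)
    return triples


# Placeholder data only; no real person or Wikidata item is implied.
example = {
    "date_of_birth": "1850-01-01",
    "educated_at": ["EXAMPLE_SCHOOL_A", "EXAMPLE_SCHOOL_B"],
    "nickname": "unmapped",
}
for triple in record_to_triples("Q_PLACEHOLDER", example):
    print(triple.subject, triple.prop, triple.value)
```

In practice, the values would need to be reconciled against existing Wikidata items rather than stored as plain strings, which this sketch deliberately leaves out.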

DOI: https://doi.org/10.5334/johd.466 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 10, 2025 | Accepted on: Dec 27, 2025 | Published on: Jan 16, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Daniel Baránek, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.