Abstract
Wikidata is substantially dependent on the outputs of research institutions, notably biographical dictionaries. The present paper examines the extent to which information from biographical dictionaries has already been transferred to Wikidata. Specifically, it focuses on the Biographisches Lexikon zur Geschichte der böhmischen Länder (BLGBL), a substantial encyclopaedia on the history of Central Europe. Unlike many digitised academic encyclopaedias, the BLGBL was until recently available digitally only as scans and lacked Wikidata identifiers, which would have made the flow of information easier to monitor.
Consequently, the first section of the paper is dedicated to the methods and procedures used to convert the printed information into structured data, thereby enabling subsequent analysis. Machine learning models were developed for two purposes: firstly, to segment and read individual biographical entries; and secondly, to convert the text of the entries into structured data. According to various metrics, the latter model is approximately 50% reliable. The paper therefore also addresses the question of how much further training could improve accuracy.
The analysis of the transfer of information into Wikidata highlights the largely untapped potential of the BLGBL and, hypothetically, of other undigitised academic encyclopaedias. More than 25% of the BLGBL entries have no corresponding entity on Wikidata, and even the existing entities contain significant gaps: they lack almost all information about important family members, schools attended, professional careers, organisational memberships, and awards.
