1 Introduction
The mutual dependence between Wikidata and academic production is continuously growing. On the one hand, research institutions use Wikidata to integrate their research into the Linked Open Data (LOD) ecosystem and to ensure compliance with FAIR standards. On the other hand, Wikidata – partly due to the No Original Research rule1 – heavily relies on authority data produced by research institutions.
Nevertheless, the integration of research data into Wikidata frequently involves the mere addition of an authority database identifier, rather than the enrichment of Wikidata entities with content data. Consequently, a significant proportion of Wikidata entities function primarily as hubs for authority identifiers or as a tool for performing authority control (Neubert, 2017; Fagerving, 2023), rather than as a source of content statements – such as date and place of birth or death in the case of persons. For example, 31,898 entities of persons born in Prague contain a total of 300,967 statements, of which 209,826 (70%) are authority identifiers.2 Only one-third of the statements are content statements. Authority identifiers exhibit a comparable prevalence among people born, for instance, in Berlin (70%) and London (72%).
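The figures cited above can, in principle, be reproduced with a query of the following shape. This is a minimal sketch, assuming the public Wikidata Query Service and the standard Wikibase RDF mapping (where authority identifiers are properties typed as wikibase:ExternalId); it mirrors the QLever query in note 2, although a result set of this size may time out on the public endpoint.

```python
import requests

# Count all statements vs. authority-identifier statements for people born in Prague (Q1085).
QUERY = """
SELECT (COUNT(*) AS ?statements) (SUM(?isIdentifier) AS ?identifierStatements) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P19 wd:Q1085 ;          # place of birth: Prague
          ?claim ?statementNode .
  ?property wikibase:claim ?claim .   # restrict ?claim to statement predicates
  BIND(IF(EXISTS { ?property wikibase:propertyType wikibase:ExternalId }, 1, 0)
       AS ?isIdentifier)
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "blgbl-gap-analysis-sketch/0.1"},
    timeout=120,
)
row = response.json()["results"]["bindings"][0]
total = int(row["statements"]["value"])
identifiers = int(row["identifierStatements"]["value"])
print(f"{identifiers}/{total} statements ({identifiers / total:.0%}) are authority identifiers")
```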
The primary objective of this paper is to identify content gaps between academic biographical dictionaries and Wikidata. In other words, it aims to detect types of information that are present in academic dictionaries but missing from Wikidata (cf. Redi et al., 2020). In order to undertake such an analysis, the Biographisches Lexikon zur Geschichte der böhmischen Länder (BLGBL) [Biographical Dictionary of the History of the Bohemian Lands] was selected as a case study for several reasons.
The BLGBL represents one of the most significant academic biographical lexicons for the study of Central European history. Individual issues have been published by the Collegium Carolinum in Munich since 1974, and the project has thus far progressed to the letter Su (2025). The earlier volumes (up to 1999) are accessible online only as scanned images without an OCR layer. Consequently, they are not available as a database containing full texts or identifiers. Transferring information from the BLGBL to Wikidata has therefore been laborious, requiring manual work with scanned volumes that lack a searchable full text. To illustrate the scale of this issue: as of October 2025, the BLGBL was referenced as a source in only one entity describing a human.3
The selection of the BLGBL thus offers an opportunity to address the following questions: What is the potential of encyclopaedias that are half a century old or more? Has information been transferred to Wikidata indirectly through various authoritative data sources and other lexicons over time? What kind of information has not been transferred to Wikidata? Does Wikidata suffer from a definable bias?
In order to facilitate content gap analysis, it was necessary to digitise the BLGBL scans in depth, specifically its first volume (nine issues, 1974–1979, letters A–H). The process involved the development of machine learning models for the purpose of machine-reading the printed encyclopaedia and converting the full text into structured data. Subsequently, these data were compared with the content of Wikidata. Thus, a substantial part of the paper is devoted to workflow and methodology.
2 Methods
2.1 Machine Learning Tools
During the in-depth digitisation process, it was necessary not only to read the scans and create full texts, but also to recognise the individual dictionary entries and their subcomponents (the biographical entry itself, the person’s works, and the sources). Despite the printed nature of the dictionary, OCR methods proved insufficient for such advanced text segmentation. Therefore, the handwritten text recognition (HTR) method was used. Transkribus4 was deemed unsuitable for digitising the BLGBL for two primary reasons. Firstly, the cost of digitising hundreds of pages would be prohibitive. Secondly, models created in Transkribus can only be used on that platform and cannot be exported and transferred elsewhere, resulting in a form of vendor lock-in. For these reasons, the open-source software eScriptorium5 was selected for the digitisation.
Two models were developed to convert scans into full text. The layout segmentation model (Baránek, 2024b) was designed to recognise regions such as headers, page numbers, and text columns, as well as lines that mark the beginning of individual biographical entries and their subsequent paragraphs (Figure 1). This model achieves an accuracy of 61.3%. The recognition model, developed to process mixed Czech-German texts, achieves an accuracy of 99.8%.

Figure 1
Application of the segmentation model to an unseen BLGBL page.
The main innovative tool is biography2wikidata (Baránek, 2024a), an NLP model for transforming the acquired full texts into structured data (triples). Jaskulski et al. (2025) sought to extract such data from the Polski Słownik Biograficzny [Polish Biographical Dictionary] using LLM prompts, achieving a relatively high success rate. However, the objective of the Wikidata Triples project was not only to extract text strings, but also to assign Wikidata identifiers to them. Therefore, a custom machine learning model was created.
The biography2wikidata model is a text-to-text transformer based on Google’s mT5 small model. Prior to manual annotation, a pre-trained model was developed. Specifically, the pre-training involved converting authentic biographical data from Wikidata into text strings that adhered to the format employed in the BLGBL.6 Consequently, the pre-trained model successfully structured the introductory text of the biographical entries into meaningful units (claims), identifying elements such as name, occupation, birth and death dates and places, and assigning Wikidata identifiers to the most common entities. This approach eliminated the need to begin annotation entirely from scratch, as initial biographical records were annotated with the assistance of the pre-trained model.
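As an illustration of this pre-training step, the following sketch generates one synthetic (input, target) pair from Wikidata facts. The BLGBL-style rendering and the linearised target format are approximations, and all identifiers and values below are placeholders; the exact serialisation used by biography2wikidata may differ.

```python
from dataclasses import dataclass

@dataclass
class PersonFacts:
    """Biographical facts for one person, as returned by a Wikidata query (cf. note 6)."""
    surname: str
    given_name: str
    occupation_label: str
    occupation_qid: str
    birth_date: str          # ISO date, e.g. "1848-04-02"
    birth_place_label: str
    birth_place_qid: str
    death_date: str
    death_place_label: str
    death_place_qid: str


def to_blgbl_style(p: PersonFacts) -> str:
    """Render the facts as a BLGBL-like entry header (approximate layout):
    'Surname, Given name, occupation, * d.m.yyyy Place, † d.m.yyyy Place'."""
    def de_date(iso: str) -> str:
        y, m, d = iso.split("-")
        return f"{int(d)}.{int(m)}.{y}"
    return (f"{p.surname}, {p.given_name}, {p.occupation_label}, "
            f"* {de_date(p.birth_date)} {p.birth_place_label}, "
            f"† {de_date(p.death_date)} {p.death_place_label}")


def to_target(p: PersonFacts) -> str:
    """Linearise the same facts as property=value pairs with Wikidata identifiers."""
    return " | ".join([
        f"P106={p.occupation_qid}",
        f"P569={p.birth_date}",
        f"P19={p.birth_place_qid}",
        f"P570={p.death_date}",
        f"P20={p.death_place_qid}",
    ])


# One synthetic (input, target) training pair; all values are illustrative placeholders.
example = PersonFacts("Mustermann", "Max", "Chemiker", "Q00000",
                      "1848-04-02", "Brünn", "Q00001",
                      "1921-06-30", "Prag", "Q00002")
print(to_blgbl_style(example))
print(to_target(example))
```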
Subsequent fine-tuning of the model was conducted using verified annotated texts from individual issues of the first BLGBL volume. The resulting model is based on a dataset containing more than 5,000 dictionary entries. The model annotates the full text of biographical entries (Figure 2) using Wikidata identifiers, thereby ensuring that the extracted data align structurally with Wikidata as closely as possible. This enables a direct comparison between the content of a dictionary entry and its corresponding Wikidata entity. After fine-tuning, the model reports a loss of 0.39.
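The fine-tuned checkpoint can then be applied to new entries with the standard Hugging Face transformers API. The sketch below assumes a locally stored copy of the published model (the path is a placeholder) and a plain-text entry produced by the HTR step; the model’s expected input formatting may differ.

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

MODEL_PATH = "path/to/biography2wikidata"   # placeholder: local copy of the published model

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_PATH)

# Full text of one dictionary entry, as produced by the eScriptorium/HTR step (placeholder).
entry = "<introductory text of one BLGBL entry>"

inputs = tokenizer(entry, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))   # linearised claims with QIDs
```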

Figure 2
Annotated full text of the entry ‘Donath, Eduard’, BLGBL, Vol. I, Issue 4, p. 271.
However, of greater relevance is the number of valid statements that the model is capable of extracting from the input. The evaluation test was performed on 100 unseen entries from the second BLGBL volume (Table 1). The data indicate that the model accurately retrieves approximately 60% of basic statements and 20% of qualifiers, resulting in an overall retrieval rate of 50% for basic and qualifier statements combined. While these figures fall short of 100% accuracy, the model segments the text effectively and significantly accelerates data annotation. When using the model, manual work is confined to verifying annotations and validating the correct assignment of identifiers to rarely occurring entities. The presence of such rare entities also precludes 100% model accuracy.
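The retrieval rates reported here can be computed by a straightforward set comparison between the model output and the manually verified gold annotations. The sketch below uses a simplified tuple representation with placeholder identifiers; the project’s actual evaluation script may differ.

```python
# Statements as (property, value) pairs; qualifiers as
# (property, value, qualifier_property, qualifier_value) tuples. All QIDs are placeholders.

def retrieval_rate(gold: set, predicted: set) -> float:
    """Fraction of gold statements that also appear in the model output."""
    return len(gold & predicted) / len(gold) if gold else 1.0


gold_claims = {("P569", "1848-04-02"), ("P106", "Q00000"), ("P69", "Q00001")}
predicted_claims = {("P569", "1848-04-02"), ("P106", "Q00000")}

gold_qualifiers = {("P69", "Q00001", "P512", "Q00002")}
predicted_qualifiers = set()

print(f"basic statements: {retrieval_rate(gold_claims, predicted_claims):.0%}")        # 67%
print(f"qualifiers:       {retrieval_rate(gold_qualifiers, predicted_qualifiers):.0%}")  # 0%
```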
2.2 Data Modelling
The primary concern in annotating the texts of the biographical entries was data modelling. For each entry, three interrelated questions were addressed: what information should be modelled, how it should be modelled, and how much detail should be included. The following examples illustrate each of these questions.
Firstly, what information should be modelled? The biographies in the BLGBL are concise and frequently composed of incomplete sentences. This is a significant advantage, as it allows complete information to be modelled for a substantial proportion of the entries. However, it is important to note that the BLGBL also contains entries that include text deviating from the subject and not directly related to the individual. This phenomenon is particularly evident in entries concerning entrepreneurs, which may include information about the factories they founded, the locations of factory branches, and the number of employees. Such information relating to factories rather than individuals was deliberately excluded from the modelling and annotation.
Secondly, how should the information be modelled? Simple statements, such as date and place of birth, or the names of parents or descendants, are straightforward to model. Appropriate properties for such claims exist on Wikidata, and qualifiers can be used to express nuances. However, more complex information can often be modelled in multiple ways. The intention was to follow the logic of Wikidata. Yet this is complicated by inconsistent data modelling practices within Wikidata itself. For instance, in order to identify a school where a teacher worked, three different properties are used as qualifiers: employer (P108, 799 statements), workplace (P937, 352 statements), and location (P276, 328 statements).7 It is not possible to simply choose the most commonly used qualifier, as historically university teachers often held unpaid positions as Privatdozent (Durdík, 1893). For this reason, the workplace property was considered the more reliable option when annotating BLGBL entries.
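To make the chosen modelling concrete, the following sketch shows how such a claim might be annotated: the teaching occupation as the main statement, with the school attached via the workplace property (P937) rather than employer (P108). The item identifiers and the start-time qualifier are illustrative placeholders, not the project’s actual annotation format.

```python
# One annotated claim in a simplified, Wikidata-like structure (placeholder QIDs).
claim = {
    "property": "P106",                  # occupation
    "value": "Q00000",                   # e.g. the item for a secondary-school teacher
    "qualifiers": [
        {"property": "P937", "value": "Q00001"},  # workplace: the school where the person taught
        {"property": "P580", "value": "+1895-00-00T00:00:00Z/9"},  # start time, year precision
    ],
}
```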
The situation is further complicated by the fact that Wikidata is a dynamic, evolving system. For example, during the course of the project, the community decided to remove the ‘of’ property (P642) and replace it with more specific properties.8 To maintain consistency with Wikidata, the annotation of the BLGBL full text has been adjusted accordingly.
Thirdly, how much detail should be modelled? A typical example concerns information about a person studying at a specific faculty of a university. Modelling requires a balanced approach between two fundamental objectives: on the one hand, achieving the highest possible fidelity to the source data, and on the other hand, ensuring the efficiency and functionality of the model. Modelling at the university level enables faster machine learning than modelling specific faculties. It should also be noted that highly specific items are frequently absent from Wikidata. Consequently, the level of detail (granularity) was determined by the principle of efficiency relative to the time invested.
3 Results and Discussion
The structured data extracted from the annotated full text were subsequently analysed and compared with relevant items on Wikidata. This analysis reflects the state of Wikidata as of 24 July 2025. To match a biographical entry with the relevant item, a script was used to extract names and dates of birth and death from the annotated full text and to execute a series of SPARQL queries.9 The existence of items that were not automatically found on Wikidata was then verified manually.
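A minimal sketch of this matching step is shown below. It fills the query template from note 9 with an extracted name and birth date and sends it to the public Wikidata Query Service; the actual script runs several such queries (for instance, also by death date) before falling back to manual verification. The example values are illustrative only.

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Query template from note 9: match by exact label/alias and date of birth (P569).
QUERY_TEMPLATE = """
SELECT DISTINCT ?item WHERE {{
  ?item wdt:P569 "{birthdate}"^^xsd:dateTime ;
        (rdfs:label|skos:altLabel) ?label .
  FILTER (STR(?label) = "{name}")
}}
"""

def match_by_birthdate(name: str, birthdate: str) -> list[str]:
    """Return the URIs of all Wikidata items matching the given name and birth date."""
    query = QUERY_TEMPLATE.format(name=name, birthdate=birthdate)
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "blgbl-matching-sketch/0.1"}, timeout=60)
    return [b["item"]["value"] for b in r.json()["results"]["bindings"]]


print(match_by_birthdate("Max Mustermann", "1850-01-01T00:00:00Z"))  # illustrative values
```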
3.1 Items Not Existing on Wikidata
The basic matching revealed that a corresponding entity on Wikidata existed for only 73% (3,693 of 5,065) of biographical entries, despite the fact that half a century has passed since the publication of the initial BLGBL volume. This finding raises a crucial question: Do the 27% of personalities “lost in the gap” share any common characteristics? In other words, is there any definable gap on Wikidata?
A detailed analysis, encompassing life dates and occupations, supports this supposition. The majority (75%) of these individuals were born during the 19th century, and nearly a quarter (23%) were employed in education as teachers, university professors or school principals – most of whom were active around the turn of the 19th and 20th centuries. Other significant professions (each representing 3–5%) included factory owners, actors, doctors, regional writers, and regional researchers.
These data reveal a significant bias in the reflection of historical reality. Over the course of the 20th and 21st centuries, the perceived notability of some of these professions has declined, meaning they may no longer be considered notable enough from a contemporary perspective. However, during the period of nationalisation and intensifying Czech-German national tensions at the turn of the 19th and 20th centuries, teachers, school principals, and regional cultural figures played a pivotal role in disseminating ideas and shaping socio-cultural identities and boundaries (Zahra, 2008). Moreover, during the Industrial Revolution, the direct influence of individual industrialists on society was far greater than in contemporary times, when the decisions of large corporations are made not by individuals but rather by boards of directors.
In conclusion, the data demonstrate that the evaluation of notability on Wikidata is characterised by a predominantly contemporary (ahistorical) perspective, rather than by an accurate reflection of the historical reality of who was truly notable during the period in question. This conclusion aligns with the findings of other studies analysing content gaps (Das et al., 2023). Baker and Mahal (2024) even argue that ‘the Wikidata model favours a presentist and event-based temporality’. As a free knowledge base, the content of Wikidata is largely based on the current and more or less random interests of its editors. In contrast, academic institutions produce biographical dictionaries according to predetermined criteria. These are certainly not free of bias, but still represent a more systematic and objective approach. Academic biographical dictionaries can therefore help to at least partially eliminate the biases and historical content gaps present in Wikidata.
3.2 Contents of Existing Items
A comparison of the 73% of dictionary entries that have corresponding entities on Wikidata enables an assessment of how complete those entities are. From another perspective, the analysis addresses the extent to which academic dictionaries (the BLGBL in this case) still possess untapped potential for enriching Wikidata. An examination of the fundamental data (Tables 2 and 3), encompassing birth and death dates and places, reveals that this information is consistent between the dictionary and Wikidata in 75% of cases.
Table 2
Comparison of entries in the BLGBL and on Wikidata: birth and death dates.
| CATEGORY | BIRTH DATE (P569), n | % | DEATH DATE (P570), n | % |
|---|---|---|---|---|
| Same value | 2,850 | 80.2% | 2,758 | 76.9% |
| More detailed in the BLGBL | 183 | 5.1% | 227 | 6.3% |
| Less detailed in the BLGBL | 86 | 2.4% | 65 | 1.8% |
| Contradictory value | 398 | 11.2% | 455 | 12.7% |
| Value only in the BLGBL | 38 | 1.1% | 80 | 2.2% |
| Total values | 3,555 | | 3,585 | |
Table 3
Comparison of entries in the BLGBL and on Wikidata: birth and death places.
| CATEGORY | BIRTH PLACE (P19), n | % | DEATH PLACE (P20), n | % |
|---|---|---|---|---|
| Same value | 2,618 | 74.9% | 2,367 | 68.2% |
| More detailed in the BLGBL | 62 | 1.8% | 42 | 1.2% |
| Less detailed in the BLGBL | 44 | 1.3% | 164 | 4.7% |
| Contradictory value | 202 | 5.8% | 180 | 5.2% |
| Value only in the BLGBL | 567 | 16.2% | 717 | 20.7% |
| Total values | 3,493 | | 3,470 | |
However, between 5% and 6% of entries in the BLGBL contain more precise information about the date of birth or death than that provided by Wikidata – that is, the BLGBL contains the full date, whereas Wikidata records only the year. The BLGBL has even greater untapped potential with regard to places of birth and death: 16% and 21% of its entries, respectively, contain information on the place of birth or death that is entirely absent from the corresponding Wikidata entities. Finally, a notable proportion of entries contain information that contradicts that on Wikidata; the accuracy of these claims should be verified through further historical research.
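The categories used in Tables 2 and 3 follow directly from comparing the precision of the two values. A minimal sketch of this classification for dates, under the simplifying assumption that each date is a (year, month, day) tuple with None for unknown parts, might look as follows; the project’s actual comparison script may differ in detail.

```python
from typing import Optional, Tuple

Date = Tuple[int, Optional[int], Optional[int]]   # (year, month, day)

def precision(d: Date) -> int:
    """Number of known date components (1 = year only, 3 = full date)."""
    return sum(part is not None for part in d)

def classify(blgbl: Optional[Date], wikidata: Optional[Date]) -> str:
    if wikidata is None:
        return "value only in the BLGBL"
    if blgbl is None:
        return "value only on Wikidata"   # category not shown in the tables
    shared = min(precision(blgbl), precision(wikidata))
    if blgbl[:shared] != wikidata[:shared]:
        return "contradictory value"
    if precision(blgbl) > precision(wikidata):
        return "more detailed in the BLGBL"
    if precision(blgbl) < precision(wikidata):
        return "less detailed in the BLGBL"
    return "same value"

print(classify((1848, 4, 2), (1848, None, None)))   # -> more detailed in the BLGBL
print(classify((1848, 4, 2), (1849, 4, 2)))         # -> contradictory value
```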
The potential of the BLGBL remains even more underutilised in relation to other properties. For instance, the first BLGBL volume contains 1,968 entries that correspond to Wikidata entities and together provide a total of 2,626 claims about the schools attended. Of these claims, 1,999 (76%) are entirely absent from Wikidata – that is, there is neither a more nor a less precise value. A similar situation is evident for properties related to academic degrees, functions or positions, membership in religious orders or other organisations (such as associations), and awards received (Table 4).
Table 4
Comparison of entries in the BLGBL matched with Wikidata (WD): non-unique properties.
| PROPERTY (WD ID) | LABEL | BLGBL ENTRIES WITH THE PROPERTY | BLGBL CLAIMS IN TOTAL | CLAIMS MISSING ON WD, n | % |
|---|---|---|---|---|---|
| P106 | occupation | 3,664 | 12,948 | 5,454 | 42.1% |
| P69 | educated at | 1,968 | 2,626 | 1,999 | 76.1% |
| P39 | position held | 1,262 | 2,663 | 1,484 | 55.7% |
| P512 | academic degree | 1,073 | 1,163 | 1,097 | 94.3% |
| P410 | military rank | 209 | 660 | 533 | 80.8% |
| P551 | residence | 462 | 518 | 500 | 96.5% |
| P937 | workplace | 1,004 | 534 | 465 | 80.1% |
| P463 | membership | 649 | 526 | 453 | 86.1% |
| P166 | award received | 672 | 515 | 434 | 84.3% |
| P1066 | student of | 334 | 333 | 307 | 92.2% |
| P3342 | significant person | 233 | 290 | 290 | 100.0% |
| P108 | employer | 658 | 245 | 219 | 89.4% |
| P22 | father | 527 | 308 | 134 | 43.5% |
| P611 | religious order | 220 | 325 | 114 | 35.1% |
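For the non-unique properties in Table 4, the ‘missing on WD’ count for a matched person reduces, in the simplest case, to a set difference between the BLGBL values and the Wikidata values of the same property; the real comparison also accounted for more and less precise values. A minimal sketch with placeholder identifiers:

```python
def missing_on_wikidata(blgbl_values: set[str], wikidata_values: set[str]) -> set[str]:
    """BLGBL claims whose value does not appear on the matched Wikidata item."""
    return blgbl_values - wikidata_values

# Illustrative values for 'educated at' (P69) on one matched person (placeholder QIDs):
blgbl_p69 = {"Q00001", "Q00002"}
wikidata_p69 = {"Q00001"}
print(missing_on_wikidata(blgbl_p69, wikidata_p69))   # -> {"Q00002"}
```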
In conclusion, the collected data indicate that, despite Wikidata’s 13-year existence, the potential of academic dictionaries that have not yet been digitised as full texts remains largely untapped. A case study of the first BLGBL volume reveals that Wikidata entities are missing for 27% of dictionary entries; that existing Wikidata entities lack 14% of the birth and death data available in the BLGBL; and that, for many other types of information, the rate of absence exceeds 80% (Table 4).
In the case of the BLGBL, the main obstacle to data transfer was the fact that it was available only as scanned images. For other biographical dictionaries, the barrier may lie in restrictive copyright policies that prevent the dissemination of data. Some dictionaries may have already been digitised as full texts, but their content remains behind a paywall (e.g. Grove Music Online10 with Wikidata ID P8591).
3.3 Data Publishing
Following a thorough analysis, the BLGBL dataset was published in October 2025 in collaboration with Collegium Carolinum as a MediaWiki-based online encyclopaedia under a Creative Commons BY-NC-SA 4.0 licence (https://blgbl.de). All BLGBL entries were linked to Wikidata. In order to facilitate the comprehensive integration of the online version of the BLGBL into the LOD ecosystem, new entities were created on Wikidata where none previously existed, and existing Wikidata entities were enriched with the BLGBL identifier (P13876) and at least basic biographical data (birth and death details).
Furthermore, a SPARQL endpoint was established (https://blgbl.de/sparql),11 enabling the public to construct their own queries on the digitised structured data.
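Assuming the endpoint follows the standard SPARQL 1.1 protocol (an HTTP GET with a query parameter and JSON results), it can be queried as in the sketch below. The query makes no assumptions about the BLGBL vocabulary; see the documentation in note 11 for the data model actually used.

```python
import requests

# Minimal probe of the public BLGBL SPARQL endpoint: fetch ten arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

r = requests.get(
    "https://blgbl.de/sparql",
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
for binding in r.json()["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```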
3.3.1 RDF Export
The exported structured data were stored in RDF format on Zenodo:
Repository location: https://doi.org/10.5281/zenodo.18023314
Repository name: Zenodo
Object name: Biographisches Lexikon zur Geschichte der böhmischen Länder, vol. 1
Format names and versions: RDF
Creation dates: 2025-12-22
Dataset creators: Daniel Baránek
Language: German, English
License: Creative Commons BY-NC-SA 4.0 (the licence designated by the Collegium Carolinum)
Publication date: 2025-12-22
3.3.2 BLGBL & Wikidata Dataset
The second dataset stored on Zenodo is a JSON export of simplified structured data containing property–value pairs for each BLGBL item and for the corresponding Wikidata items as of 24 July 2025.
Repository location: https://doi.org/10.5281/zenodo.18021081
Repository name: Zenodo
Object name: Biographisches Lexikon zur Geschichte der böhmischen Länder, vol. I – compared with Wikidata
Format names and versions: JSON
Creation dates: 2025-12-22
Dataset creators: Daniel Baránek
Language: English
License: Creative Commons BY-NC-SA 4.0 (the licence designated by the Collegium Carolinum)
Publication date: 2025-12-22
4 Implications/Applications
This paper demonstrates the substantial potential of machine learning methods for converting printed biographical dictionaries into structured data suitable for Wikidata. Although the biography2wikidata model currently achieves an accuracy of only around 50% in retrieving valid statements – and therefore still requires human verification – its performance is nonetheless sufficient to significantly accelerate the annotation workflow of biographical entries and the conversion of full texts into structured data.
Despite having been trained on a single biographical dictionary, the biography2wikidata model can be applied to other dictionaries after appropriate fine-tuning. The data produced from the BLGBL can subsequently be employed to analyse content discrepancies between academic dictionaries and Wikidata.
The results of the analysis reveal a definable gap in the content of Wikidata. Notability on Wikidata tends to be assessed from a predominantly contemporary (ahistorical) perspective, rather than as a reflection of historical reality. The analysis further demonstrates that the information contained within Wikidata entities is highly fragmented compared with that contained within academic dictionaries, specifically the BLGBL.
The workflow developed for converting the content of the BLGBL into structured data can therefore be logically applied to other dictionaries as well, in order to enrich the content of Wikidata and transform it from a hub of authority identifiers to a valuable source of content information.
Notes
[1] This rule is elaborated especially in the language versions of Wikipedia (e.g. https://en.wikipedia.org/wiki/Wikipedia:No_original_research). However, the validity of this rule also follows from various definitions of what Wikidata is and is not (https://www.wikidata.org/wiki/Wikidata:Introduction, https://www.wikidata.org/wiki/Wikidata:What_Wikidata_is_not).
[2] SPARQL query: https://qlever.dev/wikidata/2wXvTo. Retrieved 2025-10-15.
[3] SPARQL query: https://qlever.dev/wikidata/B3RLGb. Retrieved 2025-10-15.
[6] Example SPARQL query: https://qlever.dev/wikidata/IOzYnR.
[7] SPARQL query: https://qlever.dev/wikidata/SwDBo6. Retrieved 2025-11-04.
[9] SPARQL query for searching by date of birth: SELECT DISTINCT ?item WHERE { ?item wdt:P569 "{birthdate}"^^xsd:dateTime; (rdfs:label|skos:altLabel) ?label. FILTER (STR(?label) = "{name}") }.
[11] Brief documentation: https://blgbl.de/w/Hilfe:SPARQL.
Acknowledgements
Special thanks go to Veronika Kršková, who collaborated on creating ground truth for HTR and text-to-text models as part of the project. Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
Funding Statement
Digitisation and analysis were largely carried out as part of the “Wikimedia versus traditional biographical encyclopaedias. Overlaps, gaps, quality, and future possibilities” project, granted by Wikimedia Research Fund, Grant ID: G-RS-2402-15215. (Baránek, 2025). The writing and publication of this article was made possible thanks to resources and services provided by the Czech Literary Bibliography research infrastructure (https://clb.ucl.cas.cz/, ORJ: 90243).
Competing Interests
Digitisation and analysis were largely funded by the Wikimedia Research Fund.
