
Bridging the Gaps: Integrating Bibliographic Metadata Into Wikidata for Literary Corpora

Open Access | Feb 2026


(1) Context and motivation

Wikidata, as an open and collaborative knowledge base, provides a rich resource for storing, linking, and analyzing structured data that can be edited and accessed at scale by both humans and machines (Wikidata, n.d.-a). It is currently one of the largest global knowledge repositories (Schelbert & Müller, 2023, p. 1). Because it is centrally maintained and continuously updated by a broad community, it supports the sharing of authority data and the integration of diverse datasets (Tharani, 2021).

In the Digital Humanities (DH), Wikidata is used primarily to extract and visualize relationships between entities and linked open data resources (Wojcik et al., 2023; Conroy, 2023; Soudani et al., 2019; Egloff et al., 2019; Almeida et al., 2016)—for example, geographical or biographical information, or character networks within literary works—and to supplement and enrich existing bibliographic metadata and databases (e.g., Reeve, 2020; Fischer et al., 2019; Müller et al., 2019). For a systematic review of projects in DH using Wikidata, see Zhao (2023). For a recent Special Issue on “Wikipedia, Wikidata, and World Literature”, see Fischer et al. (2023).

Metadata—data about data—provides essential contextual information about the items it describes. Missing or incomplete metadata can make texts difficult or even impossible to use for research (Egloff & Picca, 2020, p. 813). DH projects working with large datasets often rely on such information to trace cultural trends over long time periods (Underwood, 2019), examine geographic or social contexts (Erlin et al., 2021; Wilkens, 2021; Algee-Hewitt et al., 2020), study translation networks (Teichmann, 2025), or analyze genre patterns (Gittel, 2021). None of this work is feasible without reliable metadata for analysis and interpretation.

Creating comprehensive metadata, however, is labor-intensive. Even when digital libraries, on which many corpora rely, include fields such as publication date, genre, or book length, these records are often incomplete. Building datasets entirely from scratch, for instance through book digitization, frequently requires manual metadata creation. Scholars must therefore supplement existing information by consulting library catalogues, bibliographies, or Wikidata. Although Wikidata offers a relatively convenient way to retrieve such information via SPARQL, its API, or tools such as OpenRefine (Huynh, 2012), its coverage is uneven, and many literary works are only partially described or not represented at all. Moreover, using these tools can itself demand a considerable investment of time, since their documentation is often incomplete or difficult to navigate. As a result, large-scale metadata creation still requires substantial manual effort and triangulation across multiple sources, and it remains difficult to integrate such processes into a living workflow, since they are often treated as one-time cleaning operations (e.g., in tools such as OpenRefine).

Major digital libraries such as HathiTrust or the American Project Gutenberg typically provide only very basic metadata, often limited to author and title.2 Since these collections contain millions of books spanning multiple genres, languages, and publication contexts, consistently tagging them with reliable metadata remains a monumental task. To circumvent this, researchers often automatically infer missing publication dates when constructing such corpora for quantitative analysis. Common strategies include using the year of publication of the earliest edition available in the collection (Bagga & Piper, 2022; Underwood et al., 2020) or estimating dates from authors’ lifespans (e.g., Langer et al., 2021). For genre classification within the HathiTrust library, predictive models have been used to distinguish fiction from non-fiction (Hamilton & Piper, 2023; Underwood et al., 2020). While these approaches offer a reasonable proxy when analyzing cultural phenomena at scale, they do not always reflect actual publication histories. Moreover, large-scale collections often include translations, reprints, or later editions, which obscure the cultural context in which a work was originally produced. Having accurate metadata would allow scholars to study such cultural specificity in greater detail.

Storing metadata for such works is not only beneficial for more robust DH research practices but also contributes to “data longevity” (Egloff & Picca, 2020, p. 813). The effort invested in gathering bibliographic information is valuable in itself; making it publicly accessible ensures that it can be reused rather than repeatedly reconstructed. This is especially significant for marginalized communities whose literary output is often underrepresented or poorly documented (de Beyssat, 2025; Wolfe, 2019). In our own corpus, for instance, many women authors have Wikidata profiles, yet their works are not recorded there. Wikidata provides an infrastructure for storing such information in a non-localized, verifiable, and community-maintained environment (Candela, 2023).

While Wikidata is already used to enrich bibliographic metadata, the reverse practice—uploading newly collected metadata from DH corpus-building back into Wikidata—is far less common. Some libraries have begun experimenting with manual uploads (Tharani, 2021), but semi-automated integration or embedding this step into DH workflows remains rare. To our knowledge, only one project has attempted this systematically: Nešić et al. (2022), who used OpenRefine and QuickStatements (Manske, 2019) to upload metadata and named entities for 720 novels in seven different languages from the ELTeC corpus as linked open data. Their uploads included items on first editions, print editions, digital editions, and ELTeC editions, as well as manually added character and place entities tagged within the novels, resulting in more than 20,000 new items (Nešić et al., 2022, p. 7). Although they relied on open-source tools to automate parts of the workflow, the authors also emphasize the necessity of manual checks, especially for metadata validation and reconciliation.

Building on this work, our aim is twofold: (1) to provide an in-depth feasibility analysis of existing tools, OpenRefine and more programmatic Python-based approaches, and (2) to outline a reproducible pipeline for preparing data and linking it to Wikidata that can be integrated directly into DH corpus-building workflows.3 Some projects already save Wikidata QIDs in metadata files during corpus creation, but this remains comparatively uncommon.4 Where such metadata exists, it offers an important first step toward more complete and interoperable datasets.

Linking bibliographic records to Wikidata items via QIDs offers several advantages:

  • metadata becomes reusable and shareable across projects;

  • updates can circulate both upstream (from local corpora to Wikidata) and downstream (from Wikidata to institutional or research metadata);

  • DH corpora gain persistent identifiers that support reproducibility and interoperability;

  • the broader community benefits from work already carried out by individual researchers.

Despite these benefits, integration at scale remains challenging. In the following pages, we outline these challenges and share insights from our own attempts to negotiate them. We assess different approaches, discuss their strengths and limitations, and offer preliminary pathways forward. We have argued that the effort invested in gathering bibliographic information is valuable in itself and that making it publicly accessible ensures that it can be reused rather than repeatedly reconstructed. The same holds true for the effort invested in developing mechanisms and workflows to gather, refine, share, and maintain such data, as well as in documenting and sharing knowledge about these mechanisms. Our goal is not to provide comprehensive solutions but to open discussion and share practical guidance based on our experience.

(2) Dataset description

For this case study, we use metadata from a German-language fiction and non-fiction corpus spanning the years 1780 to 1930, previously published in this journal (Rohrbacher, 2025), and upload records to Wikidata using partial automation (a “bot”) to carry out the actual Wikidata edits. The corpus consists of public-domain texts scraped from the German and US Project Gutenberg repositories. It was compiled to support large-scale analyses of literary production across genres and periods. The metadata set for the fiction corpus contains 4,215 entries, corresponding to literary works by a total of 1,064 authors.

The available metadata fields include “author”, “title”, “publication year”, “genre”, “gender”, “source” (i.e., the collection from which the text was obtained), and, for works obtained from the US site, “Gutenberg ID”. Since the German Gutenberg site’s numeric ID does not appear to be used outside the site itself, we kept the “source URL” in this case. As will be further explained below, part of this URL serves as an identifier for authors on Wikidata. The fields “gender” and “publication year” in particular required substantial effort to ensure the reliability of the data.
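For illustration, a metadata table with these fields can be read into a simple per-work record structure. The sample rows below are hypothetical (IDs and URLs are placeholders, and the actual column names in our CSV may differ slightly):

```python
import csv
import io

# A hypothetical two-row excerpt of the metadata table; actual column names
# and values may differ (URLs here are placeholders).
SAMPLE = """author,title,publication year,genre,gender,source,source URL
"Raabe, Wilhelm",Stopfkuchen,1891,fiction,m,gutenberg-de,https://www.projekt-gutenberg.org/raabe/stopfkuc/stopfkuc.html
"Hennings, Emmy",Gefängnis,1919,fiction,f,gutenberg-de,https://www.projekt-gutenberg.org/hennings/gefaengn/gefaengn.html
"""

def load_metadata(csv_text):
    """Parse the metadata table into one dict per work."""
    return list(csv.DictReader(io.StringIO(csv_text)))

records = load_metadata(SAMPLE)
```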

(3) Method

Our workflow begins by retrieving QIDs for authors with existing Wikidata entries and saving these identifiers in the metadata table. We then attempt to create new Wikidata items programmatically for works associated with these authors. At this stage, we limit ourselves to uploading information on “title”, “publication year”, and “available at”, together with, of course, “language” and “author”, as well as an English and a German label.

Our workflow ultimately unfolded as follows:

We used OpenRefine, an open-source tool originally developed at Google, to clean our list of authors—its clustering feature identified several near-duplicates we had previously missed—and to match a subset of our 1064 authors to Wikidata entities. Matching our 4215 literary texts to existing (and often inconsistent) items representing works or editions remains, in our view, a substantial challenge for which we currently do not have an elegant solution. While OpenRefine offers a workflow for this type of matching, the process is extremely labor-intensive and suffers from a lack of reproducibility and transparency. For instance, OpenRefine suggested 307 correct QIDs out of our 1,064 authors, and 319 out of our 4,215 works. We were not able to keep statistics on false positives, and such results are not easily reproducible across OpenRefine sessions. Using automated scripting, we were able to identify additional authors, increasing our set of matches to 719 (see below for details). Identifying further texts remains an ongoing endeavor.

As a next step following OpenRefine’s reconciliation tool, we queried Wikidata via SPARQL to retrieve all publications associated with our matched authors. Within this set of works, we performed string-based searches using regular expressions applied to the title, label, description, and alias fields to identify matches with our dataset; the regular expressions are provided in the annex and, in context, in our GitHub repository.5 Despite its simplicity and need for further development, this method produces many reliable matches in a short amount of time and yields easily reproducible results. It also made clear that we could more readily determine when an author did not already have one of our works associated with them. Based on this insight, our next steps were to upload data for all items authored by individuals who had no associated works on Wikidata.
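This retrieval-and-matching step can be sketched as follows. The snippet is illustrative only: the exact queries and regular expressions live in Annex 1 and the GitHub repository, and the `ncnp` normalization shown here is one plausible reading of the stripping described in Annex 1.

```python
import re
import unicodedata

def works_query(author_qids):
    """Build a SPARQL query listing all works (P50 = author) for the given
    author QIDs, with German/English labels via the label service."""
    values = " ".join(f"wd:{q}" for q in author_qids)
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  VALUES ?author {{ {values} }}\n"
        "  ?work wdt:P50 ?author .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }\n'
        "}"
    )

def ncnp(title):
    """Normalize a title for matching: fold case, decompose accents, and
    strip punctuation, combining marks, and whitespace (cf. the 'ncnp'
    normalization in Annex 1; the exact rules there may differ)."""
    decomposed = unicodedata.normalize("NFKD", title.casefold())
    return re.sub(r"[\W_]+", "", decomposed)

def title_matches(local_title, candidate_labels):
    """Return candidate labels whose normalized form equals the local title's."""
    key = ncnp(local_title)
    return [label for label in candidate_labels if ncnp(label) == key]
```

The query string would then be sent to the Wikidata SPARQL endpoint; only the pure matching logic is shown here.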

Using a Python script that reads from our CSV table, we interact with the Wikidata API via the wikibaseintegrator package. The source code is made publicly available on GitHub; to maximize reusability, we provide detailed documentation of the more technical aspects of the project there.

For each item in our subset, the script creates three entities:

  1. one representing the literary work,

  2. one representing its first edition (with the date from our metadata),

  3. and one representing its digital edition on Projekt Gutenberg-DE or Project Gutenberg-US—the source from which we obtained the text file, including a link and/or ID when available.

The script then generates the appropriate links between these entities; that is, it adds “has edition or translation” properties to the work to relate it to its editions, and, conversely, uses “is edition or translation of” to link editions back to the work—an operation that is only possible once the items have been created and assigned a QID. Refer to the flowchart in Figure 1 for an overview of the process.

johd-12-483-g1.png
Figure 1

Flowchart illustrating the process of integrating bibliographic metadata into Wikidata.
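Our actual script drives these steps through wikibaseintegrator; the library-free sketch below only shows the shape of the data involved. The property and class IDs are the standard Wikidata ones (P31 instance of, P50 author, P407 language of work, P577 publication date, P953 full work available at URL, P747 has edition or translation, P629 edition or translation of; Q7725634 literary work, Q3331189 version/edition/translation, Q188 German), but all QIDs in the example records are hypothetical, and labels and value formats are simplified (real uploads need Wikidata's time format for P577).

```python
def plan_items(record, author_qid):
    """Sketch the three items created for one record, plus their claims."""
    work = {
        "labels": {"de": record["title"], "en": record["title"]},
        "claims": {"P31": "Q7725634",   # literary work
                   "P50": author_qid,   # author
                   "P407": "Q188"}}     # language: German
    first_edition = {
        "labels": {"de": f"{record['title']} (Erstausgabe)"},
        "claims": {"P31": "Q3331189",   # version, edition or translation
                   "P577": record["publication_year"]}}
    digital_edition = {
        "labels": {"de": f"{record['title']} (Projekt Gutenberg)"},
        "claims": {"P31": "Q3331189",
                   "P953": record["source_url"]}}  # full work available at URL
    return work, [first_edition, digital_edition]

def link(work, editions, work_qid, edition_qids):
    """Once items exist and have QIDs, add the bidirectional edition links."""
    work["claims"]["P747"] = list(edition_qids)   # has edition or translation
    for edition, qid in zip(editions, edition_qids):
        edition["claims"]["P629"] = work_qid      # edition or translation of

record = {"title": "Stopfkuchen", "publication_year": "1891",
          "source_url": "https://www.projekt-gutenberg.org/raabe/stopfkuc/stopfkuc.html"}
work, editions = plan_items(record, "Q12345")           # hypothetical author QID
link(work, editions, "Q11111", ["Q22222", "Q33333"])    # hypothetical new QIDs
```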

To be or not to be a bot?

Scripted, high-volume editing on Wikidata is, of course, a potentially dangerous endeavor: small mistakes on the part of a script operator can cause damage that requires substantial effort to repair. A large number of edits per minute can also place undue strain on servers. Moreover, scripted access is often perceived as correlating with other problematic practices, such as the creation of low-quality entries. Accordingly, high-volume access is regulated by community conventions, which are institutionalized through the so-called bot approval process. If a script is perceived as unapproved bot activity, both the script and its operator may be banned from Wikidata.

To qualify for bot access, a script and its operation must meet certain criteria. The kind of scripted access we carry out, however, occupies a grey area. Reconciliation tools such as OpenRefine can generate a volume of edits comparable to that of our script. Scripted, high-volume editing is not automatically classified as bot activity; the core (though not the only) defining feature of a bot appears to be a periodically and autonomously acting script that must provide a means for an administrator to halt its operation in case it causes widespread disruption. Nevertheless, since we were unable to find unambiguous confirmation that our task would be exempt from the bot approval process, we decided to err on the side of caution and apply for bot access.

Information on the approval process is scattered across various sources and varies in completeness and relevance. We recommend carefully reviewing the available documentation, as well as examining both successful and unsuccessful bot approval requests. Before implementing the script, the project should be discussed with the community on one of the dedicated discussion pages—in our case, Project Books. Before requesting approval, the operator must register a separate user account for the bot and create a user page for it that includes the bot template. This template requires the username of the owner (i.e., the user account representing the operator) as a parameter. The bot’s user page must also provide additional, specific information about the bot.

Bots must not exceed a certain number of edits per minute. Our bot goes a step further by implementing edit groups. Under the bot account, a bot password or OAuth login should be requested to help limit the bot’s capabilities.
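The exact rate limit is set by community convention, so the figure below is purely illustrative. A minimal throttle for spacing out API edits might look like this:

```python
import time

class EditThrottle:
    """Cap scripted edits at a maximum rate; interval is the minimum
    number of seconds between two consecutive edits."""
    def __init__(self, max_per_minute=30):
        self.interval = 60.0 / max_per_minute
        self._last = 0.0

    def wait(self):
        """Block until enough time has passed since the previous edit."""
        now = time.monotonic()
        delay = self._last + self.interval - now
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()

throttle = EditThrottle(max_per_minute=600)  # unrealistically fast, for demonstration
start = time.monotonic()
for _ in range(3):
    throttle.wait()   # in the real script: perform one API edit here
elapsed = time.monotonic() - start
```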

Finally, the operator requests approval and performs a certain number of test edits, which are then assessed by the bot approval group. Even when everything is done correctly, the process can take several days. Mistakes are generally tolerated, and the community is helpful within the limits of available resources. For further details, see our GitHub repository and the relevant Wikidata pages.

To be or not to be notable?

When should a work be represented on Wikidata? According to Wikidata, the project serves “two main goals: to centralize interlanguage links across Wikimedia projects and to serve as a general knowledge base for the world at large” (Wikidata, n.d.-b). The page lists three criteria for inclusion, all of which are relevant to our project.

First, entities that have a page on one of the various Wikimedia sites—for example, many poems, fairy tales, or articles in our corpus that have pages on Wikisource. Second, entities that are a “clearly identifiable conceptual or material entity that can be described using serious and publicly available references.” (Ibid.) One could argue that this applies to every item in our corpus, since they are all available on Project Gutenberg. Third, inclusion is warranted due to structural needs; one example is the convention imposed by the editing community to separate series and parts of series, or editions and works (see below).

Based on these criteria, we see no obstacle to sharing any of our data items on Wikidata in terms of notability. On the other hand, in many cases we postpone sharing because of inconsistencies, such as misspellings or formatting errors in titles, unclear status as a work or part of a work, or uncertainty about whether the item is already present. The last issue may sound simple, but it is decidedly not.

Step by step, we whittled down the dataset:

  • On one side were exact matches, where we can edit existing Wikidata items (319 items). Some of these items are improved by our edits, since they were missing claims (exact number unknown).

  • On the other were items clearly absent from Wikidata, which we could create without introducing inconsistencies (475 items).

The remaining records sit in a data “limbo,” potentially requiring slow, item-by-item reconciliation and perhaps cleanup. At 3,421 items, this group is by far the largest. This part of the process benefits the community the most, since it improves the consistency and accuracy of shared data, while at the same time being the most labor-intensive.

(4) Discussion

Using the proposed method and pipeline, we have so far uploaded 193 titles, resulting in 579 items. While this is not an enormous number, it represents a first step toward semi-automatically uploading high-quality entries from existing bibliographic sources.

Both technically and conceptually, we identify four broadly predictable sources of difficulty when sharing data with a large, pre-existing structured data repository.

1) Internal inconsistencies in the data

First, the data one intends to share may contain internal inconsistencies—spelling or formatting errors, duplicated entries, or both. These problems, of course, already exist before the decision to share the data. Yet the process of aligning local data with an external repository often points to new inconsistencies that have previously gone unnoticed. In this sense, sharing data also promotes greater internal consistency. For example, in our dataset several entries represent not an entire literary work but only a part of it, such as a single volume. Such items should not be represented on Wikidata as standalone works but rather as “parts of” a larger entity. If a translation or edition includes only one volume of a multi-volume work, the edition should be linked to the complete work and, ideally, qualified accordingly. However, the boundary between “work” and “part of a work” is not always straightforward.

2) Pre-existing inconsistencies in Wikidata

This leads to a second, more interesting source of difficulties: pre-existing inconsistencies in Wikidata itself. Here the same types of problems reappear on a larger scale. Wikidata contains numerous items for Goethe’s Faust, for instance, and several of them do not follow established guidelines for representing literary works. If we want to add an edition of Faust, how responsible are we for cleaning up the existing tangle of entries? Many Wikidata items blur the distinction between a literary work, its editions, translations, or even parts of a work. This is partly a consequence of Wikidata’s crowdsourced nature—contributors cannot be expected without fail to follow editing conventions—and partly a legacy of early stages of its multilingual development. The practices of communal curating of data continue to evolve. Any attempt to add new data to an item shaped by this history inevitably involves a judgement about how much one is willing to contribute to that evolutionary process.

2a) Wikidata’s ontological model for literary works

Wikidata distinguishes between a work and its edition. This is a simplification of the Functional Requirements for Bibliographic Records (FRBR) model, adapted to the specific requirements of Wikipedia and Wikisource (see note 1 for details on the distinction between work and edition). In our case, we must create entries both for the work itself and for an edition representing its manifestation on Project Gutenberg. It is debatable whether the first known edition should be represented by a separate Wikidata item. Depending on further discussion with other editors on Wikidata, we may in the future forgo creating a separate entity for the first edition unless we can link it to a library record via a common ID.

For works that are part of a series, an item marked as an instance of “book series” (Q277759) should also be created. The individual parts and the entire work should be connected via the qualified property “has parts of the class volume”, using “series ordinal” qualifiers to specify the ordering of parts. In the other direction, the property “is part of” can be used, with qualifiers such as “volume”, “follows”, and “is followed by”.

To avoid uploading records that represent only partial works (such as single volumes of multi-volume works), we developed simple procedural heuristics—for example, treating titles ending with Roman numerals as likely volumes—and then manually reviewed the results. These heuristics are summarized in Annex 2, and the full code is available in our GitHub repository.

This iterative process quickly produced a working solution for this dataset, although the heuristics themselves are inevitably corpus-specific. Nevertheless, when working with a few hundred texts at a time, this method can efficiently filter out partial records. In our batch, only about 25 of 200 items were partial works, a manageable number for manual review.
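A much-reduced version of such heuristics could look like the sketch below. The patterns are examples only; the actual rules are summarized in Annex 2 and implemented in the repository, and any real set would need corpus-specific tuning.

```python
import re

# Illustrative patterns for German titles; the real heuristics differ.
ROMAN = r"(?:[IVXLC]+)"
VOLUME_PATTERNS = [
    re.compile(rf"\b{ROMAN}\.?\s*$"),                                # trailing Roman numeral
    re.compile(r"\b(?:Band|Bd\.|Teil|Tl\.)\s*\d+", re.IGNORECASE),   # "Band 2", "Teil 3"
    re.compile(r"\b(?:Erster|Zweiter|Dritter)\s+(?:Band|Teil)\b",
               re.IGNORECASE),                                       # "Zweiter Band"
]

def looks_like_volume(title):
    """Flag titles that are probably a single volume of a larger work."""
    return any(p.search(title) for p in VOLUME_PATTERNS)
```

Flagged titles are then routed to manual review rather than uploaded as standalone works.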

3) Conflicts between datasets

A third category of difficulties arises where the two datasets intersect. In cases of conflict, Wikidata’s rules naturally take precedence. Many texts in our corpus should presumably be classified as “part of” or linked to a collection via “published in”, such as books in a series, journal articles (e.g., Die Gartenlaube),6 fairy tales, and poems.

The matter is complicated by the fact that such items often lack “published in” or “part of” properties on Wikidata, even when this information is available on Wikisource. We cannot provide concise instructions for handling all special cases; an instructive example can be seen by following the “is edition of” and “part of” links in the Wikidata item for a Wikisource page of Rumpelstilzchen.7 Sorting out these intricacies is an ongoing effort and requires discussion with the editing community.

4) Linking missing items

A fourth challenge emerges when no clear connection can be made between the two datasets. This proved to be the most difficult stage for us: determining how to link our data items (authors, works, editions) to existing Wikidata entities. We may have a list of author names, but do they correspond to existing Wikidata items? And if an entity seems not to exist—perhaps because the author’s name is spelled differently—how much time should one invest before creating a new entry? For example, Project Gutenberg lists Emmy Hennings as “Emmy Ball-Hennings.” Overcoming this initial absence of bidirectional references, without which one cannot even detect inconsistencies, requires by far the greatest investment of labor.

4a) More on author matching

We used OpenRefine to match author names from our dataset to Wikidata entities. After many hours reviewing OpenRefine’s spreadsheet-like interface, we identified 307 reliable matches out of 1,064 authors. The same problem occurs at a greater scale for our 4,215 records of literary works, further complicated by the proliferation of inconsistent item types and multiple instances of works and editions in different languages. How to raise the level of certainty for bidirectional identification in a systematic way remains largely an open question.

When common identifiers are available, the process is immensely simplified. After noticing that a certain part of the Gutenberg-DE URL is used as an external identifier on Wikidata (Projekt Gutenberg-DE author ID, P7753), the number of reliable matches increased to 719 with just a few lines of Python code. This is also how we were able to associate Emmy Hennings with her Wikidata item, which OpenRefine did not catch.
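This shortcut can be sketched as follows, assuming (as in our corpus) source URLs of the form `https://www.projekt-gutenberg.org/<author-id>/...`, where the first path segment is the value of the Wikidata external identifier “Projekt Gutenberg-DE author ID” (P7753):

```python
import re

def gutenberg_de_author_id(source_url):
    """Extract the author slug from a Projekt Gutenberg-DE URL; this slug
    is used as the value of the Wikidata property P7753."""
    m = re.match(r"https?://(?:www\.)?projekt-gutenberg\.org/([^/]+)/", source_url)
    return m.group(1) if m else None

def author_qid_query(author_id):
    """Build a SPARQL query resolving a P7753 value to its Wikidata item."""
    return f'SELECT ?author WHERE {{ ?author wdt:P7753 "{author_id}" . }}'
```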

OpenRefine is sometimes difficult to work with and can be buggy. For instance, when reconciling, German values in the title column are sometimes replaced with English ones. Results are not entirely reproducible; during several sessions, the tool suggested many implausible matches, apparently ignoring information such as publication date. In some cases, it treats editions and translations as equivalent to the works themselves. At other times, reconciliation worked well and yielded zero false positives. One helpful strategy was to include the URL to Gutenberg with the “available at” property. Nevertheless, to ensure reliability, we had to manually review hundreds of matches, often with incomplete information displayed by the tool. As mentioned, working with OpenRefine yielded 319 matched works and 307 matched authors.

(5) Implications/Applications

We see our work as a first step toward establishing communal practices and simplifying the integration of bibliographic data from DH-curated datasets into a shared knowledge graph. Since these datasets often contain high-quality metadata about historical literary works, it would be beneficial not only to store them locally or as static supplementary files to journal publications or data repositories, but also to make them available in a shared environment where they can be edited, extended, corrected, and dynamically updated. Via the QIDs, this shared record could in turn be used to update existing files with additional metadata, allowing for a more dynamic process of adding information.

In terms of reproducibility and integration into broader DH workflows, it is important to emphasize that this project is ongoing. As it stands, the code we developed can be reused with minimal corpus-specific adjustments and can semi-automatically upload metadata while taking into account the challenges we have outlined. Our longer-term goal is to refine this pipeline in ways that make it more generalizable, allowing for finer control over the automated parts of the process.

A particularly interesting task is the development of bi-directional matching algorithms that yield reproducible results in a fully transparent manner. While OpenRefine was useful for reconciliation, it did not perform as effectively as expected, and many of the most time-consuming tasks—particularly reliable matching—remain unresolved. Identifying systematic solutions to these problems remains a major task for future work. Despite its limitations, we plan to continue experimenting with OpenRefine. In parallel, we aim to develop additional programmatic solutions to support manual verification and make the upload workflow more scalable.

Maintaining shared resources, however, carries costs. Wikidata editors and the Wikimedia Foundation have recently expressed concerns (Wikidata, n.d.-c) about actors who use the service for mass uploads without taking responsibility for the consistency and quality of this shared resource. As a consequence, Wikidata’s online services—such as the SPARQL endpoint—have, in the recent past, come close to enacting contingency plans; that is, they were at risk of becoming unusable due to the sheer volume of access. One outcome of this strain is the so-called “graph split”: the database for scholarly articles is now separated from the rest of the knowledge graph.

This development raises important questions about whether Wikidata, originally designed to represent items with corresponding pages on a wiki site, should serve as a general-purpose repository for all kinds of research data, such as, in our case, metadata used in literary research. Perhaps a better approach would be to host a Wikibase instance that can fulfill the same role at a public university or similar institution, using it as a common repository for data used and generated in the digital humanities. While data repositories for the digital humanities exist,8 we are not aware of one that hosts a general-use collaborative knowledge graph like Wikidata. A possible bridge between such a new instance and Wikidata could consist of semi-automated scripts. Hosting and maintaining such a site would require considerable resources, especially in terms of human labor. Developing and maintaining a bi-directional data flow with Wikidata or other instances would multiply these requirements. While this would offload some of the strain on Wikidata to another institution, it would also duplicate many tasks, decreasing overall efficiency. Using Wikidata as a central instance has the further advantage of allowing researchers to draw on many other kinds of data without complicated triangulation. A smaller, dedicated instance, in contrast, would simplify the process for the DH community, since adhering to the conventions and requirements of a behemoth of a knowledge graph such as Wikidata also comes at a cost.

On the other hand, it is unlikely that the kind of use we envision in this article will contribute to Wikidata’s demise. This is precisely because we are describing a use that does not simply reap the benefits of the shared resource without bearing the costs. When researchers take responsibility for improving both existing data and the data they add, and participate in developing the communal practices around this resource, there is no reason, at least in our view, to shift this work elsewhere. This is what we have attempted to do: on one level, we improve data integration within the DH community, leading to all the advantages described at the beginning of this article; on a second level, we aim to further develop communal practices around data integration by sharing our code and documenting our workflow. Documenting and sharing code enables other researchers in the DH community to follow these procedures and, in turn, contribute to curating the shared resource.

Notes

[2] Both libraries only provide metadata for “author” and “title” (see Langer et al., 2021, 1095). In HathiTrust, “year” refers to a work’s appearance in the HathiTrust collection rather than to the date of first publication (see Underwood et al., 2020, 15).

[3] Documentation and code for this project are available at: http://github.com/logicdavid/wikidata-public.

[4] For examples of existing datasets that provide QIDs, see for instance Röttgermann (2024), Nešić et al. (2022), and Underwood (2021).

[7] https://www.wikidata.org/wiki/Q19240860 (accessed: January 3, 2026).

Appendices

Annex 1

Some useful regular expressions for author matching:

johd-12-483-g2.png

The suffix “ncnp” denotes that these have been stripped of punctuation, case, and whitespace:

johd-12-483-g3.png

Annex 2

Some useful heuristics for finding parts of work:

johd-12-483-g4.png

Competing Interests

The authors have no competing interests to declare.

DOI: https://doi.org/10.5334/johd.483 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 17, 2025 | Accepted on: Jan 9, 2026 | Published on: Feb 27, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Katrin Rohrbacher, David Schrittesser, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.