(1) Context and motivation
The increasing importance of digital research methods and the growing need for data sharing and collaboration have led to a surge in the development of digital infrastructure for research data management. Within this context, the Consortium for Research Data on Material and Immaterial Cultural Heritage in Germany (NFDI4Culture) has been working to establish a robust and sustainable research data infrastructure for the arts and humanities. One key component of this infrastructure is the Wikibase4Research1 service, which provides customized Wikibase instances as repositories for research data (Rossenova, 2025). This service is offered and maintained by the Open Science Lab, based at the German National Library of Science and Technology (TIB).
The open-source software suite Wikibase facilitates the storage and management of Linked Open Data (LOD) and features collaboration and version control capabilities. Wikidata, the most well-known public instance of Wikibase, serves as a centralized hub for structured data across all knowledge domains, complementing the encyclopedic vision of Wikipedia. The notability criteria for Wikipedia articles2 are less strictly enforced in Wikidata. Still, there is a strong preference in manual and automated curatorial policies for Wikidata to act as a secondary database for data originally published elsewhere, i.e. to include verifiable sources and references wherever possible (Vrandečić & Krötzsch, 2014; Mietchen et al., 2015). The data model of Wikidata (and also Wikibase) structures data as semantic triples, consisting of items with attached labels and unique identifiers (IDs) and properties that attribute specific values to the items.3 These values may be other items within the database or other textual or numerical information. The primary triples can also be further refined with secondary statements that qualify the primary values or add source references (Vrandečić & Krötzsch, 2014). This approach closely matches the way humanities researchers typically represent knowledge in their domains – as statements consisting of a subject, a predicate and an object, each of which can be supplemented with further argumentation and relevant references – making Wikibase and Wikidata suitable environments for representing research data (Rossenova et al., 2022).
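To make this data model concrete, the following schematic sketch shows a single statement with one qualifier and one reference. It illustrates the conceptual structure only, not the actual Wikibase JSON serialization; the IDs follow Wikidata conventions (Q42 = Douglas Adams, P69 = educated at, P582 = end time, P854 = reference URL), and the reference URL is a placeholder.

```python
# Schematic sketch of one Wikibase/Wikidata statement (illustrative only,
# not the JSON serialization actually used by the software).
statement = {
    "subject": "Q42",             # item: Douglas Adams
    "predicate": "P69",           # property: educated at
    "object": "Q691283",          # item: St John's College, Cambridge
    "qualifiers": {
        "P582": "1974",           # end time - further qualifies the primary value
    },
    "references": [
        {"P854": "https://example.org/source"},   # reference URL (placeholder)
    ],
}
```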
Independent of Wikidata, instances of the Wikibase software can be used as LOD repositories of original research, customized to meet the specific needs of individual projects (Rossenova et al., 2022). While each approach – structuring research data in Wikidata vs Wikibase – comes with specific advantages and disadvantages, utilizing them in combination has proven fruitful in a wide range of use case scenarios (Thiery et al., 2024, 2025).
This paper discusses concrete methods we have developed in working across Wikibase and Wikidata as research data management infrastructures in the context of the ongoing, iterative development of the Wikibase4Research service. We work closely with specific research projects in order to develop workflows for data modelling, data enrichment, and data upload informed by concrete data management needs and challenges faced by our partners.
Since Wikidata provides large data coverage across a range of both general and specialized topics, it is of particular interest to humanities scholars for its aggregation of authority file IDs (VIAF, GeoNames, National Library Authorities, and many more). The IDs, used to uniquely identify entities across different databases, can be accessed and reused following the methods for data reconciliation, enrichment and federation described in this paper. This enables researchers to use Wikidata as a shortcut to map entities and concepts in other datasets to their own data. This in turn contributes to the creation of a robust and interconnected network of data, placing Wikidata as a central actor in achieving the final step in the 5-Star LOD Model (Berners-Lee, 2006), namely “linking to other Linked Open Data sets”.
To discuss the capabilities of both Wikibase and Wikidata, as well as the potential for optimisation and automation, we present three use case projects in the architectural field implementing the Wikibase4Research service and taking advantage of the ‘hub’ role of Wikidata: Manor Houses in the Baltic Sea region, Corpus of Baroque Ceiling Paintings, and Federation of German-Speaking Architectural Collections. Through these case studies, we aim to provide a practical guide for researchers and data managers seeking to integrate their data with other linked data sources in general, and Wikidata in particular.
(2) Dataset descriptions
(2.1) Manor Houses in the Baltic Sea Region
The German-language project – Herrenhäuser des Ostseeraums – is a research initiative at the University of Greifswald that applies an interdisciplinary research perspective to the digital documentation of manor houses and their estates in the Baltic Sea region dating from the 18th century, or earlier. The project involves collecting and documenting data on manor houses spread across ten countries in the region. The dataset includes basic information on all manor houses in the region, as well as detailed documentation of over a dozen representative estates. The data is presented in a visually engaging way through a web portal that utilizes a custom frontend design connected to the Wikibase repository (Bailly, 2024). The dataset features a range of media formats, such as images, videos, and 3D documentation, as well as interactive formats like timelines and clickable maps. The project’s goal is to create a multidimensional access point to the research outputs, allowing users to query and explore the complex contents and generate a broader interest in the topic of manor houses.
Repository location – https://wb.manorhouses.tibwiki.io/
Format names and versions – Data can be queried at https://query.wb.manorhouses.tibwiki.io and downloaded as JSON or CSV/TSV. RDF dumps can be created upon request.
Creation dates – 2021–2025
Dataset creators – Initial dataset creation by the research team at Greifswald (Carsten Berger, Kilian Heck, Maria Mischke, Thomas Wilke, Marion Müller, Torsten Veit, Julia Jauch, Ulrike Ide, Ulrike Gawlik), data curation by Lucia Sohmen
Language – German
License – CC-BY-4.0
(2.2) Corpus of Baroque Ceiling Paintings – Weikersheim case study
This dataset constitutes a small sample from a larger dataset dedicated to the detailed art and architectural history of baroque ceiling painting across castles in Germany. The sample covers the Weikersheim castle complex – documenting all of its buildings and building parts, down to individual rooms and the artworks within them. The dataset also includes media items such as 2D photographs and 3D models with annotations. It was uploaded to a dedicated Wikibase instance deployed for Semantic Kompakkt, another NFDI4Culture service, which connects a Wikibase backend to the Kompakkt viewer, allowing the viewing and annotation of 2D and 3D media.
Repository location – https://wikibase.semantic-kompakkt.de
Format names and versions – Data can be queried at https://query.semantic-kompakkt.de and downloaded as JSON or CSV/TSV. RDF dumps can be created upon request.
Creation dates – 2021–2022
Dataset creators – Initial dataset creation by the Corpus der barocken Deckenmalerei in Deutschland (CbDD) research team at https://www.deckenmalerei.eu/. In collaboration with the project team, a subset consisting of detailed data about the Weikersheim castle was uploaded to the Semantic Kompakkt Wikibase by Lucia Sohmen.
Language – English, German
License – CC-BY-4.0
(2.3) Federation of German-Speaking Architectural Collections
The Architectural Collections Wikibase combines data from the “Föderation deutschsprachiger Architektursammlungen”, a network of institutions in the German-speaking region that collect and preserve architectural materials. The project aims to create a centralized catalog of architectural archival holdings from various institutions, making it easier for researchers to locate relevant materials. While some collections have fully digitized and cataloged their holdings, others remain unprocessed, particularly in the case of archival materials. To overcome these challenges, the project initially focused on documenting individuals whose materials were available in the archives, with plans to expand to include buildings and other works in the future.
Repository location – https://architektursammlungen.tibwiki.io/
Format names and versions – Data can be queried at https://query.architektursammlungen.tibwiki.io/ and downloaded as JSON or CSV/TSV. RDF dumps can be created upon request.
Creation dates – 2024–ongoing
Dataset creators – Raw data supplied by federation members, data curation by Lucia Sohmen.
Language – German
License – CC-BY-4.0
(3) Method
The workflows and approaches we discuss in this paper were developed following an agile, case-study based methodology. Following open science and FAIR data principles,4 we worked closely with the researchers producing the datasets in each of our case studies in order to better understand the specific needs and requirements of each project while serving the goal of making new research data easily findable, accessible, interoperable and reusable on the web. We developed corresponding workflows and infrastructure services that are able to generate, process, and retrieve data and visualizations, best suited to the research questions commonly addressed in our target communities. Lessons learned from each case study fed into follow-up projects and continue to provide input for the ongoing development of our services.
(3.1) Data tools
One of the most common tools used for data cleaning and reconciliation in the library and information science fields is OpenRefine,5 an open-source tool that offers a graphical user interface and a wide range of possibilities for data manipulation and analysis (Groves, 2016; Sterner, 2019). Data reconciliation, which involves linking research data in a local repository to data from relevant external resources, plays a particularly important role in how Wikibase4Research instances interact with Wikidata. The reconciliation process in OpenRefine is semi-automated, meaning that the software uses an algorithm to match local data values to data in the remote resource, but human judgment is still required to review and approve the suggested results. OpenRefine uses the Reconciliation API,6 a protocol that enables communication between a client and a database, such as Wikidata, to facilitate the reconciliation process. Currently, when users (and client software) connect to the reconciliation endpoint of Wikidata or individual Wikibase instances, they are connecting to an external Python wrapper service. This solution is error-prone, maintenance-heavy and no longer officially maintained by the OpenRefine team.7 The mid- to long-term roadmap of the developer communities around both the Wikibase and OpenRefine tooling is to develop a reconciliation service as a dedicated MediaWiki extension that can be more easily run and maintained by Wikibase users.8
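As an illustration of the protocol (not of our specific setup), the minimal Python sketch below posts a batch of reconciliation queries to a Wikidata reconciliation endpoint and prints the ranked candidates. The endpoint URL refers to the external wrapper service mentioned above and may change as that service evolves; the query strings are illustrative.

```python
import json
import requests

# Minimal sketch of a Reconciliation API call (endpoint URL is an example and
# may change; individual Wikibase instances expose their own endpoints).
# The client posts a batch of named queries and receives ranked candidates.
ENDPOINT = "https://wikidata.reconci.link/en/api"

queries = {
    "q0": {"query": "Weikersheim Castle"},
    "q1": {"query": "Greifswald"},
}

response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
response.raise_for_status()

for key, result in response.json().items():
    for candidate in result["result"]:
        # each candidate carries an ID, a label, a score and a 'match' flag
        print(key, candidate["id"], candidate["name"],
              candidate["score"], candidate["match"])
```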
(3.2) Data workflows
Data preparation
When working with unstructured data from diverse sources, our initial step is to transform the data into a structured table format, thereby facilitating manipulation in OpenRefine. The approach to achieving this varies significantly depending on the input data, which can take numerous forms. In the architectural collections project, for instance, data has been delivered in Word documents (either unstructured or presented in tables), Excel files, websites, JSON, or XML files. While the latter can be opened directly in OpenRefine thanks to their structure, they often contain superfluous information that necessitates filtering. Consequently, it is beneficial to parse these files using a script prior to importing them into OpenRefine. Beyond adhering to formal requirements, we structure data values in a manner that ensures one type of entity per table and one value per cell (Rossenova & Sohmen, 2021). This second step can be accomplished directly within OpenRefine utilizing its data manipulation methods. Further refinements to the data include removing trailing whitespace, formatting dates, or deleting irrelevant data.
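As an example of such a pre-processing script, the sketch below (file, element, and column names are hypothetical) filters an XML delivery down to a flat CSV table with one entity type per table and one value per cell, ready for import into OpenRefine.

```python
import csv
import xml.etree.ElementTree as ET

# Sketch of a pre-processing step for an XML delivery (element and column
# names here are hypothetical): keep only the fields relevant to the target
# data model, one entity type per table, one value per cell.
tree = ET.parse("architects_export.xml")   # hypothetical input file

with open("architects.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["name", "birth_date", "gnd_id"])
    for person in tree.getroot().iter("person"):         # hypothetical element
        writer.writerow([
            (person.findtext("name") or "").strip(),      # strip stray whitespace
            (person.findtext("birthDate") or "").strip(),
            (person.findtext("gnd") or "").strip(),
        ])
```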
Data reconciliation
Once the input data is cleaned and prepared in OpenRefine, we proceed with reconciling entities to other databases (usually Wikidata) and enriching them with useful information from that database. OpenRefine allows us to automatically reconcile entities with databases that have implemented a reconciliation endpoint.9 Reconciliation compares the string values to labels in the requested database, so typos might lead to errors. Furthermore, reconciliation only checks labels in one language, so the appropriate language for the dataset has to be chosen for each reconciliation operation or it should be repeated for each language. The returned matches can be filtered in advance by specifying the class type (for example “human” or “country”). This will also return values whose type is a subclass of the given type (so if “place” is given as a type, the reconciliation will also return countries). Manual checks are still necessary to confirm the results or search for matches when the automated process does not find any. In our experience, this manual verification process is crucial for ensuring the accuracy of the reconciled data.
In the Architectural Collections Wikibase, we reconciled people to Wikidata, and in the Manor Houses Wikibase – estates, locations, as well as people. In the Weikersheim sample dataset, the castle estate itself as well as the iconographic concepts used to describe the objects inside were also reconciled. Besides Wikidata, locations in the manor houses dataset were matched with the global geo data authority GeoNames,10 because its database has more extensive coverage of geolocations and allows search by coordinates through its API. The locations were then matched to Wikidata using their GeoNames IDs. This is possible because OpenRefine allows values for specific properties (e.g. GeoNames IDs) to be added as additional criteria for reconciliation, in addition to the label. This typically makes matches more accurate, especially when the dataset contains entities with similar (or identical) labels, but different semantic meanings.
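The same additional criteria can be expressed directly in a Reconciliation API query. The sketch below builds on the earlier example; the endpoint, place name, and GeoNames ID value are illustrative, while P1566 is the Wikidata property for GeoNames ID and Q486972 the Wikidata class for human settlements.

```python
import json
import requests

# Sketch of a constrained reconciliation query: 'type' restricts candidates to
# (subclasses of) a given class, and a property/value pair - here the Wikidata
# property for GeoNames ID (P1566) - acts as an additional matching criterion.
ENDPOINT = "https://wikidata.reconci.link/de/api"   # German-language endpoint, example

query = {
    "q0": {
        "query": "Greifswald",
        "type": "Q486972",                      # human settlement
        "properties": [
            {"pid": "P1566", "v": "1234567"},   # GeoNames ID (placeholder value)
        ],
    }
}

result = requests.post(ENDPOINT, data={"queries": json.dumps(query)}).json()
print(result["q0"]["result"][0])    # best candidate with score and match flag
```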
For each matched entity, there are three possible results: no matches; one or more possible matches to choose from; or one automatically confirmed match. Even for values with automatically confirmed matches, we still manually verify the results, as there have been cases where an entity with the same label is found in the remote database, but it is not the same entity as the one in our local dataset. This occurs frequently with individuals who share the same name, but the one from our dataset is not notable enough to be included in the authority database. Our verification process for automatically matched values is similar to that for choosing one out of multiple possible matches. We use additional information from the local dataset, as well as the remote database to either deny or confirm a match. Identifiers from authority files can be particularly useful, as they are unique. Other useful comparison data includes dates or coordinates. However, since this kind of data is not always available, we often have to make informed guesses based on occupation or general geographic area.
In our experience, different approaches to automating reconciliation are often necessary for larger datasets. When unique identifiers, such as those from authority files, are available in both the local dataset and remote database, we leverage this information to automate the verification process. By adding the relevant information from the database to our local OpenRefine dataset and utilizing OpenRefine’s built-in functionality, we can perform automated checks to see if the identifiers match. Additionally, we can incorporate information other than IDs into our dataset to facilitate mass verification of matches. For instance, in the case of the Architectural Collections data, we added occupation information and then established a rule that any matches with the occupation “architect”, or similar, were likely to be true matches.
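Such checks can also be scripted outside of OpenRefine. The sketch below is one possible approach under assumed file and column names (all hypothetical): it retrieves the GND ID (Wikidata property P227) for each matched QID from the Wikidata SPARQL endpoint and flags rows where it disagrees with the identifier already present in the local dataset.

```python
import pandas as pd
import requests

# Sketch of an automated verification step (file and column names hypothetical):
# compare the GND ID already present in the local dataset with the GND ID
# (P227) attached to the matched Wikidata item.
df = pd.read_csv("architects_reconciled.csv", dtype=str)
# expected columns: name, local_gnd, wikidata_qid

qids = " ".join(f"wd:{q}" for q in df["wikidata_qid"].dropna().unique())
query = f"""
SELECT ?item ?gnd WHERE {{
  VALUES ?item {{ {qids} }}
  ?item wdt:P227 ?gnd .          # P227 = GND ID
}}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "reconciliation-check-sketch/0.1"},
)
remote = {
    row["item"]["value"].rsplit("/", 1)[-1]: row["gnd"]["value"]
    for row in resp.json()["results"]["bindings"]
}

# flag rows whose local GND ID does not equal the GND ID recorded in Wikidata
df["gnd_matches"] = df.apply(
    lambda r: remote.get(r["wikidata_qid"]) == r["local_gnd"], axis=1
)
print(df[~df["gnd_matches"]][["name", "local_gnd", "wikidata_qid"]])
```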
Data enrichment
Reconciling our datasets to external databases, such as Wikidata, enables us to enrich them with additional information. First, we add identifiers from the database to our dataset, providing a unique reference point for each entity. Once values in an OpenRefine project are reconciled to Wikidata (or another resource), the matched entity identifiers can be uploaded to the source database (e.g. the Wikibase instance) via data statements that use a dedicated property of the data type External Identifier. In the case studies in this paper, we added matched identifiers from Wikidata and GeoNames via such dedicated properties.11
Next, we can leverage the information available in the external database to extend our own data. Entities represented in Wikidata often include multiple identifiers from authority files, such as VIAF,12 GND,13 AAT14 and more. When authority files have their own reconciliation endpoints, we can tap into their resources to get locations, images, birth dates, and more. The enrichment process can even cascade further: for example, by adding a city of birth for a person based on the reconciled Wikidata entry for that person, we can (if needed) subsequently enrich our local dataset with the city’s geo coordinates and corresponding country, too. The specific data that is added must be guided by the actual needs of the particular research project. In the case of the dataset of architects linked to the Architectural Collections Wikibase, we have extended the original dataset with a GND identifier from the German National Library, birth and death dates, occupation, and an image from Wikimedia Commons – all retrieved via Wikidata. Once all the relevant information is added to the local dataset in OpenRefine and all data fields are formatted correctly, we can upload the data to the dedicated Wikibase repository via the Wikibase extension for OpenRefine. Data extension via external databases and upload to Wikibase allow us to create richer presentations and visualizations of the data in customized frontend interfaces, which can be further utilized in research outputs of the project partners. The data extension itself is a key example of active data reuse: it reuses data from one source to enrich data from another. A downside is the potential for data to go stale (or out of date), but in cases where data doesn’t change regularly (such as dates of historic events, geolocation coordinates, etc.) it is an effective means to improve presentation and analysis potential.
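As a simplified sketch of what such an enrichment lookup involves (independently of OpenRefine’s built-in data extension feature), the query below retrieves GND ID, birth and death dates, occupation label, and a Wikimedia Commons image for a small set of matched items. The QIDs are illustrative, and the property IDs are the Wikidata ones mentioned in the text (P227, P569, P570, P106, P18).

```python
import requests

# Sketch of an enrichment query against the Wikidata SPARQL endpoint for a
# handful of matched items (QIDs illustrative). P227 = GND ID, P569/P570 =
# birth/death date, P106 = occupation, P18 = image (Wikimedia Commons file).
QIDS = ["Q42", "Q937"]   # illustrative matched items

query = """
SELECT ?item ?gnd ?birth ?death ?occupationLabel ?image WHERE {
  VALUES ?item { %s }
  OPTIONAL { ?item wdt:P227 ?gnd . }
  OPTIONAL { ?item wdt:P569 ?birth . }
  OPTIONAL { ?item wdt:P570 ?death . }
  OPTIONAL { ?item wdt:P106 ?occupation . }
  OPTIONAL { ?item wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }
}
""" % " ".join(f"wd:{q}" for q in QIDS)

rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "enrichment-sketch/0.1"},
).json()["results"]["bindings"]

for row in rows:
    print({k: v["value"] for k, v in row.items()})
```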
Data federation
An alternative to data enrichment that does not involve duplicating data from the remote database (Wikidata) in the local database (Wikibase) – or vice versa – is data federation via SPARQL15 queries. A federated SPARQL query is initially posted to one SPARQL interface, but then leverages any number of other SPARQL interfaces, allowing it to combine data from different sources. Federated queries are often used to combine information about resources from one Wikibase with Wikidata, since many specialized Wikibases do not need to store generalized information already available in Wikidata (e.g. geolocations for common places). Furthermore, Wikidata IDs can also be used to post queries (from a Wikibase) to other databases that have matched their entities to Wikidata as well (e.g. GeoNames, or other Wikibase instances).16
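The sketch below illustrates the general shape of such a federated query: it is posted to a local Wikibase query service and uses a SERVICE clause to pull coordinates from Wikidata for items matched via their Wikidata ID. The endpoint URL, the prefix for local properties, and the local property P1 (‘Wikidata ID’) are placeholders that differ for every Wikibase instance.

```python
import requests

# Sketch of a federated query posted to a local Wikibase query service.
# Endpoint URL, local prefix and the local property P1 'Wikidata ID' are
# placeholders - each Wikibase instance defines its own IDs. The SERVICE
# clause pulls coordinates from Wikidata for items matched via their QID.
LOCAL_ENDPOINT = "https://query.example-wikibase.org/sparql"   # placeholder

query = """
PREFIX lwdt: <https://example-wikibase.org/prop/direct/>   # local properties (placeholder)
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

SELECT ?house ?qid ?coords WHERE {
  ?house lwdt:P1 ?qid .                         # local statement: Wikidata ID (placeholder P1)
  BIND(IRI(CONCAT(STR(wd:), ?qid)) AS ?wdItem)  # build the Wikidata entity IRI from the ID
  SERVICE <https://query.wikidata.org/sparql> {
    ?wdItem wdt:P625 ?coords .                  # P625 = coordinate location
  }
}
"""

rows = requests.get(LOCAL_ENDPOINT, params={"query": query, "format": "json"}).json()
print(len(rows["results"]["bindings"]), "items with coordinates from Wikidata")
```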
Federated queries have several advantages beyond simply saving disk space: they ensure data is always synced and up to date, as they do not rely on data from the external source being also stored locally; furthermore, they enable potentially new use cases for data reuse, not foreseen by the original research team. For example, the historical database Factgrid,17 another research-driven instance of Wikibase, contains data on a vast number of historical people across Europe and beyond. Data in Factgrid can be queried alongside the Manor Houses dataset (via the shared Wikidata IDs) to reveal additional information about the families that owned the houses over centuries (Thiery et al., 2024). This reuse case did not exist for Factgrid until the Manor Houses Wikibase was created.
At the same time, federated queries can involve fairly complex syntax-writing and come with a steep learning curve. They might also have poorer performance (partly depending on the resource availability in the remote data sources), resulting in slower response times or time-outs.
In the next sections of this paper, we discuss in more detail how each research case study utilized available tools and methods, taking advantage of but also challenging the current capabilities of Wikidata as a central hub of cross-disciplinary information and Wikibase as a space for FAIRification of original research data.
(4) Results and discussion
(4.1) Data enrichment results
Manor Houses
Out of a total of 1645 manor houses spread across nine countries, 1439 (87%) have been successfully matched with a Wikidata ID, indicating good coverage for most countries (Table 1). A notable exception is Germany, where only roughly a quarter of the manor houses in the original dataset can be found on Wikidata, suggesting that either many of these manor houses do not have a Wikidata entry or they were not identified by our team. Although the dataset created by the research partners already contained extensive information, we were able to add coordinates to many of the manor houses via Wikidata, greatly improving the capabilities for visualization (e.g. on a map). A closer look at the reconciliation of locations shows that only 1083 out of 1565 locations (69%) have a Wikidata ID. This is because reconciliation with Wikidata was not a priority for locations that already had a match with a GeoNames ID, from which coordinates were obtained. For reasons of efficiency, in the case of location reconciliation we relied only on automated matches and performed no further manual matching.
Table 1
Results from the reconciliation of manor houses with Wikidata IDs per country.
| COUNTRY | MANOR HOUSES WITH WIKIDATA ID | MANOR HOUSES TOTAL | % MANOR HOUSES WITH WIKIDATA ID |
|---|---|---|---|
| Poland | 570 | 572 | 99.6% |
| Denmark | 152 | 178 | 85.3% |
| Norway | 28 | 32 | 87.5% |
| Germany | 52 | 204 | 25.4% |
| Lithuania | 73 | 73 | 100% |
| Finland | 56 | 68 | 82.3% |
| Estonia | 129 | 131 | 98.4% |
| Sweden | 306 | 309 | 99% |
| Latvia | 73 | 78 | 93.5% |
Corpus of Baroque Ceiling Paintings
This case study dataset is substantially different from the manor houses dataset, since it contains only one building ensemble and the 103 items within it. These items include buildings, rooms, and individual paintings. Only the building ensemble itself is matched to Wikidata, while its parts do not have individual Wikidata items, revealing the limits of the global database, which focuses on items with a particular degree of notability. We performed reconciliation with Wikidata primarily for the iconographic concepts present in the original dataset, used to describe what is depicted on paintings. Iconclass18 identifiers for those concepts were already available via the original dataset, in which 65 artworks had been annotated with 36 iconographic concepts; 21 of these concepts could be matched with a Wikidata ID. Matched concepts were mostly animals, while unmatched concepts tended to be highly specific, like “Hercules chokes the Nemean lion with his arms”. Confidence in many of these matches was not high, but mapping depiction concepts to Wikidata still brings advantages: for example, the Wikidata mappings can be used to provide translations of the concepts into different languages with little effort.
Architectural Collections
While in the previous two cases our enrichment process focused more on working with objects and geolocations, in this case we were able to test the opportunities and limits of enhancing data related to people through reconciliation with authority files. At the time of writing, the dataset of architects contains 2008 people. We were able to link 1741 (87%) of these individuals to existing Wikidata items, providing valuable connections to the broader network of information. Additionally, 1777 individuals (88%) were matched with GND IDs. We were able to add birth dates for 1758 individuals (87%), either from the original data or through the enrichment process. Occupation information for 1621 individuals (81%) was added from Wikidata. Perhaps most notably, we were able to link 732 individuals (36%) to images from Wikimedia Commons, extending the potential for visual representation of our dataset. These enhancements improved the depth and breadth of our dataset, providing a more comprehensive understanding of the architects represented within it, as well as of the gaps in representation in other authority records: the architects not matched to external databases require further research and possibly future inclusion in the authority databases.
(4.2) Federated queries
Federated queries19 originating from the SPARQL query service deployed with our Wikibase4Research instances can integrate data from Wikidata, effectively treating it as if the data were stored locally in the Wikibase instance. For example, in the case of the Architectural Collections Wikibase, researchers can create queries that generate a map of architects’ birthplaces; filter the list of architects by female gender; or display images of their notable works, even though the Wikibase itself does not contain this data (Figure 1). Thanks to the core functionalities of LOD, this process works the other way round, too: the manor houses project Wikibase, which contains many manor houses not present in Wikidata, can be used to supplement Wikidata. By executing queries that retrieve manor houses from both sources, we can display them side by side on a map (Figure 2). Even a small dataset, like the one describing the Weikersheim castle complex, can be put into more context using federated queries: for instance, by finding other castles with the same architectural style nearby (Figure 3).

Figure 1
Architects’ birthplaces on a map.

Figure 2
Manor houses in Germany’s Baltic Sea region.

Figure 3
Renaissance style castles near Weikersheim castle.
Authority data (e.g. the GND IDs for the architects) pulled from Wikidata can also be used for federated queries with other databases. For instance, the NFDI4Culture Knowledge Graph20 uses GND IDs for people, allowing us to initiate a query to identify all people in our dataset and determine which of these also appear in the NFDI4Culture KG, along with their associated works, even though we do not know the local NFDI4Culture KG IDs for these people. Many other databases have mapped their own IDs to Wikidata precisely for this reason, enabling us to use Wikidata IDs to establish connections between items in our database and theirs. For example, we can create a connection between the Manor Houses Wikibase and the historical database Factgrid to find out which manor houses are documented in both.
(4.3) Ongoing challenges
Working through the steps of data preparation, enrichment and querying in the context of these specific case studies delivered valuable results in terms of quantitative data mappings and extensions. The quantitative results contribute to an improved qualitative experience of researchers working with the respective datasets—having access to more comprehensive data via the frontend interface (in the case of data enrichment) and/or via APIs (e.g. with federated queries in the SPARQL query service). But there are ongoing challenges for humanities projects aiming to work effectively across Wikibase instances and Wikidata.
Verification of reconciliation results is one of the challenges we continue to encounter across different case study datasets. Differences in the data models between Wikidata and our datasets can make this process more difficult. For example, in the case of the manor houses dataset, Wikidata tends to provide historical locations, like parishes, as the only given location for a manor house. Our dataset, on the other hand, only contained current administrative locations for each manor house. Therefore, verifying matches often involved consulting Wikipedia articles in foreign languages in order to find hints about current administrative locations or other meaningful data points for individual parishes.
We encountered additional challenges in the reconciliation process when performing manual searches in the remote database to locate potential matches in cases where the automated process returned no suggestions. Errors due to typos, use of special characters or nicknames, or names given in different order are common. In some cases, we were unable to find matches at all, and the local data entries remained ‘less rich’, leaving space for more research in the future. Sometimes, matches could be found in different databases. However, this complicates the enrichment process because the manual search has to be repeated for each database.
Throughout our experience with different case studies, the reconciliation step tended to be the longest one when preparing the upload of a complete dataset to Wikibase. We were typically able to complete reconciliation much faster when identifying data was present in the original dataset, particularly identifiers from authority files, since we could then add additional criteria to the matching algorithm of OpenRefine and make full use of the benefits of Wikidata being a ‘hub’ of external IDs.
Once reconciliation and upload were complete, the next set of challenges we encountered was related to enabling data federation. Our Wikibase4Research instances required additional configuration and/or documentation (Rossenova, 2024), such as the setup of a so-called “Allowlist”, which signifies to the Wikibase/Wikidata query service which other SPARQL endpoints are considered “safe” and can be queried via the local service. The relatively complex syntax of federated queries, which requires not only good knowledge of the SPARQL protocol, but also of both the local and remote data models of the resources being queried, presents further barriers. As with data reconciliation, differences in the data models can make both writing and interpreting the results of federated queries more difficult.
The challenges we have encountered throughout the case studies described above have served to influence decisions about the roadmap of development of the Wikibase4Research service. Furthermore, as part of our work in various open-source communities and working groups (e.g. the NFDI Knowledge Graphs Working Group21 or the Wikibase Stakeholder Group22), we aim to contribute to the documentation and improvement of the ecosystem of related tools and extensions which also impact our workflows with(in) Wikibase/Wikidata.
(5) Implications and outlook
The methods for reconciliation and enrichment described in this paper can be applied to a wide range of use cases in the cultural heritage and humanities fields, since Wikidata’s coverage of domains and mappings to authority files can bring added value to most scenarios where at least some form of structured data is available. Most of the steps we applied to the case study datasets rely on the OpenRefine and Wikibase software, and sometimes on the custom versions of Wikibase and the OpenRefine reconciliation service we deploy with the Wikibase4Research service. We have indicated where we rely on such customization, and additional documentation is available in the open-source code repositories of the services. Like all open-source software, these services remain in a state of flux, as does the data available in the collaboratively-curated environment of Wikidata. This opens opportunities to actively influence the future development of not only the software services, but also the wealth of data available in Wikidata. The use cases discussed so far primarily focused on reusing data from Wikidata, but what if Wikidata could reuse data from individual research project repositories? The reconciliation processes carried out to extend data in Wikibase with Wikidata IDs (and more) can also be applied in the other direction – adding previously missing data items to Wikidata – but this requires additional resources, which may not always be covered by research funding. Furthermore, the ambiguity around which data belongs in Wikidata, what constitutes ‘notability’ and which items are ‘not notable enough’, remains a concern among many humanities researchers and a frequent topic among user communities and technical working groups.
As more humanities projects enrich their datasets and/or federate with Wikidata, the global landscape of connected knowledge changes and so do the challenges associated with keeping the data accurate, complete and FAIR. Addressing these challenges has social as well as technical implications: besides further developments of the software services for reconciliation and federation, research communities need to work closely on aligning data models, FAIR data policies, and sharing best practices.
Additional File
The additional file for this article can be found as follows:
Notes
[1] See: https://nfdi4culture.de/services/details/wikibase4research.html. Last accessed date: 01/11/2026.
[2] See: http://en.wikipedia.org/wiki/Wikipedia:Notability. Last accessed date: 01/11/2026.
[3] See: https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer. Last accessed date: 01/11/2026.
[4] See: https://www.go-fair.org/fair-principles/. Last accessed date: 01/11/2026.
[5] See: https://openrefine.org/. Last accessed date: 01/11/2026.
[6] See: https://github.com/reconciliation-api. Last accessed date: 01/11/2026.
[7] Several forks run by different institutions are in use (including the one hosted by TIB), but not necessarily actively developed. See: https://gitlab.com/nfdi4culture/openrefine-reconciliation-services/openrefine-wikibase and https://github.com/judaicadh/wikibaseopenrefine. Last accessed date: 01/11/2026.
[8] See the discussion in the OpenRefine forum: https://forum.openrefine.org/t/new-mediawiki-extension-reconciliationapi/2218. Last accessed date: 01/11/2026.
[9] There is a list of available endpoints: https://reconciliation-api.github.io/testbench/#/. Last accessed date: 01/11/2026.
[10] See: https://www.geonames.org/. Last accessed date: 01/11/2026.
[11] An alternative method, which we also use in some cases, is to add mappings via direct owl:sameAs RDF statements thanks to the Wikibase RDF extension (See: https://github.com/ProfessionalWiki/WikibaseRDF), but this has to be done via custom upload scripts, rather than OpenRefine, as OpenRefine does not yet offer support for this experimental Wikibase extension.
[12] See: https://viaf.org/en. Last accessed date: 01/11/2026.
[13] See: https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html. Last accessed date: 01/11/2026.
[14] See: https://www.getty.edu/research/tools/vocabularies/aat/. Last accessed date: 01/11/2026.
[15] SPARQL is the query language for RDF data, the Resource Description Framework for encoding Linked Open Data used in most semantic web resources available on the open web. For more information, see (Sohmen, 2025).
[16] Although there are specific conditions for the configuration of each SPARQL endpoint that need to be met before federation from both sides becomes possible. See more in Discussion.
[17] See: https://database.factgrid.de/wiki/Main_Page. Last accessed date: 01/11/2026.
[18] See: https://iconclass.org/. Last accessed date: 01/11/2026.
[19] Full description of the queries discussed in this section, including the query code, and screenshots of the results are documented in the Supplement document to this paper.
[20] See: https://nfdi4culture.de/de/dienste/details/culture-knowledge-graph.html. Last accessed date: 01/11/2026.
[21] See: https://zenodo.org/records/7228955. Last accessed date: 01/11/2026.
[22] See: https://wbstakeholder.group/. Last accessed date: 01/11/2026.
Competing interests
The authors have no competing interests to declare.
Author Contributions
Lucia Sohmen – Data Curation, Formal Analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing.
Lozana Rossenova – Supervision, Conceptualization, Funding Acquisition, Project Administration, Visualization, Writing – original draft, Writing – review & editing.
Ina Blümel – Supervision, Conceptualization, Funding Acquisition, Project Administration, Writing – review & editing.
