Open Data Practices of Art Museums in Wikidata: A Compliance Assessment

Open Access
|Dec 2025

(1) Context and motivation

Museums are among the most important cultural institutions that preserve the tangible and intangible heritage of societies, reflect cultural accumulation, and promote national culture. Historically, museums have primarily collected, preserved, and exhibited physical artifacts. Until the early 20th century, they mainly employed an object-oriented approach; however, in the following years, with the advancement of the Information Sciences, they began to adopt an information-oriented perspective to convey cultural knowledge more effectively to their target audiences (Ayaokur & Yılmaz, 2014). This paradigm shift accelerated with the widespread adoption of digital technologies, enabling museums to reach broader audiences, facilitate interactive participation, and adopt innovative interpretive methods.

The changing identity of museums has introduced new concepts, such as computational museology (Kenderdine, 2022). As a result, museums began sharing their collections under open licences, in high quality, and as data (Open GLAM, 2025; Pekel, 2014; Vollmer, 2014). With the adoption of approaches such as Collections as Data, which support the publication of GLAM (Galleries, Libraries, Archives, Museums) collections as machine-actionable datasets (Padilla et al., 2023), the compliance of these collections with the FAIR (Findable, Accessible, Interoperable, Reusable) data principles (Wilkinson et al., 2016) has also gained importance. Moreover, the enrichment, interlinking, and open reusability of digital collections are becoming increasingly significant (Dişli, Gabriëls, et al., 2025).

Since Tim Berners-Lee introduced the Semantic Web technology in 2001, the goal has been to enhance the discovery and interlinking of content across diverse contexts and domains (Berners-Lee et al., 2001). Berners-Lee proposed a 5-star deployment scheme for Open Data to guide the sharing of data online, enabling the work of data providers, including museums, and recommending Linked Open Data (LOD) as the best option. Alternatively, approaches such as LOUD (Linked Open Usable Data) have been proposed to support the publication of these data in ways that render them usable by potential users (Linked Art, 2025). These advancements have facilitated the reuse of digital collections as rich data sources, paving the way for innovative research methods and broader participation.

In this context, Wikidata stands out as a collaborative, creative, and pioneering platform for enriching GLAM digital collections (Candela et al., 2024). Wikidata increases discoverability, interoperability, and reusability of cultural heritage institutions’ collections. Indeed, initiatives such as the International GLAM Labs Community, Europeana, and Collections as Data also recommend using Wikidata for publishing GLAM data (Candela, Gabriëls, et al., 2023; Europeana, 2015; Mahey et al., 2019). Moreover, the literature provides numerous examples of how linking cultural heritage collections to Wikidata enriched the data (Araújo, 2025; Fagerving, 2023; Faraj & Micsik, 2019; Freire & Isaac, 2019; Okuonghae, 2024; Sant et al., 2025).

Alongside the enrichment of metadata with Wikidata, some works address the organisation of Wikidata as a controlled vocabulary or taxonomy (Araújo, 2025; Mizota, 2021, 2022). Wikidata also enhances the discoverability and interoperability of resources, offering users a richer and broader network of information (Okuonghae, 2024). By interlinking their catalogues with Wikidata, institutions enable users to access broader information relevant to their research interests (Stinson, 2018). Wikidata has the potential to become a database encompassing cultural heritage from around the world (Darnell et al., 2016). In addition, Wikidata’s data, openly available under the Creative Commons CC0 licence, can be freely reused by anyone (Allison-Cassin & Scott, 2018). Wikidata is an essential resource for researchers, developers, and other stakeholders who wish to utilise structured data (Fagerving, 2023). Moreover, it facilitates cultural heritage institutions’ adoption of LOD (Zhu et al., 2023), as anyone can create, publish, and use LOD through it without complicated technical skills (Allison-Cassin & Scott, 2018). It provides data on a wide range of topics and facilitates collaboration, crowdsourcing, and integration (Zhao, 2023). This open structure increases community contributions, thereby enabling the representation of topics not included in official collections and allowing collections to exist beyond their institutional borders and to be connected to different organisations and publics. Its machine-readable and reusable nature also supports initiatives such as Collections as Data.

In GLAM institutions, Wikidata is utilised for various purposes, including data curation and publishing, data extraction and visualisation, data modelling, data enrichment, data analysis, mapping, and enhancing data quality (Candela et al., 2024). Wikidata is also an essential resource for Digital Humanities studies (Zhao, 2023), where it is generally used for description and enrichment, metadata editing, knowledge modelling, Named Entity Recognition (NER), and Entity Linking (Ehrmann et al., 2020; Möller et al., 2022).

Initiatives advocating the use of Wikidata to enrich humanities data emphasise the need for “interoperable and reusable datasets” (Farina & McGillivray, 2024). Our compliance framework offers a preliminary assessment aimed at establishing the necessary foundation to enable the reliable integration and reuse these efforts depend on.

Many institutions have partnered to share their data in Wikidata (Wikidata, 2025). However, their compliance with certain open data preconditions remains highly uneven. This situation can both limit the academic and institutional reuse of the data and undermine their interoperability.

Previous works have examined the current state, potential, and challenges of using Wikidata in GLAM institutions (Candela et al., 2024). However, these works have primarily focused on library and cultural heritage data (Candela, 2023; Candela, Chambers, et al., 2023; Freire & Isaac, 2019). Comprehensive methods to assess the compliance of museum data with open-access requirements are still needed. For Art museums in particular, utilising Wikidata to provide context beyond the collection is especially important (Stinson, 2018). For example, the MoMA (Museum of Modern Art) found that users most often access artworks through the artist’s name, and that by linking artists to Wikidata, it could begin to provide broader information (Romeo, 2016). While such practices undoubtedly support a broader information network and enhance reusability, they also call for assessments of how well museums comply with the aspects that would improve their integration into Wikidata, making that integration better, more transparent, and more sustainable.

This work aims to extract Art museum data from Wikidata using SPARQL[1] queries and to assess them based on criteria such as accessibility, enrichment, machine-readability, and reusability. Within this scope, the work seeks to answer the following questions: 1) What criteria can be used to assess the compliance of Art museums’ open data practices with Wikidata? 2) Which Art museums are most represented on Wikidata, and what is the level of maturity in their data practices and ecosystem integration? The purpose of this work is to define a set of best practices for open data publishing in Wikidata and to benchmark the current level of compliance among major Art museums. The results will provide a clear roadmap for institutions to improve their open data strategies.

(2) Dataset description

Repository location

https://doi.org/10.5281/zenodo.17440061

Repository name

Zenodo

Object name

hibernator11/Wikidata-museum-quality: V1

Format names and versions

Version V1.1

Creation dates

2025-10-25

Dataset creators

Meltem Dişli, Gustavo Candela, Silvia Gutiérrez De la Torre, Giovanna Fontenelle

Language

English

Licence

Creative Commons Attribution 4.0 International

Publication date

2025-10-25

(3) Method

Wikidata has played a relevant role in DH and GLAM institutions (Candela, 2024; Zhao, 2023). To enrich the catalogues, institutions can create properties in Wikidata that can be used to link resources. In addition, when institutions provide a SPARQL endpoint, Wikidata can federate with the repository to search across several datasets. This technique enables the integration of datasets, avoiding data silos.

This work aims to define a set of criteria for publishing open data on Wikidata and to assess the current compliance levels of leading Art museums. The proposed method has three steps: i) selection of repositories; ii) definition of open data compliance criteria; and iii) reporting the results. Each step is described below.

Selection of repositories

The data repositories of Art museums in Wikidata were first identified through the SPARQL query shown in Figure 1. It retrieves Art museums with at least 5,000 records in Wikidata. For the assessment process, the sample was limited to the top 10 museums with the most records in Wikidata. Art museums were identified in Wikidata using the item “art museum” (Q207694) as the reference point. Accordingly, institutions whose P31 (instance of) property is Q207694 were considered “art museums”. Table 1 shows the main features of the selection of Art museums.

Table 1

Museums included in the assessment and the number of their records in Wikidata (data retrieved: Oct 6, 2025).

| Wikidata ID | Museum | Total records in Wikidata |
|---|---|---|
| Q214867 | National Gallery of Art | 128062 |
| Q160236 | Metropolitan Museum of Art | 71813 |
| Q19675 | Louvre Museum | 35946 |
| Q2087788 | National Gallery of Armenia | 24057 |
| Q153306 | National Museum in Warsaw | 22044 |
| Q1568434 | Yale University Art Gallery | 19659 |
| Q812285 | Bavarian State Painting Collections | 18018 |
| Q19877 | Museo Egizio In Turin (IT) | 10012 |
| Q2983474 | Finnish National Gallery | 9200 |
| Q1471477 | Royal Museum of Fine Arts Antwerp | 7984 |

Figure 1

SPARQL Query for retrieving Art museums with at least 5,000 records in Wikidata (https://w.wiki/FZGn).
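
The query in Figure 1 is only available here as an image and a short link, so the following is a hedged sketch rather than the authors' exact query. It assumes that a museum's "records" are items whose collection property (P195) points to the museum. That property choice, the function name `build_museum_count_query`, and the variable names are illustrative assumptions. Python is used only to assemble the SPARQL text.

```python
ART_MUSEUM_QID = "Q207694"  # the "art museum" item used in the text


def build_museum_count_query(min_records: int = 5000) -> str:
    """Assemble a SPARQL query that counts linked items per art museum."""
    return f"""SELECT ?museum ?museumLabel (COUNT(?item) AS ?records) WHERE {{
  ?museum wdt:P31 wd:{ART_MUSEUM_QID} .  # instance of: art museum
  ?item wdt:P195 ?museum .               # ASSUMPTION: records are linked via "collection"
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?museum ?museumLabel
HAVING (COUNT(?item) >= {min_records})
ORDER BY DESC(?records)"""


print(build_museum_count_query())
```

The printed text can be pasted into the Wikidata Query Service. A count across every art museum may be slow on the public endpoint, so narrowing the museum set is a reasonable fallback.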

Definition of open data compliance criteria

In the second stage of the method proposed in this work, a set of compliance criteria was defined for application to the selected Art museums. In this context, and based on previous work (Faerber et al., 2017), we tried to answer the following questions.

  • Does it provide a public SPARQL endpoint? Is the repository federated in Wikidata?

  • Does it include examples of reuse and prototypes (e.g., timelines, reproducible code)?

  • Does it get updated frequently?

  • Is additional documentation provided to improve its understandability?

  • What is the ratio between the total number of records in the source and the number of records linked in Wikidata?

Based on these questions, a set of data quality criteria was defined, as described below.

Understandability

Providing descriptive and textual information about the dataset is essential to foster understandability (Candela, Gabriëls, et al., 2023; Dişli, Gabriëls, et al., 2025). Textual documentation can be provided as a document or a README file. It can include the creators, information about provenance and biases as well as references and potential uses (Alkemade et al., 2023). This criterion is relevant to potential researchers who wish to reuse the repository’s content.


$$D_{und} = \begin{cases} 1 & \text{documentation provided} \\ 0 & \text{otherwise} \end{cases}$$

Availability

Providing an API[2] is crucial for enabling reuse (Candela, Gabriëls, et al., 2023). However, the availability of these services requires the use of IT infrastructure that, in some cases, such as in small and medium-sized institutions, can be an issue. This criterion assesses whether the repository is available as a public SPARQL endpoint.


$$D_{ava} = \begin{cases} 1 & \text{SPARQL endpoint available} \\ 0 & \text{otherwise} \end{cases}$$

Federated

New uses of the data include exploring federated queries to search across repositories (Dişli, Candela, et al., 2025). Wikidata facilitates the employment of federated SPARQL queries with a selection of repositories.[3] This criterion assesses whether the repository is federated with Wikidata.


$$D_{fed} = \begin{cases} 1 & \text{repository federated in Wikidata} \\ 0 & \text{otherwise} \end{cases}$$

Reusability

Providing examples of use is a key element in publishing digital collections (Candela, Gabriëls, et al., 2023; Dişli, Gabriëls, et al., 2025; Tønnessen & Birkenes, 2025). This criterion assesses whether the institutions provide examples of use in the form of prototypes, tools or reproducible code such as Jupyter Notebooks as promoted by international initiatives (Candela, Gabriëls, et al., 2023; Mahey et al., 2019).


$$D_{reuse} = \begin{cases} 1 & \text{provision of examples of use} \\ 0 & \text{otherwise} \end{cases}$$

Timeliness

Based on previous work (Candela et al., 2022), this criterion assesses whether the data is updated frequently. Figure 2 shows a SPARQL query to retrieve the last edit date of the selection of Art museums. To assess the update frequency of Art museums’ Wikidata entries, the following rating scale was established:

Figure 2

The SPARQL query for retrieving the last edit date of the assessed Art museums (https://w.wiki/FTnc) (queried on 2025-10-25).


$$D_{time} = \begin{cases} 1 & \text{updated within the last month} \\ 0.75 & \text{updated within the last 1–3 months} \\ 0.5 & \text{updated within the last 3–6 months} \\ 0.25 & \text{updated within the last 6 months–1 year} \\ 0 & \text{updated more than 1 year ago} \end{cases}$$
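
As a minimal sketch, the rating scale above can be written as a small function. The text gives the bands in months, so the day thresholds used here (30, 90, 180, and 365 days) are an assumption, as is the function name `timeliness_score`.

```python
from datetime import date


def timeliness_score(last_edit: date, assessed_on: date) -> float:
    """Map the age of the last edit to the timeliness scale.

    ASSUMPTION: one month is treated as 30 days and one year as 365 days.
    """
    days = (assessed_on - last_edit).days
    if days < 0:
        raise ValueError("last_edit is later than assessed_on")
    if days <= 30:
        return 1.0
    if days <= 90:
        return 0.75
    if days <= 180:
        return 0.5
    if days <= 365:
        return 0.25
    return 0.0


# The query date reported for Figure 2 is 2025-10-25.
print(timeliness_score(date(2025, 8, 1), date(2025, 10, 25)))
```

An edit dated 1 August 2025 is 85 days old on the query date, which falls in the 1–3 month band.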

Enrichment

This criterion assesses the ratio between the total number of resources linked to Wikidata and the total number of resources in the original repository. It provides information about which museums are connected to external repositories.

The total number of records in Wikidata was obtained using the query in Figure 1 and is presented in Table 1. The total number of resources in the original repository was manually retrieved from the museums’ websites and is presented in Table 2.

Table 2

Total number of resources in the original repository. Note that the total record values were manually retrieved from the museums’ websites.

| Museum | Total records |
|---|---|
| National Gallery of Art | 160000 |
| Metropolitan Museum of Art | 490000 |
| Louvre Museum | 500000 |
| National Gallery of Armenia | 25000 |
| National Museum in Warsaw | 830000 |
| Yale University Art Gallery | 191886 |
| Bavarian State Painting Collections | 30000 |
| Museo Egizio In Turin (IT) | 40000 |
| Finnish National Gallery | 42000 |
| Royal Museum of Fine Arts Antwerp | 10000 |
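
The ratio described for this criterion can be sketched from the counts in Table 1 and Table 2. The truncation to two decimals is an assumption: it reproduces the enrichment values reported in Table 3, but the text does not state how those values were rounded. The function name `enrichment` is likewise illustrative.

```python
# (records in Wikidata from Table 1, records in the source from Table 2)
RECORDS = {
    "National Gallery of Art": (128062, 160000),
    "Metropolitan Museum of Art": (71813, 490000),
    "Louvre Museum": (35946, 500000),
    "National Gallery of Armenia": (24057, 25000),
    "National Museum in Warsaw": (22044, 830000),
    "Yale University Art Gallery": (19659, 191886),
    "Bavarian State Painting Collections": (18018, 30000),
    "Museo Egizio In Turin (IT)": (10012, 40000),
    "Finnish National Gallery": (9200, 42000),
    "Royal Museum of Fine Arts Antwerp": (7984, 10000),
}


def enrichment(linked: int, total: int) -> float:
    """Ratio of linked records to source records.

    ASSUMPTION: the ratio is truncated (not rounded) to two decimals,
    using integer arithmetic to avoid floating-point surprises.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (linked * 100 // total) / 100


for museum, (linked, total) in RECORDS.items():
    print(f"{museum}: {enrichment(linked, total):.2f}")
```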

File repository

In addition to the metadata, institutions can also make their content, including images or textual descriptions, available to the public. Wikimedia Commons has emerged as a powerful service to publish multimedia collections of images and other material.[4] For instance, images can be used to enhance Wikipedia articles for various purposes, such as adding portraits to artist biographies and illustrating historical events and concepts. Wikimedia Commons has also incorporated structured data,[5] which links Wikidata to Wikimedia Commons files. Structured data helps images to be better linked (on Wikipedia and beyond), found, and described, including their visual attributes. This can facilitate querying and searching by visual information, which is particularly important for Art museums. This criterion assesses whether the Art museums have a Wikimedia Commons URL. Figure 3 shows a SPARQL example that uses the Wikidata property P373 to extract links.

Figure 3

The SPARQL query for retrieving the Wikimedia Commons links of the assessed Art museums (https://w.wiki/FTnp). Note that the VALUES instruction is employed to provide the Wikidata identifiers of the Art museums selected for this work.


$$D_{file} = \begin{cases} 1 & \text{Wikimedia Commons URL available} \\ 0 & \text{otherwise} \end{cases}$$
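
The pattern described for Figure 3, a VALUES clause listing the selected museums combined with property P373, might look like the sketch below. This is not the authors' query: the variable names, the OPTIONAL wrapper, and the helpers `build_commons_query` and `file_score` are assumptions made for illustration.

```python
# Wikidata identifiers of the assessed museums, taken from Table 1.
MUSEUM_QIDS = [
    "Q214867", "Q160236", "Q19675", "Q2087788", "Q153306",
    "Q1568434", "Q812285", "Q19877", "Q2983474", "Q1471477",
]


def build_commons_query(qids):
    """Assemble a SPARQL query for the Commons category (P373) of each museum."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    lines = [
        "SELECT ?museum ?museumLabel ?commonsCategory WHERE {",
        f"  VALUES ?museum {{ {values} }}",
        # OPTIONAL keeps museums without a P373 value in the result set.
        "  OPTIONAL { ?museum wdt:P373 ?commonsCategory . }",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }',
        "}",
    ]
    return "\n".join(lines)


def file_score(commons_category):
    """Return 1 when a Commons category value is present, otherwise 0."""
    return 1 if commons_category else 0


print(build_commons_query(MUSEUM_QIDS))
```

Wrapping P373 in OPTIONAL is a design choice: a museum without the property still appears as a row and can be scored 0 instead of silently dropping out of the results.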

Reporting the results

The last step of the process is reporting. The findings, including the similarities and differences between institutions, can be provided as a textual description (e.g., a PDF or README document) and also as data (e.g., CSV). This step is relevant to identifying best practices and gaps amongst institutions.

(4) Results and discussion

All museums included in the assessment provide basic descriptive information in Wikidata, such as description, museum type, inception, and country. In addition to these fields, the Louvre Museum and the National Museum in Warsaw also include opening hours. A wide range of further information is available for some museums, including social media, director, founder, coordinates, and logo. The completeness of descriptive information enhances the visibility of these museums in Wikidata.

To assess the criterion Understandability, it was examined whether additional documentation is provided in the Wikidata entries. In the Wikidata entry of the National Gallery of Art, documentation is provided in the “external data available at URL” section. The Wikidata entries of the other museums do not contain any documentation.

In assessing the Availability criterion, it was examined whether the museums provide access to their collection data via an API. In this context, a detailed web search was conducted to determine whether the assessed Art museums offer a SPARQL endpoint. As a result of this examination, museums that provide data access through an API or SPARQL endpoint were marked as “1”, while those that do not provide an API were marked as “0”. The National Gallery of Art, the Metropolitan Museum of Art, the Yale University Art Gallery, and the Finnish National Gallery make their data accessible via API; no API information was found for the other museums.

The criterion Federated assessed the federation of museums with Wikidata. This was determined by examining whether the museums assessed were included in the SPARQL Endpoints List provided by Wikidata. The list also indicates which SPARQL endpoints are federated. None of the museums assessed in this work is included in the aforementioned list.

The museums’ entries on Wikidata were examined to identify the presence of reusable examples (criterion Reusability), with particular attention paid to reusable code and shared GitHub repositories. The National Gallery of Art facilitates access to GitHub repositories under the section entitled “external data available at URL”, while the Metropolitan Museum of Art provides access through the “GitHub username” section. Reusable examples can be accessed via the links provided. No such links are available in the entries of the other museums.

To assess the Timeliness criterion of the museums’ representations in Wikidata, the last edit dates were retrieved via a SPARQL query (Figure 2) on 25 October 2025. Accordingly, museums updated within the last 1 month were scored as 1, those updated within 1–3 months as 0.75, within 3–6 months as 0.5, within 6 months–1 year as 0.25, and those not updated for more than 1 year as 0. As shown in Table 3, the museums’ Wikidata entries are up to date, with most having been updated within the last 1 month or 1–3 months.

Table 3

Assessment of open data compliance of Art museums in Wikidata based on the defined criteria.

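| Museum | D_und | D_ava | D_fed | D_reuse | D_time | D_enrich | D_file |
|---|---|---|---|---|---|---|---|
| National Gallery of Art | 1 | 1 | 0 | 1 | 1 | 0.80 | 1 |
| Metropolitan Museum of Art | 0 | 1 | 0 | 1 | 1 | 0.14 | 1 |
| Louvre Museum | 0 | 0 | 0 | 0 | 1 | 0.07 | 1 |
| National Gallery of Armenia | 0 | 0 | 0 | 0 | 0.75 | 0.96 | 1 |
| National Museum in Warsaw | 0 | 0 | 0 | 0 | 0.75 | 0.02 | 1 |
| Yale University Art Gallery | 0 | 1 | 0 | 0 | 0.75 | 0.10 | 1 |
| Bavarian State Painting Collections | 0 | 0 | 0 | 0 | 0.75 | 0.60 | 1 |
| Museo Egizio In Turin (IT) | 0 | 0 | 0 | 0 | 1 | 0.25 | 1 |
| Finnish National Gallery | 0 | 1 | 0 | 0 | 1 | 0.21 | 1 |
| Royal Museum of Fine Arts Antwerp | 0 | 0 | 0 | 0 | 1 | 0.79 | 1 |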

In assessing the Enrichment criterion, the ratio between the total number of records in Wikidata and the total number of resources in the original repository was determined. A ratio close to 1 indicates that the museum is highly integrated with Wikidata. A large portion of the National Gallery of Armenia’s original collections is integrated into Wikidata. In contrast, only about 2% of the National Museum in Warsaw’s collections are present in Wikidata.

To assess whether the museums provide content on Wikimedia Commons through structured data, their Wikimedia Commons links were retrieved using a SPARQL query (Figure 3). As shown in Table 3, in the File Repository criterion, all the museums have content in Wikimedia Commons.

In light of all these assessments, it can be stated that the National Gallery of Art demonstrates the highest level of open data compliance maturity and can be considered a best practice example.
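
One way to make this comparison concrete is to add up the Table 3 values per museum. The article does not define an aggregate score, so the unweighted sum below is purely an illustrative assumption, as are the names `SCORES` and `total_score`. Other weightings could rank the lower-scoring museums differently.

```python
# Criterion order: und, ava, fed, reuse, time, enrich, file (values from Table 3).
SCORES = {
    "National Gallery of Art": (1, 1, 0, 1, 1, 0.80, 1),
    "Metropolitan Museum of Art": (0, 1, 0, 1, 1, 0.14, 1),
    "Louvre Museum": (0, 0, 0, 0, 1, 0.07, 1),
    "National Gallery of Armenia": (0, 0, 0, 0, 0.75, 0.96, 1),
    "National Museum in Warsaw": (0, 0, 0, 0, 0.75, 0.02, 1),
    "Yale University Art Gallery": (0, 1, 0, 0, 0.75, 0.10, 1),
    "Bavarian State Painting Collections": (0, 0, 0, 0, 0.75, 0.60, 1),
    "Museo Egizio In Turin (IT)": (0, 0, 0, 0, 1, 0.25, 1),
    "Finnish National Gallery": (0, 1, 0, 0, 1, 0.21, 1),
    "Royal Museum of Fine Arts Antwerp": (0, 0, 0, 0, 1, 0.79, 1),
}


def total_score(values):
    """ASSUMPTION: every criterion carries equal weight."""
    return round(sum(values), 2)


ranking = sorted(SCORES, key=lambda m: total_score(SCORES[m]), reverse=True)
for museum in ranking:
    print(f"{total_score(SCORES[museum]):.2f}  {museum}")
```

Under this equal-weight assumption the National Gallery of Art comes first, which is consistent with the statement above.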

Discussion

This work has limitations in scope and coverage. Regarding the provided compliance criteria, additional topics could be considered as an extension of this work. Other approaches to assess LOD repositories use Shape Expressions to analyse the metadata provided (Candela, 2023).

Note that in some cases, an institution may have several properties to link several types of resources, such as artworks and artists.[6]

In addition to Wikimedia Commons, alternative services could be used to extend this work, such as GitHub, Zenodo, and other platforms.

Regarding machine-readability, many original sources do not include the kind of machine-readable metadata that enrichment through Wikidata provides. By using machine-readable metadata, institutions can explore new uses of the data.

Our methodology for assessing the Timeliness criterion of museums based on “last edit” timestamps has a key limitation. A recent edit to a museum’s Wikidata entry may not originate with the institution itself or an official partner, such as a Wikimedian in Residence, but with a dedicated volunteer editor. For instance, while The Metropolitan Museum of Art formally concluded its Wikidata initiative years ago, recent edits associated with its collection are likely the work of the volunteer community. Consequently, timeliness alone is an insufficient metric for determining if a cultural organisation is actively contributing to the linked open data ecosystem. Future research would benefit from more nuanced methods to distinguish between institutional and crowdsourced contributions.

For the assessment of the Availability criterion, a web-based investigation was conducted to determine whether the museums provide a SPARQL endpoint. It should be noted that an endpoint may exist but remain undiscovered. Nevertheless, the findings clearly indicate that these museums do not offer easily accessible APIs.

In addition to the technical dimension of accessibility, the biases inherent in Wikidata may also affect accessibility (Ford & Wajcman, 2017; Mandiberg, 2023). Although biases were excluded from the criteria of this work, it remains an important issue that should be addressed in future research on the accessibility of Wikidata.

Finally, there are two key criteria, the provision of machine-readable metadata and clear licencing information, which were not part of Table 3. Our analysis revealed that these are not binary properties of an institution but rather emergent characteristics of digital collections. We propose reframing these as quantifiable “metadata footprints” for future research.

Machine-Readability

The provision of machine-readable formats is essential for computational access, data enrichment, and reuse (Candela, Gabriëls, et al., 2023; Dişli, Gabriëls, et al., 2025; Faerber et al., 2017). While traditional formats like MARC[7] represent one type of machine-readable data, modern practices include a spectrum of formats, notably LOD. As a preliminary assessment, the Machine-Readability criterion was assessed by examining randomly selected collections of the Art museums to determine whether they provide metadata in machine-readable formats such as MARC. The analysis revealed that the Louvre Museum, the National Museum in Warsaw, and the Yale University Art Gallery provide metadata in machine-readable formats. The other Art museums, on the other hand, were found to share information about their collections mostly in the form of textual descriptions. However, more comprehensive evaluations indicated that a simple binary assessment (available/not available) is insufficient. Future work should instead calculate the percentage of records within an institution’s collection that are available in a machine-readable format, providing a more nuanced measure of their computational accessibility.

Licencing Information

The clear provision of licences is a critical factor for data reuse, particularly for researchers (Dişli & Candela, 2025). Our initial assessment using the licence property (wdt:P275) on museum entities yielded limited results, as this property is rarely applied to institutions themselves. We determined that licence information is typically asserted on digital representations of individual artworks.

An alternative analysis using the copyright status property (wdt:P6216) on artworks proved more fruitful. To overcome SPARQL query-timeout limitations caused by data volume, we developed an R script to calculate the proportion of artworks with licence and copyright information per institution. The analysis revealed that:

  • wdt:P275 (licence) was used almost exclusively by the Metropolitan Museum of Art among the assessed institutions.

  • wdt:P6216 (copyright status) provided more comprehensive coverage across multiple institutions.

This detailed, record-level recording of rights information enables meaningful comparison through what we term metadata footprints. This approach aligns with methodologies that examine how cultural institutions document metadata, allowing researchers to trace and compare licencing practices (Gutiérrez De la Torre et al., 2021). As shown in Figure 4, visualising these footprints enables an at-a-glance comparison of how different institutions manage rights metadata across their collections, moving beyond a simplistic yes/no.
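
The footprint idea can be sketched as a per-institution proportion of artworks carrying each rights property. The authors report using an R script, so the Python version below is a swapped-in illustration and not their code. The record layout, the function name `rights_footprint`, and the sample records are hypothetical.

```python
from collections import defaultdict


def rights_footprint(artworks):
    """Return, per museum, the share of artworks with P275 and with P6216.

    Each artwork is a dict with a "museum" key and optional "P275"
    (licence) and "P6216" (copyright status) keys.
    """
    totals = defaultdict(int)
    with_licence = defaultdict(int)
    with_copyright = defaultdict(int)
    for art in artworks:
        museum = art["museum"]
        totals[museum] += 1
        if art.get("P275"):
            with_licence[museum] += 1
        if art.get("P6216"):
            with_copyright[museum] += 1
    return {
        museum: {
            "P275": with_licence[museum] / totals[museum],
            "P6216": with_copyright[museum] / totals[museum],
        }
        for museum in totals
    }


# Made-up records for illustration only; these are not results from the article.
SAMPLE = [
    {"museum": "Museum A", "P6216": "public domain"},
    {"museum": "Museum A"},
    {"museum": "Museum B", "P275": "CC0", "P6216": "public domain"},
]
print(rights_footprint(SAMPLE))
```

Computing the proportions offline from exported records, as the text describes, avoids running a single large aggregate query against the public endpoint.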

Figure 4

Metadata footprint of licences across institutional collections. This chart illustrates the varying proportions of artworks with documented licence or copyright status within each museum’s Wikidata records.

(5) Implications/Applications

This work is the first approach to analyse how Art museums and Wikidata are connected. It explores the benefits and gaps and provides best practices for museums willing to start using Wikidata to enrich and reuse their digital collections.

More importantly, because it clearly highlights the geographical bias in Wikidata, it can also be seen as a call to action: all the top museums in Wikidata (by number of records) are located in the Global North (Pereda et al., 2025). This is not a coincidence, but rather a reflection of the material and institutional resources required for the sustained digital cultural work that facilitates integration with platforms like Wikidata. This disparity, however, risks creating and reinforcing digital silos that reproduce the unequal global distribution of knowledge. By mapping this limitation, our article aims to raise awareness of this inequity and contribute to scholarly and practical efforts to diversify the digital cultural sphere. In the context of these inequalities in data sharing and open data practices, we particularly emphasise the importance of considering the CARE (Collective benefit, Authority to control, Responsibility, Ethics) Principles, which highlight that the sharing of Indigenous data should not only ensure accessibility but also uphold ethical principles and promote collective benefit (Carroll et al., 2020). In addition to the CARE Principles, it is argued that the primary reasons certain criteria discussed in this study are not fully met stem from the absence of clear open data policies, documentation, and implementation guidelines within museums. The implementation of such measures would not only facilitate the integration of museums and Wikidata, but also support the development of more inclusive, equitable, and open practices within the cultural heritage ecosystem.

Data Accessibility Statement

https://doi.org/10.5281/zenodo.17440061

Notes

[1] SPARQL is a query language and protocol utilised for the retrieval and manipulation of data stored in RDF (Resource Description Framework) format.

[2] The term ‘API’ (Application Programming Interface) refers to a set of rules that enable different software applications to interoperate.

[6] See, for example, the Prado Museum (Q160112), which has two properties to link artists (P5321) and artworks (P8905).

[7] MAchine-Readable Cataloging.

Acknowledgements

We would like to thank Olga Holownia for her help during the preliminary versions of the text. A reduced version of this work was presented at the LOD4 Community event in 2024 and the VIII Jornadas sobre Bibliotecas de Museos in 2025.

Competing interests

The authors have no competing interests to declare.

Author Contributions

Conceptualisation – all authors; Methodology – all authors; Writing – original draft – all authors; Writing – review and editing – all authors.

DOI: https://doi.org/10.5334/johd.438 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 26, 2025
Accepted on: Nov 26, 2025
Published on: Dec 12, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Meltem Dişli, Gustavo Candela, Silvia Gutiérrez, Giovanna Fontenelle, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.