Piloting Wikidata as an Authority Identifier: The (In)visible Women Project at the Smithsonian Institution

Open Access | Feb 2026

(1) Overview

Repository location

Wikidata American Women’s History Initiative Focus List

https://www.wikidata.org/wiki/Special:WhatLinksHere/Q62695096

Smithsonian Figshare: 10.25573/data.30898958

Context

The Smithsonian Institution consists of 21 museums, 21 libraries, and 14 education and research centers. A vast amount of data is created and managed across these “units,” including information about people associated with Smithsonian collections. Several Smithsonian units maintain people records, but before our project many of these records did not include authority identifiers, and these “local” names were siloed within unit databases. The (In)Visible Women project: 1) gathered existing identifiers; 2) built data sharing practices by developing Smithsonian-wide guidelines for creating Wikidata identifiers for people; and 3) ingested all authority identifiers gathered and created into Smithsonian unit databases.

Wikidata is an open, multilingual, structured database in which each record is indexed by a unique identifier, or Q number (Wikidata, n.d.). All records can express statements, which connect records relationally. Compared to other authority identifier systems, Wikidata has a relatively low barrier to entry. It is extremely flexible: it sets no minimum or maximum number of statements, allows editors to cite sources and express doubt, and offers methods for documenting multiple and/or conflicting values. Smithsonian leadership guided our project toward Wikidata, which has a history of use as an authority identifier in cultural heritage institutions (Van Veen, 2019; Martin et al., 2022). In fact, Smithsonian Libraries and Archives administers its own internal Wikibase (Naples, 2022; Shieh, 2024; 2022).
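As a small illustration of how a Q number works as an identifier, the sketch below pulls a Q number out of a Wikidata URL and builds the corresponding entity URI. The function names are hypothetical and are not part of the project's tooling; the example input is the focus list URL given under Repository location.

```python
import re
from typing import Optional

# A Q number is the letter Q followed by a positive integer (no leading zero).
_QID = re.compile(r"\bQ[1-9]\d*\b")


def extract_qid(text: str) -> Optional[str]:
    """Return the first Q number found in a string or URL, or None."""
    match = _QID.search(text)
    return match.group(0) if match else None


def entity_uri(qid: str) -> str:
    """Build the linked-data entity URI for a Q number."""
    if not _QID.fullmatch(qid):
        raise ValueError(f"not a Q number: {qid!r}")
    return f"http://www.wikidata.org/entity/{qid}"


focus_list_qid = extract_qid(
    "https://www.wikidata.org/wiki/Special:WhatLinksHere/Q62695096"
)
focus_list_uri = entity_uri(focus_list_qid)
```

Storing the bare Q number and deriving URLs on demand keeps a local database independent of any one URL form.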

Our use of Wikidata places our work in conversation with broader discussions surrounding the use of linked data to document and share cultural heritage information (Hyvönen, 2012; Goddard, 2010; Hawkins, 2022; Davis & Heravi, 2021; Nuno & Antoine, 2020). Much of this work focuses on the potential benefits and applications of linked data (Sorensen et al., 2023; Fenlon et al., 2025; Marsh et al., 2024; Milbrodt, 2025). However, we have also been mindful of Wikidata’s public nature. Indigenous cultural heritage professionals and scholars are critically considering Wiki platforms, what they offer, and the risks to Indigenous data sovereignty that these sites can pose (Carlson & Rana, 2024; Thorpe et al., 2023). We approached our project mindful of Wikidata’s risks and affordances (Bunout et al., 2025).

The (In)Visible Women Project dovetails with ongoing work across the cultural heritage and digital humanities fields. Scholarship has long documented how ideologies about gender and race have shaped our digital technologies, particularly the development, design, and building of these technologies (Wen, 2014; Nakamura, 2014; Chun, 2004) and the data therein (D’Ignazio & Klein, 2020; Noble, 2018; Kitchin, 2014; 2017). Museum collecting and labor practices have also historically devalued women as artists, scholars, and professionals, which is reflected as silences in associated collections (Heitman, 2017; Rayburn, 2025; Baeza Ruiz, 2018; Turner, 2020). Several digital humanities projects aim to correct gendered bias in the historical record and more fully document the contributions and identities of women, including The Orlando Project (University of Alberta, 2025), The Women Writers Project (The NULab, 2025), the Mrs. Project (Mattson, 2025), the Funk List (Dikow & Tsuchiya, n.d.), and Women in Red (Wikiproject Women in Red, n.d.). We see our contributions to Wikidata’s American Women’s History Initiative Focus List as working in tandem with these efforts.

This focus list is dynamic and results from the work of many Wikidata editors. The first author contributed 747 records to this dataset with more records linked on an ongoing basis by Smithsonian staff and potentially other editors. We do not claim to have developed all the records within this dataset, and all edits can be traced within Wikidata to specific editors.

(2) Method

Steps

Phase One: May 2022–April 2023

Units across the Smithsonian use different collections information systems, including EMu (Electronic Museum), TMS (The Museum System), and ArchivesSpace. These systems do not speak to each other and units prioritize different data points to include within their people records, as reflected in their collecting policies and scopes. These workflows often result in duplicate people records across the Smithsonian.

Phase One of the project focused on data cleanup. A contractor disambiguated names provided by the 10 participating Smithsonian units and associated existing identifiers from external sources with people records. Each participating Smithsonian unit sent datasets of person names and relevant biographical information. Using OpenRefine, the contractor located and imported identifiers for each person name from several databases: Social Networks and Archival Context (SNAC), Union List of Artist Names (ULAN), Virtual International Authority File (VIAF), and Wikidata. Library of Congress Name Authority File (LCNAF) identifiers were imported manually, either by following links from other identifiers or by searching the Library of Congress Name Authorities. OpenRefine accessed identifiers via APIs, which automated part of this process; even so, each name required research and disambiguation to find the appropriate identifiers. Each relevant unit received a spreadsheet of the gathered identifiers for quality checks.
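The kind of API lookup that OpenRefine automates can be sketched as follows. This is an illustration only, not the contractor's actual configuration: it builds a batch payload in the general shape used by reconciliation services (a JSON object of keyed queries), restricting candidates to the Wikidata type Q5 (human). The type restriction, the result limit, and the names are all assumptions, and the placeholder names stand in for real person names.

```python
import json


def build_reconciliation_queries(names, type_id="Q5", limit=3):
    """Build a JSON batch of reconciliation queries, one per name.

    type_id and limit are illustrative defaults, not project settings.
    """
    queries = {
        f"q{i}": {"query": name, "type": type_id, "limit": limit}
        for i, name in enumerate(names)
    }
    return json.dumps(queries, ensure_ascii=False)


# Placeholder names only; a real batch would come from a unit's export.
payload = build_reconciliation_queries(
    ["Placeholder Name One", "Placeholder Name Two"]
)
```

A service of this kind returns ranked candidates rather than a single answer, which is consistent with the point above that each name still required research and disambiguation.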

Of the more than 20,000 people records reviewed by the contractor, over 8,700 had existing external identifiers. The remaining names, more than 11,600, had been created locally (see Table 1).

Table 1

Number of person names reviewed during phase one of the (In)Visible Women project by Smithsonian unit.

SMITHSONIAN UNIT | TOTAL NAMES | NAMES w/ IDS | NAMES w/out IDS
Archives of American Art | 2,245 | 1,515 | 730
Archives of American Gardens | 50 | 40 | 10
Anacostia Community Museum | 168 | 96 | 72
Anacostia Community Museum Archives | 698 | 480 | 218
National Museum of Asian Art Archives | 410 | 345 | 65
National Museum of African American History and Culture | 3,069 | 1,770 | 1,299
National Museum of the American Indian (ArchivesSpace) | 459 | 437 | 22
National Museum of the American Indian (EMu) | 8,905 | 1,513 | 7,392
National Portrait Gallery | 2,523 | 695 | 1,828
Smithsonian American Art Museum | 1,834 | 1,834 | 0
TOTAL | 20,361 | 8,725 | 11,636
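The figures in Table 1 can be checked with a few lines of code. The sketch below copies the per-unit counts from the table, confirms that each row's two sub-counts add up to its total, and recomputes the column totals. The variable and function names are illustrative.

```python
# (total names, names with IDs, names without IDs), copied from Table 1.
TABLE_1 = {
    "Archives of American Art": (2245, 1515, 730),
    "Archives of American Gardens": (50, 40, 10),
    "Anacostia Community Museum": (168, 96, 72),
    "Anacostia Community Museum Archives": (698, 480, 218),
    "National Museum of Asian Art Archives": (410, 345, 65),
    "National Museum of African American History and Culture": (3069, 1770, 1299),
    "National Museum of the American Indian (ArchivesSpace)": (459, 437, 22),
    "National Museum of the American Indian (EMu)": (8905, 1513, 7392),
    "National Portrait Gallery": (2523, 695, 1828),
    "Smithsonian American Art Museum": (1834, 1834, 0),
}


def column_totals(rows):
    """Check each row for internal consistency and return the column sums."""
    for unit, (total, with_ids, without_ids) in rows.items():
        if with_ids + without_ids != total:
            raise ValueError(f"row does not add up: {unit}")
    return tuple(sum(column) for column in zip(*rows.values()))


totals = column_totals(TABLE_1)
# Share of reviewed names that had no external identifier, in percent.
share_without_ids = 100 * totals[2] / totals[0]
```

The recomputed totals match the TOTAL row, and the share works out to roughly 57 percent of reviewed names lacking an external identifier.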

Phase Two: May 2024–August 2025

During this portion of the project, we created 747 Wikidata records, drawing specifically on the local names. We also edited 720 existing people records in Wikidata, linking them back to the Smithsonian collections with which they are affiliated. We linked every record we created or edited to the American Women’s History Initiative Focus List on Wikidata.

The information that we entered into these records is based on a data model we developed collaboratively with the participating units. At the start of phase two, we conducted informational interviews with the project partners using semi-structured interview methods (Lui & Wildemuth, 2009). During the interviews, we asked questions about data management within the interviewees’ units, their database workflows, what types of information they find important to capture about people represented in their collections, and what names they see as a priority for Wikidata record development. Based on our preliminary work with Wikidata and insights gleaned via thematic coding (Saldana, 2016), we developed a flexible data model available on our Wikiproject page. Every field is optional to account for varying degrees of data maintained across Smithsonian units.

With the support of Smithsonian colleagues, we ingested all external identifiers gathered and created during phases one and two into multiple Smithsonian databases, including ArchivesSpace, TMS, and EMu. We have also collaborated with participating unit representatives and with the Office of Digital and Innovation to build shared practices on where to document external authority identifiers in the various databases used at the Smithsonian. For example, in ArchivesSpace, identifiers are ingested into the record IDs section, and in TMS, identifiers are ingested as URLs in the alternate numbers field.
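Storing identifiers as URLs, as described for TMS, amounts to a simple mapping from an identifier scheme and value to a web address. The sketch below shows one way to express that mapping. The URL patterns are commonly used public forms and are assumptions here, not a record of what the project stored; the function name and scheme keys are hypothetical.

```python
# Assumed public URL patterns for each authority source.
URL_PATTERNS = {
    "wikidata": "https://www.wikidata.org/wiki/{id}",
    "viaf": "https://viaf.org/viaf/{id}",
    "ulan": "http://vocab.getty.edu/ulan/{id}",
    "lcnaf": "https://id.loc.gov/authorities/names/{id}",
    "snac": "https://snaccooperative.org/ark:/99166/{id}",
}


def identifier_url(scheme: str, identifier: str) -> str:
    """Turn a scheme name and a bare identifier into a URL."""
    try:
        pattern = URL_PATTERNS[scheme.lower()]
    except KeyError:
        raise ValueError(f"unknown identifier scheme: {scheme!r}") from None
    return pattern.format(id=identifier.strip())


# The focus list item itself, used as a known-good Wikidata example.
example_url = identifier_url("Wikidata", " Q62695096 ")
```

Keeping the patterns in one table means a change to an external site's URL form needs only one edit before the next ingest.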

Because Wikidata is a public and collaborative platform, our work there prompted us to consider when we should not make a Wikidata record for a given person. Based on conversations with the Smithsonian Privacy Office, we developed a decision tree that considers a number of legal and privacy questions (see Figure 1). For example, in Smithsonian records, is this person a named donor, archival creator, or artist/maker? If so, Smithsonian staff are in a position to share information about that person (Smithsonian Institution, 2024). In developing the decision tree, we drew on research concerning legal definitions of who counts as a public figure (Legal Information Institute, 2020), scoping outlines from ULAN and LCNAF (Library of Congress, n.d.; Getty Research Institute, 2024), and notability guidance from Wikipedia (Wikipedia Notability, n.d.; Wikipedia Notability (people), n.d.; Wikipedia Notability (academics), n.d.).

Figure 1

Notability decision tree.

Data Model and Dataset Limitations

Even though this project aimed to remedy persistent data gaps, the resulting dataset has limitations that stem from the institutions and data models that shaped it. The names in this dataset reflect the collecting policies of the partnering Smithsonian units, each with its own history and focus. Further, the databases in use at each Smithsonian unit have different search capacities. Some of the database exports of people names were scoped by a recorded gender identity and others were not.

We also encountered affordance challenges in using Wikidata as an authority identifier source. For example, Wikidata lacks accurate statement options for documenting Tribal affiliation. We considered a number of potential statements, such as “country of citizenship” and “ethnic group,” but these options felt either inaccurate or inappropriate for our use as digital curators. We eventually settled on including Tribal affiliation in the description field, where it is searchable. This decision limits possible SPARQL queries, however: because the affiliation is recorded as free text and not as a statement, queries cannot be scoped by Tribal affiliation. Additionally, our data model does not include a “sex or gender” statement, as in the majority of cases we did not know gender identity from the perspective of the individual reflected in a given record (see The Trans Metadata Collective et al., 2025).
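To make the limitation concrete, the sketch below builds a SPARQL query that can only match text inside the English description, since there is no statement to filter on. It is a hedged illustration: the use of property P5008 ("on focus list of Wikimedia project") to select focus list items is our assumption about how the list is linked, and the function name and search term are placeholders.

```python
QUERY_TEMPLATE = """
SELECT ?item ?itemLabel ?desc WHERE {
  ?item wdt:P5008 wd:Q62695096 ;   # assumed focus-list property
        schema:description ?desc .
  FILTER(LANG(?desc) = "en")
  FILTER(CONTAINS(LCASE(?desc), "__TERM__"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def description_query(term: str) -> str:
    """Return a query that matches a term inside English descriptions."""
    # Escape backslashes and double quotes so the term stays a valid
    # SPARQL string literal.
    safe = term.lower().replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.replace("__TERM__", safe)


# Placeholder search term; a real query would use wording from a description.
query = description_query("Example Term")
```

A text match of this kind depends on exact wording, so spelling variants or alternate names in a description would be missed, which is the practical cost of recording the affiliation outside a statement.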

(3) Dataset Description

Repository name

Wikidata; Smithsonian Figshare

Object name

American Women’s History Initiative Focus List

Format names and versions

Wikidata Linked Open Data; AWHI_wikidata_items.json

Creation dates

Author contribution dates May 2024–August 2025

Dataset creators

The authors and all Wikidata editors documented within Wikidata editing history

Language

English

License

CC0

(4) Reuse Potential and Outcomes

The American Women’s History Initiative Focus List includes biographical information about many American women and their artistic, historical, and philanthropic achievements. Researchers can reuse this data source when conducting research associated with anyone on the list. Additionally, cultural heritage institutions looking to develop and share their own datasets with Wikidata can draw on our data model for curation guidance, and on our notability tree to learn about our approach to privacy and organizational risk.

Some reuse cases are made possible by Wikidata itself. Even with limited coding experience, digital humanists can use the Wikidata Query Service to formulate quantitative research questions. Many Wikidata tools and extensions operate across Wikimedia projects. Women in Red, one of the most prominent Wikipedia projects, uses the Wikidata tool Listeria to automatically generate and update lists of “red links,” or names of women with Wikidata items but no Wikipedia articles. The American Women’s History Initiative Focus List items constitute at least 800 of the names on these lists.
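A minimal sketch of a Wikidata Query Service request follows. It counts the items linked to the focus list, assuming they are linked through property P5008 ("on focus list of Wikimedia project"); that property choice, the function names, and the User-Agent string are our assumptions. The request URL is built locally, and the function that sends it is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Counts items assumed to be linked to the focus list via P5008.
COUNT_QUERY = """
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P5008 wd:Q62695096 .
}
"""


def build_request_url(sparql: str) -> str:
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return f"{ENDPOINT}?{params}"


def run_query(sparql: str, user_agent: str = "example-script/0.1 (placeholder)"):
    """Send the query and return the parsed JSON response (needs network)."""
    request = urllib.request.Request(
        build_request_url(sparql), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


request_url = build_request_url(COUNT_QUERY)
```

Because the focus list is dynamic, any count returned will change over time and should be reported with the date of the query.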

Smithsonian staff are already demonstrating the reuse potential of this dataset and the viability of our workflows. National Museum of the American Indian (NMAI) Archives staff regularly create Wikidata and ArchivesSpace records in tandem. Smithsonian American Art Museum staff have also assembled a list of new artists added to their collection and ingested a new set of identifiers into TMS.

Acknowledgements

We thank Kara Lewis for her leadership. During phase one and phase two, she spearheaded this work in partnership with Rachel Menyuk. Many thanks to all the Smithsonian project partners and collaborators who made this work possible. Lastly, one big thank you to the American Women’s History Initiative and the Smithsonian American Women’s Museum for their financial support.

Competing interests

Amanda Sorensen was a copyeditor for the Journal of Open Humanities Data from winter 2021 to July 2025. This article was drafted and edited after she stepped away from this volunteer position.

From May 2024 to August 2025, Amanda Sorensen was employed as a contractor at the National Museum of the American Indian, where she managed phase two of the (In)Visible Women Project. Her contributions to the American Women’s History Initiative Focus List were made as a paid Wikidata editor.

Author Contributions

Amanda Sorensen: data curation, investigation, methodology, formal analysis, writing – drafting, editing, and reviewing.

Rachel Menyuk: conceptualization, supervision, funding acquisition, project administration, writing – review and editing.

Deborah Shapiro: conceptualization, resources, writing – review and editing.

Richard Naples: conceptualization, resources, writing – review and editing.

DOI: https://doi.org/10.5334/johd.467 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 10, 2025 | Accepted on: Jan 5, 2026 | Published on: Feb 3, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Amanda H. Sorensen, Rachel Menyuk, Deborah Shapiro, Richard Naples, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.