
From Websites to Wikidata: Digitising Scotland’s Stories

Open Access | Feb 2026

(1) Overview

Repository location

Both datasets are available on Figshare: https://doi.org/10.6084/m9.figshare.30585410 (Shaw et al., 2025) and https://doi.org/10.6084/m9.figshare.30585428 (Young et al., 2025)

Context

The data for these two projects was produced as part of CS4099 (University of St Andrews, 2026), a module undertaken by Computer Science students in the final year of their Honours degree at the University of St Andrews. The module requires a substantial piece of research, frequently inspired by real-world challenges, in collaboration with a member of staff over the course of an academic year. The software engineering approaches applied were similar, but each student experienced unique challenges posed by the original sources and subsequent cleaning of the data. The theses created are titled: ‘Mapping Memorials to Women in Scotland’ (Shaw, 2024), using the Women’s History Scotland website (Women’s History Scotland, 2012) as its data source, and ‘Visualising the foundations of Scotland’s industrial past’ (Young, 2024), using the Scottish Brick History website (Cranston, 2014) as its source. The data on both websites appears predominantly as loosely structured continuous prose. Both theses and code bases are available on request. The web output for the latter project is also available online and is an example of how such datasets could be creatively re-used (Young, 2025).

(2) Method

Step 1 – Web scrapers

Both projects built web scrapers in Python using Beautiful Soup (Richardson, 2019). Both scrapers were designed to complete two tasks: (1) extract data from a website and (2) convert data into a .csv format, ready for cleaning.
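The two tasks can be illustrated with a minimal sketch. The projects used Beautiful Soup; to stay free of third-party dependencies, this sketch swaps in Python's standard-library html.parser instead, so it is not the authors' code. The field list, the sample HTML, and the names FieldParser, scrape, and to_csv are assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical subset of field names; the real pages may use other classes.
FIELDS = ["birth", "death", "inscription", "location"]


class FieldParser(HTMLParser):
    """Collect the text of <div> elements whose class is listed in FIELDS."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # field currently being read, if any
        self._depth = 0     # <div> nesting depth inside that field

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._field is not None:
            self._depth += 1  # nested <div> inside a field
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in FIELDS:
            if name in classes:
                self._field, self._depth = name, 1
                self.record[name] = ""
                break

    def handle_endtag(self, tag):
        if tag == "div" and self._field is not None:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] += data


def scrape(html):
    """Task 1: extract one record (field name -> cleaned text) from a page."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the page layout.
    return {key: " ".join(value.split()) for key, value in parser.record.items()}


def to_csv(records):
    """Task 2: serialise records as CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample markup, not taken from either website.
SAMPLE = """
<div class="birth">c. 1600</div>
<div class="inscription">An <em>invented</em>
   inscription</div>
"""

print(to_csv([scrape(SAMPLE)]))
```

Fields that are absent from a page are written as empty cells, which keeps every row aligned with the header before the manual cleaning stage.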

Step 2a – Tasks specific to Women’s History Scotland scraping and cleaning

To create the Women of Scotland memorials dataset from the Women’s History Scotland website, the web scraper needed to collect information about the memorials as well as the individuals they honour. Before saving the data to a .csv file, the program identified the memorial and/or the subject from the unique end of the URL, i.e. the final path segment (memorial-jenny-geddes in the example https://womenofscotland.org.uk/memorials/memorial-jenny-geddes). This was used to populate the ‘Women of Scotland memorial ID’ (Wikidata property P8048 (Wikimedia community, 2020a)) or ‘Women of Scotland subject ID’ (Wikidata property P8050 (Wikimedia community, 2020b)) for the Wikidata entries. The scraper also extracted div elements with ‘class’ attributes named ‘birth’, ‘death’, ‘dedicated to’, ‘inscription’, ‘location’, ‘erected by’, ‘location name’, ‘street address’, ‘city province’, and ‘postal code’. Substantial manual data cleaning was required because the inscription and material fields varied widely. These were standardised on a case-by-case basis to ensure consistency, e.g., by removing editor notes. Additional fields relating to historic counties and current Scottish council areas were included using both Wikidata (Wikimedia community, 2025b) and Towns, Counties, Postcodes, UK! (Towns, Counties, Postcodes, UK!, 2025) as sources. Once the initial data cleaning was complete, the data was moved into OpenRefine version 3.6.2 (Open source community, 2012) so that it could be cross-referenced and reconciled with Wikidata. Most data (e.g., material, inception date) could be matched to existing Wikidata items, but matching was less straightforward for the subjects of the memorials. For subjects, it was necessary to check married and maiden names, nicknames or variations, with and without middle names. If that failed, web searches were performed for male relatives to try to identify the subject. If these avenues were exhausted, a new Wikidata item was created for the subject.

Once reconciliation was complete, data was uploaded to Wikidata in batches according to the schema in Tables 1 and 2 below.

Table 1

Women of Scotland schema for uploads via OpenRefine, specifically for the memorials.

| WIKIDATA STATEMENT LABEL | WIKIDATA PROPERTY (NUMERICAL VALUE) | DESCRIPTION |
|---|---|---|
| Women of Scotland memorial ID | P8048 | The Women of Scotland project identifier for a memorial to a woman in Scotland |
| Coordinate location | P625 | Coordinates of the subject |
| Instance of | P31 | The class of which this subject is a particular example and member. For example, St Margaret’s Chapel is an instance of a memorial |
| Depicts | P180 | Entity visually depicted in an image |
| Made from material | P186 | Material the subject is made of or derived from |
| Historic county | P7959 | Traditional, geographical division of Great Britain and Ireland |

Table 2

Women of Scotland schema for uploads via OpenRefine, specifically for the subject(s) commemorated by the memorial.

| WIKIDATA STATEMENT LABEL | WIKIDATA PROPERTY (NUMERICAL VALUE) | DESCRIPTION |
|---|---|---|
| Women of Scotland subject ID | P8050 | Subject identifier for a woman in the Women of Scotland project |
| Instance of | P31 | The class of which this subject is a particular example and member. For example, Isabella Elder is an instance of a human |
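
Two of the Step 2a tasks lend themselves to a short sketch: deriving the identifier from the end of a URL, and listing name variants to try during reconciliation. This is an illustration under stated assumptions rather than the project's code. The function names url_slug and name_variants are hypothetical, and the person's name in the usage example is invented.

```python
from urllib.parse import urlparse


def url_slug(url):
    """Return the last non-empty path segment of a URL (no trailing slash)."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if not segments:
        raise ValueError(f"no path segment in {url!r}")
    return segments[-1]


def name_variants(forenames, married_surname, maiden_surname=None):
    """List name forms to try when reconciling a subject.

    Each surname (married, then maiden if known) is paired with the first
    forename only, and then with all forenames if there are middle names.
    """
    first, *middle = forenames.split()
    surnames = [married_surname]
    if maiden_surname:
        surnames.append(maiden_surname)
    variants = []
    for surname in surnames:
        variants.append(f"{first} {surname}")
        if middle:
            variants.append(f"{forenames} {surname}")
    return variants


print(url_slug("https://womenofscotland.org.uk/memorials/memorial-jenny-geddes"))
# Invented name, used only to show the shape of the output.
print(name_variants("Jane Mary", "Smith", maiden_surname="Brown"))
```

Nicknames and spelling variations are not covered by this sketch; in the project those cases were checked manually.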

Step 2b – Tasks specific to Scottish Brick History scraper and cleaning

To create the Scottish Brick History dataset from the Scottish Brick History website, the web scraper needed to collect information about the industrial locations but not the individuals involved in brick manufacture or the physical bricks. This scope was agreed, following detailed discussions with M.C., to make the project feasible in the time available. Before saving the data to a .csv file, the program separated the name of the works from the location and extracted the unique end of the URL to be used as the ‘Scottish Brick History Brick & Tileworks ID’ (Wikidata property P8700 (Wikimedia community, 2020c)) on the Wikidata entries. Because the Scottish Brick History website is written as free-flowing prose, all remaining data had to be gathered manually. During data cleaning, it became clear that the geolocations were inaccurate, so manual comparisons between Ordnance Survey maps (National Library of Scotland, 2024), Google Maps (Google, 2025), Canmore (Historic Environment Scotland, 2025), Railscot (Crawford, 2025), Mapcarta (OpenStreetMap, 2025) and/or Greenhill Historical Society (Greenhill Historical Society, 2022) were made to identify the original location of each works. Once the geolocation and information gathering stages were complete, the data was moved into OpenRefine version 3.6.2 (Open source community, 2012) so that it could be matched to Wikidata. Additional information, such as country and modern administrative entities, was added. Data was uploaded to Wikidata in batches according to the schema in Table 3 below.

Table 3

Scottish Brick History schema for uploads via OpenRefine. Country (Wikidata property P17) was marked as United Kingdom (Wikidata item number Q145) on all entries.

| WIKIDATA STATEMENT LABEL | WIKIDATA PROPERTY (NUMERICAL VALUE) | DESCRIPTION |
|---|---|---|
| Scottish Brick History Brick & Tileworks ID | P8700 | Scottish Brick History project identifier for a brick or tile works in Scotland, without any trailing slash |
| Inception | P571 | Date or point in time when the organisation/object was founded/created |
| Date of official closure | P3999 | Date of official closure of a building or event |
| Canmore ID | P718 | Identifier in the Royal Commission on the Ancient and Historical Monuments of Scotland’s Canmore database |
| Coordinate location | P625 | Coordinates of the subject |
| Instance of | P31 | The class of which this subject is a particular example and member. For example, Anchor Brickworks is an instance of a brickworks |
| Located in the administrative territorial entity | P131 | The item is located on the territory of the following administrative entity |
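
The geolocation checks in Step 2b were made by hand. As a possible aid for anyone replicating the work, the sketch below flags records whose scraped coordinates lie far from a manually verified reference point, using the haversine great-circle distance. This was not part of the project as described; the 0.5 km threshold, the function names, and the coordinates in the usage example are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def needs_review(scraped, reference, threshold_km=0.5):
    """True if the scraped (lat, lon) is further than threshold_km from the reference."""
    return haversine_km(*scraped, *reference) > threshold_km


# Invented coordinates, for illustration only.
scraped_point = (56.00, -3.80)
verified_point = (56.01, -3.80)
print(needs_review(scraped_point, verified_point))
```

A suitable threshold depends on the size of the sites and the precision of the historical maps, so it would need tuning per dataset.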

Sampling strategy

Both datasets were small enough in size that a sampling strategy was not necessary. We were able to collate all memorials and subjects from the Women’s History Scotland website, and all brick, tile, and associated works in Scotland from the Scottish Brick History website.

Quality control

Data was cleaned as described in Steps 2a and 2b. Where identical names produced conflicting matches, additional internet research was conducted to identify the correct match. If no Wikidata item existed, the item was created de novo.

(3) Dataset Description

Repository name

Figshare

Object names

Wikidata Entries for Women of Scotland Memorials and Wikidata Entries for Scottish Brick History

Format names and versions

.csv

Creation dates

2023-11-23 to 2026-01-08 and 2024-04-07 to 2026-01-08 respectively.

Dataset creators

Jennifer Shaw (undergraduate researcher, School of Computer Science, University of St Andrews) created and cleaned the original dataset built from the Women’s History Scotland website. Grace Young (undergraduate researcher, School of Computer Science, University of St Andrews) created and cleaned the original dataset built from the Scottish Brick History website. Sara Thomas (Wikimedia UK Programme Manager, Wikimedia UK) provided training about OpenRefine, proposed the Women of Scotland memorial ID (P8048), Women of Scotland subject ID (P8050), and Scottish Brick History Brick & Tileworks ID (P8700) properties in Wikidata, and performed the batch uploads to Wikidata. Kirsty Ross (project supervisor, School of Computer Science, University of St Andrews) uploaded the datasets to the Figshare repository.

Language

English

License

CC BY 4.0

Publication date

2025-11-10

(4) Reuse Potential

The data is available in the Figshare repositories cited above, but it is also available as Linked Open Data under the Women of Scotland memorial ID (P8048), Women of Scotland subject ID (P8050), and Scottish Brick History Brick & Tileworks ID (P8700) properties in Wikidata. As proof of principle, J.S. and G.Y. reused Wikidata to create accessible, user-tested, web-based visualisations as part of their dissertations, as shown in Figure 1A and B respectively.
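
As a sketch of how the Linked Open Data could be retrieved, the code below builds a SPARQL query for all items carrying one of these identifier properties and converts the coordinate literals that Wikidata returns (WKT text in longitude-latitude order) into latitude-longitude pairs. The endpoint is the public Wikidata Query Service; the function names and the query shape are assumptions, and the block only constructs the request and does not contact the network.

```python
import re
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service


def build_query(property_id, limit=10):
    """SPARQL for items with the given identifier property, plus coordinates if any."""
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return (
        "SELECT ?item ?itemLabel ?externalId ?coord WHERE {\n"
        f"  ?item wdt:{property_id} ?externalId .\n"
        "  OPTIONAL { ?item wdt:P625 ?coord . }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def query_url(property_id, limit=10):
    """Full GET URL requesting JSON results for the query."""
    params = {"query": build_query(property_id, limit), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)


def parse_point(wkt):
    """Convert a WKT literal 'Point(lon lat)' into a (lat, lon) tuple."""
    number = r"(-?\d+(?:\.\d+)?)"
    match = re.fullmatch(rf"Point\(\s*{number}\s+{number}\s*\)", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unexpected coordinate literal: {wkt!r}")
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


print(build_query("P8048", limit=5))
```

Passing P8050 or P8700 instead of P8048 selects the subjects or the brick and tile works. The URL from query_url can be fetched with any HTTP client, ideally one that sends a descriptive User-Agent header.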

Figure 1

Examples of data reuse. (A) J.S. created a Data Explore tab to give headline statistics on the memorials, e.g., how many memorials are on Wikidata, how many women they commemorate, which memorial and subject have the most statements, and the average number of statements on the Wikidata items. (B) G.Y. mapped the brickworks onto contemporary maps from 1883–1903, highlighting the co-localisation of industrial sites with railway lines that were active at the time.
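
Headline statistics of the kind described for panel (A) could be computed along the following lines. This is a sketch and not J.S.'s implementation. It assumes each item is available in Wikidata's entity JSON layout, where "claims" maps a property ID to a list of statements; the item keys and the sample data are invented placeholders.

```python
from statistics import mean


def statement_count(entity):
    """Total number of statements across all properties of one entity."""
    return sum(len(statements) for statements in entity.get("claims", {}).values())


def headline_stats(entities):
    """Summarise a mapping of item ID -> entity JSON."""
    counts = {item_id: statement_count(entity) for item_id, entity in entities.items()}
    return {
        "items": len(counts),
        "mean_statements": mean(counts.values()),
        "most_statements": max(counts, key=counts.get),
    }


# Invented placeholder items; real input would be entity JSON from Wikidata.
SAMPLE = {
    "item-a": {"claims": {"P31": [{}], "P625": [{}], "P186": [{}, {}]}},
    "item-b": {"claims": {"P31": [{}], "P8050": [{}]}},
}

print(headline_stats(SAMPLE))
```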

Adding such datasets to Wikidata enables Wiki editors to enrich each Wikidata item even further by adding additional Wikidata statements (e.g., inscription (P1684), image (P18)) or connecting to other identifiers in Wikidata such as British Listed Buildings ID (P12485). Both datasets on Wikidata already benefit from editors adding other properties and identifier links to other databases, such as translating names into other languages and uploading images to Wikimedia Commons (Figure 2B).

Figure 2

Screenshots taken from the WikiShootMe web tool. (A) focuses on St Giles Cathedral in Edinburgh, which contains several memorials to women. In the pop-up window is the memorial to Jenny Geddes (Q124387850). The coordinates are marked with a green circle, as a Wiki editor has uploaded a photo of the memorial to Wikimedia Commons. (B) focuses on Larbert Fire Clay Works (Wikidata item Q124514935), which was added to Wikidata because of this project. The coordinates are marked with a red circle, as the Wikidata item does not have an image (P18) statement associated with it.

Since the upload to Wikidata, the data has already been incorporated into tools that help Wiki editors add images to Wikimedia Commons. Memorials and brickworks now populate the WikiShootMe map (Manske, 2016), as seen in Figure 2A and B respectively.

In addition to these concrete examples of reuse, there is a plethora of other opportunities inside and outwith the Wiki projects to extend and expand such approaches. Examples include adding biographies about the identified women to the Women in Red project (Wikimedia community, 2025a) or expanding the dataset to include additional data about the individual bricks, which number in the thousands. Academia collaborating with the Wiki projects is not new, as demonstrated by the 120 million words added to Wikipedia across 7,650 classes at higher education institutions via the ‘Wikipedia Student Program’ in the USA and Canada (Wiki Education, 2022), and Wikipedians-in-Residence in universities across the UK (McAndrew & Thomas, 2025). However, these projects primarily focus on filling knowledge gaps on Wikipedia by editing articles. This data paper provides a novel Wikidata mechanism for encouraging collaboration between charities, universities, and the Wiki community to increase access to open knowledge. From a teaching perspective, potential barriers to reuse include the difficulty of developing web scrapers to pull data from the continuous prose found on blogs and websites. A considerable amount of time was required for the initial data cleaning (approximately 10 hours/week for 11 weeks). This will be a common problem for those looking to replicate the project with similar sources of data, but it is manageable with enthusiastic students, time, and a lot of patience.

Notes

[1] https://www.womeninscotland.org.uk/ [Accessed 2026-01-19].

[2] https://www.scottishbrickhistory.co.uk/ [Accessed 2026-01-19].

Acknowledgements

The authors would like to thank our participants from the School of Computer Science at the University of St Andrews, Women’s History Scotland, Glasgow Women’s Library, Wikimedia UK, and Scottish Brick History, who provided valuable user feedback on the two websites which led to substantial improvements.

Competing Interests

S.T. is an employee of Wikimedia UK. The remaining authors have no competing interests to declare.

Author Contributions

  • J.S.: conceptualisation; data curation; formal analysis; investigation; methodology; software; validation; visualisation; writing (review & editing)

  • G.Y.: conceptualisation; data curation; formal analysis; investigation; methodology; software; validation; visualisation; writing (review & editing)

  • M.C.: conceptualisation; data curation; funding acquisition; resources; writing (review & editing)

  • S.T.: conceptualisation; data curation; methodology; project administration; supervision; validation; writing (review & editing)

  • K.R.: conceptualisation; methodology; project administration; supervision; validation; writing (original draft); writing (review & editing)

Jennifer Shaw and Grace Young contributed equally to the datasets and are listed alphabetically by surname.

DOI: https://doi.org/10.5334/johd.471 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 13, 2025 | Accepted on: Jan 9, 2026 | Published on: Feb 25, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jennifer Shaw, Grace Young, Mark Cranston, Sara Thomas, Kirsty S. Ross, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.