The Shakespeare’s World Crowdsourced Transcription Project Datasets

Open Access
|Nov 2024

(1) Overview

Repository location

https://doi.org/10.7910/DVN/PNTOAF, Folger Shakespeare Library Dataverse.

Context

In 2015, the crowdsourcing platform Zooniverse.org, the Folger Shakespeare Library, and the Oxford English Dictionary launched Shakespeare’s World (SW), a project aimed at transcribing manuscripts created around Shakespeare’s lifetime. 14,330 single- and double-page-spread images were included:

  • 2,050 images of recipes from 29 culinary and medicinal receipt codices;

  • 6,147 images of letters representing all pre-1700 letters at the Folger in 2015;

  • 6,133 images of ‘Newdigate Newsletters’: handwritten newssheets sent to three generations of the Newdigate family of Arbury Hall, Warwickshire, between 1674 and 1715. Only 4,431 images from this collection received transcriptions before SW concluded in October 2019.

SW received contributions from 3,926 registered Zooniverse volunteers, and more than 94,570 anonymous contributions representing an unknown number of individuals (Figure 1).

Figure 1

Treemap of total classifications contributed per volunteer throughout Shakespeare’s World. Numeric strings are individual registered volunteers’ Zooniverse IDs.

This paper describes a cleaned dataset of each individual volunteer transcription (IVT), and three additional social datasets to contextualize the project.

SW was designed to facilitate transcription by amateurs (Van Hyning, 2019a; Van Hyning & Wolfe, 2024; a video of the site is available through Van Hyning, 2019b). Volunteers transcribed small text strings by placing dots around a character, word, or line on a digitized manuscript. The first dot marked the start of the transcription string, while the second dot opened a transcription box with buttons for preset brevigraphs (scribal abbreviations) and for expansion, deletion, insertion, unclear, and superscript tags (Dingman, 2015; Van Hyning, 2019b; Van Hyning & Jones, 2021). Participants were encouraged to transcribe only what they felt confident they understood (Van Hyning, 2016).

SW, like all Zooniverse projects before 2017, relied on three or more people to independently perform a task on the same object; these independent task responses then needed to be compared and aggregated into a single reading. SW deployed a real-time aggregation protocol developed at Zooniverse (Hines, 2017) to align and compare different text strings within individual lines. The IVT dataset presented here reveals that, due to aggregation failures, some subjects continued to be presented to volunteers, so some lines and pages were transcribed by more volunteers than the threshold initially chosen for retirement required. The exact reasons for these failures will be explored in a future publication.
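As a rough illustration of how such over-transcribed subjects might be located in the IVT data, the sketch below counts distinct volunteers per subject and reports subjects that exceed a retirement threshold. The field names `subject_id` and `user_id` and the threshold value are assumptions made for this example, not values confirmed by the dataset documentation; rows without a user ID would need separate handling.

```python
from collections import defaultdict

# Hypothetical threshold: the text says only that three or more volunteers were required.
RETIREMENT_THRESHOLD = 3


def over_threshold(rows, threshold=RETIREMENT_THRESHOLD):
    """Map each subject to its distinct-volunteer count, keeping only counts above threshold."""
    volunteers = defaultdict(set)
    for row in rows:
        # A set ignores repeat classifications by the same volunteer.
        volunteers[row["subject_id"]].add(row["user_id"])
    return {sid: len(users) for sid, users in volunteers.items() if len(users) > threshold}


# Invented sample rows, for illustration only.
rows = [
    {"subject_id": "s1", "user_id": "101"},
    {"subject_id": "s1", "user_id": "102"},
    {"subject_id": "s1", "user_id": "103"},
    {"subject_id": "s1", "user_id": "104"},
    {"subject_id": "s1", "user_id": "104"},  # repeat visit by the same volunteer
    {"subject_id": "s2", "user_id": "101"},
    {"subject_id": "s2", "user_id": "102"},
]
flagged = over_threshold(rows)
```

Counting distinct volunteers rather than raw rows keeps a single volunteer's repeated classifications from inflating the total.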

SW engaged volunteers through the SW-Talk forum, a social and informational space where any logged-in user could post comments and add hashtags to specific pages. We also used a blog to provide updates. The social datasets include:

  1. SW-Talk-hashtags: used by volunteers and the SW team to tag pages, predominantly with research-oriented tags requested by the SW team and guest researchers, such as #paper, #catholic, #womanwriter.

  2. SW-Talk-comments: Text and metadata for public conversations.

  3. SW-Blog: Ten SW team members and volunteers wrote 39 blog posts about project progress, discoveries, and palaeography tips between 9 December 2015 and 25 September 2019. Ten comments were posted by readers. The blog exemplifies the SW communication strategy of sharing news and engaging volunteers.

By default, the Talk forum remains live after a Zooniverse project is paused or completed. SW volunteer transcriptions concluded on 4 October 2019, but Folger staff and volunteers continued using SW-Talk to coordinate transcription efforts on other platforms, including Dromio and FromThePage, which integrate more easily with Folger digital resources than Shakespeare’s World data outputs do (Van Hyning & Jones, 2021). The SW-Talk-hashtags and SW-Talk-comments datasets cover the period between 2015 and 2023.

(2) Method

Steps

IVT

In July 2023, we requested a ‘classification export’ through the Zooniverse platform; it contained 203,390 rows and 14 fields (columns). Between July 2023 and March 2024, we used OpenRefine (3.7.9) for initial data exploration and cleaning, and Python (3.9.18) with pandas DataFrames for further transformation. To make the dataset clearer and more usable for people with limited knowledge of the project, to ensure data accuracy, and to protect volunteer privacy, we made the following changes to the IVT fields listed below:

  • annotation: The annotation field encapsulated all four volunteer classifications under distinct task labels: T0 (identifying if the manuscript page is blank, recorded as a Boolean value), T2 (annotating graphical and text elements with coordinates, element types, and text transcriptions), T3 (indicating if the subject is fully transcribed, recorded as a Boolean value), and init (used only during the testing phases, classifying the subject as text-only, image-only, or mixed). We split the original field into four fields, one per task type. This adjustment preserved the structure of each individual contribution while enhancing the dataset’s overall comprehensibility and searchability.

  • metadata: The original metadata field contained session-related information. We extracted the attributes started_at, finished_at, and user_language into independent fields and omitted attributes used exclusively during testing phases, namely subject_dimensions, viewport, and utc_offset.

  • subject_data: This field comprised both Zooniverse subject retirement information and Folger collections metadata, namely authors, folio/page numbers, and origin. We standardized attribute names and isolated subject retirement information into independent fields. In collaboration with Emily Wahl, Folger Metadata and Digital Asset Management Librarian, we updated the Folger catalog and manuscript image URLs after a 2024 digital content migration from Hamnet to TIND (Swierczek, 2022) and from LUNA to Islandora (Wahl, 2024), respectively.

  • username: To prevent exposure of volunteers’ personal information (names or email addresses in usernames) and the attribution of transcriptions to identifiable individuals, we replaced all usernames with Zooniverse-generated IDs.
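The annotation and metadata changes described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pandas code. It assumes that the raw `annotation` field holds a JSON list of objects with `task` and `value` keys and that `metadata` holds a JSON object; the output names `annotation_T0` and so on, and the sample row, are hypothetical.

```python
import json

TASKS = ("T0", "T2", "T3", "init")
KEPT_METADATA = ("started_at", "finished_at", "user_language")


def flatten_row(row):
    """Split one raw export row into per-task and per-attribute fields."""
    annotations = json.loads(row["annotation"])
    metadata = json.loads(row["metadata"])
    # Carry every other field over unchanged.
    flat = {k: v for k, v in row.items() if k not in ("annotation", "metadata")}
    # One field per task; None where the volunteer did not perform that task.
    by_task = {a["task"]: a.get("value") for a in annotations}
    for task in TASKS:
        flat[f"annotation_{task}"] = by_task.get(task)
    # Keep only the three session attributes; testing-only attributes are dropped.
    for key in KEPT_METADATA:
        flat[key] = metadata.get(key)
    return flat


# Invented sample row (assumed structure), for illustration only.
raw_row = {
    "classification_id": "1",
    "annotation": json.dumps([
        {"task": "T0", "value": False},
        {"task": "T2", "value": [{"x": 10, "y": 20, "type": "text", "text": "Take a pint"}]},
        {"task": "T3", "value": True},
    ]),
    "metadata": json.dumps({
        "started_at": "2016-01-05T10:00:00Z",
        "finished_at": "2016-01-05T10:04:30Z",
        "user_language": "en",
        "viewport": {"width": 1280, "height": 720},
        "utc_offset": "0",
    }),
}
flat_row = flatten_row(raw_row)
```

Keeping each task's value intact in its own field preserves the structure of the individual contribution while making each task type searchable on its own.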

Quality control

We also validated the IVT dataset by looking for missing, inconsistent, or duplicated data. During our preliminary analysis, we identified several anomalies in dates and volunteer session durations. We kept these entries in the IVT dataset so that their other recorded fields, such as classifications and transcriptions, remain usable. Time-sensitive analyses need additional steps to ensure data validity. For example, we flagged and excluded time anomalies in our own analysis to assess more accurately the number of subjects each volunteer contributed over time (Wang, 2024).
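One possible way to flag such time anomalies before a time-sensitive analysis is sketched below. The criteria are assumptions chosen for illustration: the four-hour session cutoff is hypothetical, and the project window is taken from the creation dates reported in this paper. The rules actually applied in Wang (2024) may differ.

```python
from datetime import datetime, timedelta, timezone

PROJECT_START = datetime(2015, 12, 8, tzinfo=timezone.utc)
PROJECT_END = datetime(2019, 10, 5, tzinfo=timezone.utc)  # exclusive bound: end of 4 October 2019
MAX_SESSION = timedelta(hours=4)  # hypothetical cutoff, not taken from the paper


def parse_ts(value):
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' or a missing zone as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def time_anomaly(started_at, finished_at):
    """Return a short reason string if the timestamps look anomalous, else None."""
    try:
        start, end = parse_ts(started_at), parse_ts(finished_at)
    except (AttributeError, ValueError):
        return "unparseable timestamp"
    if end < start:
        return "finished before started"
    if end - start > MAX_SESSION:
        return "implausibly long session"
    if not (PROJECT_START <= start < PROJECT_END):
        return "outside project period"
    return None
```

Returning a reason rather than dropping the row mirrors the approach described above: entries stay in the dataset and are excluded only from analyses that depend on timing.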

Steps: SW-Talk-hashtags and SW-Talk-comments

We downloaded both datasets from Zooniverse on 23 September 2023. We removed user_login from Talk hashtags and comment_user_login from Talk comments to prevent cross-referencing of usernames and volunteer user IDs with the IVT dataset, protecting volunteer privacy across all publicly available Shakespeare’s World datasets.
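A minimal sketch of this kind of field removal, assuming each Talk export is a JSON list of flat records. The field names `user_login` and `comment_user_login` come from the text above; `comment_id`, `comment_body`, and `name` are hypothetical stand-ins for whatever other fields the exports contain.

```python
import json


def strip_fields(records, fields):
    """Return copies of the records without the named fields."""
    return [{k: v for k, v in rec.items() if k not in fields} for rec in records]


# Invented sample records, for illustration only.
comments = json.loads("""[
  {"comment_id": 1, "comment_user_login": "jane.doe", "comment_body": "Is this a recipe?"},
  {"comment_id": 2, "comment_user_login": "scribe42", "comment_body": "Tagged #paper"}
]""")
hashtags = [{"name": "paper", "user_login": "scribe42"}]

clean_comments = strip_fields(comments, {"comment_user_login"})
clean_hashtags = strip_fields(hashtags, {"user_login"})
```

Building new dictionaries instead of deleting keys in place leaves the loaded originals untouched, which makes it easy to compare the data before and after anonymization.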

Steps: SW-Blog

We exported the blog from WordPress on 7 July 2023. We removed author_login, author_email, post_password, comment_author_email, and comment_author_IP to prevent potentially sensitive or personally identifiable information from being published. We also removed some unused fields, such as category_parent and term_parent, to streamline the data file.
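The removal of these elements could be sketched with the standard library's ElementTree as shown below. This is an illustrative approach, not the authors' documented procedure. It assumes the export is a WordPress WXR file whose fields sit in the `wp:` namespace (the 1.2 namespace URI is assumed here), and the sample document is invented.

```python
import xml.etree.ElementTree as ET

WP_NS = "http://wordpress.org/export/1.2/"  # assumed WXR namespace version
REMOVED = {
    "author_login", "author_email", "post_password",
    "comment_author_email", "comment_author_IP",
    "category_parent", "term_parent",
}


def scrub(root):
    """Remove every element whose local name is in REMOVED; return the number removed."""
    removed = 0
    # Materialize the traversal first so that removals cannot disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            local_name = child.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
            if local_name in REMOVED:
                parent.remove(child)
                removed += 1
    return removed


# Invented sample export, for illustration only.
SAMPLE = f"""<rss xmlns:wp="{WP_NS}"><channel>
<wp:author><wp:author_login>jdoe</wp:author_login><wp:author_email>j@example.org</wp:author_email><wp:author_display_name>J. Doe</wp:author_display_name></wp:author>
<item><title>Post</title><wp:post_password></wp:post_password>
<wp:comment><wp:comment_author_email>r@example.org</wp:comment_author_email><wp:comment_author_IP>192.0.2.1</wp:comment_author_IP><wp:comment_content>Nice</wp:comment_content></wp:comment>
</item></channel></rss>"""

root = ET.fromstring(SAMPLE)
removed_count = scrub(root)
scrubbed_xml = ET.tostring(root, encoding="unicode")
```

Matching on the local name rather than the full qualified tag keeps the sketch usable if the export declares a different WXR namespace version.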

Alongside the XML export, the blog posts were also archived by Archive.org. This backup presents blog content as it originally appeared and offers a more reader-friendly alternative through the Wayback Machine.

(3) Dataset Description

Repository name

Folger Shakespeare Library Dataverse

Object name

The Shakespeare’s World Crowdsourced Transcription Project Datasets

SW-IVT: sw-IVT-dataset-v1–2024_07_12.csv

SW-Talk-hashtags: sw-talk_tags-v1–2024_07_23.json

SW-Talk-comments: sw-talk_comments-v1–2024_07_23.json

SW-Blog: sw-blogs-v1–2023_08_15.xml

Format names and version

.csv, .json, and .xml

All current Shakespeare’s World datasets are provided as Version 1.0. Any future updates will be released as subsequent versions.

Creation dates

The IVT transcriptions, Talk, and Blog data were originally generated between 8 December 2015 and 4 October 2019. IVT data was exported in July 2023 and cleaned between July 2023 and June 2024. Blog and Talk data were anonymized in July 2024. See file names for the final file creation date.

Dataset creators

Creators are listed in alphabetical order within collaborating institutions or groups during the project period (2015–2019):

Folger Shakespeare Library, Washington, DC

  • Paul Dingman (Project administration; Data curation)

  • Michael Poston (Data curation; Formal Analysis; Investigation; Methodology; Validation; Software)

  • Sarah Powell (Data curation; Formal Analysis; Investigation; Methodology; Validation)

  • Julie Swierczek (Data curation)

  • Elizabeth Tobey (Data curation)

  • Emily Wahl (Data curation; Methodology; Validation; Software)

  • Heather Wolfe (Co-Primary Investigator; Conceptualization; Data curation; Funding acquisition; Investigation; Project administration; Resources; Supervision; Validation)

Oxford English Dictionary, Oxford UK

  • Philip Durkin (Formal Analysis; Investigation; Resources; Supervision)

  • James McKracken (Data curation; Methodology)

Shakespeare’s World Guest Researchers

  • Nina Lamal, postdoctoral researcher, University of Antwerp, Belgium (Investigation)

  • Elaine Leong, research scholar, Max Planck Institute for the History of Science, Berlin, Germany, Early Modern Recipes Online Collective (EMROC) member, (Investigation)

  • Jennifer Munroe, Professor of English, University of North Carolina at Charlotte, United States, Early Modern Recipes Online Collective (EMROC) member, (Investigation)

  • Lisa Smith, lecturer in digital history, University of Essex, United Kingdom, Early Modern Recipes Online Collective (EMROC) member, (Investigation)

University of Maryland, College of Information, MD, USA

  • Victoria Van Hyning (Co-PI; Data curation; Funding acquisition; Investigation; Methodology; Supervision; Validation; Writing-original draft; Writing-review & editing)

  • ZhiCheng Wang (Data curation; Investigation; Methodology; Validation; Writing-original draft; Writing-review & editing)

Zooniverse.org

  • Campbell Allen (Software; Methodology)

  • Simone Duca (Software; Methodology)

  • Greg Hines (Data curation; Software; Methodology)

  • Roger Hutchings (Software; Methodology)

  • Coleman Krawczyk (Data curation; Software; Methodology)

  • Chris Lintott (Conceptualization; Methodology; Supervision)

  • Jim O’Donnel (Software; Methodology)

  • Victoria Van Hyning (Co-PI; Conceptualization; Funding acquisition; Methodology; Project administration; Supervision; Validation)

  • Zooniverse volunteers registered and anonymous (Commenting; Tagging; Transcription), with special thanks to our Zooniverse volunteer moderators @mutabilitie, @Christoferos, @parsfan, and @jules.

Languages

Early modern English

Early modern Latin

Early modern French

License

CC BY-SA

Publication date

2024-09-09

(4) Reuse Potential

SW data holds potential for many domains, such as the history of science and food, linguistics, and Handwritten Text Recognition, and as base texts for scholarly editions. Uses to date include Leong’s (2018) inclusion of recipes tagged ‘#paper’ by volunteers on Talk; new words for the OED (Durkin 2015, 2017, 2018); and the interpretation and baking of ‘taffety tarts’ by Great British Bake-Off contestant Mary-Anne Boermans (Boermans 2019; Van Hyning & Jones, 2021). Rohden et al. (2019) analyzed SW Talk networks, and Kullenberg (2020) created a Graphical User Interface analyzer to support SW Talk network analysis.

The IVT dataset may support analysis of palaeography skills acquisition, and Machine Learning and Handwritten Text Recognition corpus development (Wang & Van Hyning, 2024).

Notes

Funding Information

This publication uses data generated via the Zooniverse.org platform, developed with support from a Google Global Impact Award and the Alfred P. Sloan Foundation. Shakespeare’s World was funded in part by the Institute of Museum and Library Services grant ‘Folger Early Modern Manuscripts Online (EMMO)’, LG-05-13-0353-13. This data paper was partly supported by a Shakespeare Association of America, Folger Fellowship (June 2023) for Van Hyning’s ‘Preparing and publishing Shakespeare’s World data for further use and reuse’.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

VVH: Conceptualization; Data curation; Funding acquisition; Investigation; Methodology; Supervision; Validation; Writing – original draft.

ZCW: Data curation; Methodology; Validation; Software; Writing – original draft.

DOI: https://doi.org/10.5334/johd.237 | Journal eISSN: 2059-481X
Language: English
Submitted on: Aug 19, 2024
Accepted on: Oct 29, 2024
Published on: Nov 22, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Victoria Van Hyning, ZhiCheng Wang, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.