Are We There Yet? Notes Towards Benchmarking an Experimental AI-Assisted Workflow for Humanities Data Cleaning and Reconciliation

Open Access | Mar 2026

Figures & Tables

Table 1

Data sources for the STEMMA project.

| SOURCE | CONTACT | FORMAT | DATA TYPE | DATA VOLUME |
| --- | --- | --- | --- | --- |
| Catalogue of English Literary Manuscripts | John Lavagnino, King's College London | XML | Bibliographic dataset | 103.9 MB (979 XML files) |
| DigitalDonne | Brent Nelson, University of Saskatchewan | CSV | Bibliographic dataset | 1.2 MB (1 table, 4,240 lines) |
| Index of Selected English Poetry Manuscripts, 1590–1660 | Joshua Eckhardt, Virginia Commonwealth University | CSV | Bibliographic dataset | 1.8 MB (1 table, 9,178 lines) |
| Perdita Project: A Database for Early Modern Women’s Manuscript Compilations | Victoria Burke, University of Ottawa | HTML | Bibliographic dataset | 25 MB (500 entries) |
| RECIRC: The Reception and Circulation of Early Modern Women’s Writing 1550–1700 | Marie-Louise Coolahan, University of Galway | CSV | Bibliographic dataset | 100 MB (170 tables) |
| Union First-Line Index of English Verse | Eric Johnson, Folger Shakespeare Library | CSV | Bibliographic dataset | 247.3 MB (1 table, 704,321 rows) |

Figure 1

The first diagram of the STEMMA deduplication pipeline, created by Jan Putzan (Ember Ltd.).

Figure 2

A screenshot of the Terminal application for reviewing “buckets” created by locality-sensitive hashing.

Figure 3

A diagram of the PRISM method created by Jan Putzan (Ember Ltd.).

DOI: https://doi.org/10.5334/johd.490 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 27, 2025 | Accepted on: Feb 20, 2026 | Published on: Mar 18, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Erin A. McCarthy, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.