Abstract
This paper introduces a novel AI-assisted pipeline developed to prepare data for the European Research Council-funded project “STEMMA: Systems of Transmitting Early Modern Manuscript Verse, 1475–1700.” Now approaching its midpoint, STEMMA develops and applies a data-driven approach to provide the first comprehensive study of the circulation of early modern English poetry in manuscript. The project began by aggregating and reconciling five of the largest and most authoritative existing datasets about early modern verse circulation. The sheer volume of data, along with the need to preserve early modern English spelling and scribal idiosyncrasies for later analyses, meant that off-the-shelf data-cleaning tools like OpenRefine were not fit for purpose. Instead, our software developer created a staged pipeline to aid the removal of duplicates, the creation of authorities, reconciliation, and the assignment of unique identifiers.
Because this process was developed and deployed rapidly and pragmatically, we did not benchmark it at the time, nor is it feasible to do so retrospectively. This discussion paper nevertheless records observations from the process and reflects on its challenges and bottlenecks as well as its opportunities. It points the way toward future benchmarks that are increasingly needed for novel applications of computational methods in the digital humanities. It also briefly considers the relationship between technical benchmarking and research project management.
