Table 1
Data sources for the STEMMA project.
| SOURCE | CONTACT | FORMAT | DATA TYPE | DATA VOLUME |
|---|---|---|---|---|
| Catalogue of English Literary Manuscripts | John Lavagnino, King’s College London | XML | Bibliographic dataset | 103.9 MB (979 XML files) |
| DigitalDonne | Brent Nelson, University of Saskatchewan | CSV | Bibliographic dataset | 1.2 MB (1 table, 4,240 lines) |
| Index of Selected English Poetry Manuscripts, 1590–1660 | Joshua Eckhardt, Virginia Commonwealth University | CSV | Bibliographic dataset | 1.8 MB (1 table, 9,178 lines) |
| Perdita Project: A Database for Early Modern Women’s Manuscript Compilations | Victoria Burke, University of Ottawa | HTML | Bibliographic dataset | 25 MB (500 entries) |
| RECIRC: The Reception and Circulation of Early Modern Women’s Writing 1550–1700 | Marie-Louise Coolahan, University of Galway | CSV | Bibliographic dataset | 100 MB (170 tables) |
| Union First-Line Index of English Verse | Eric Johnson, Folger Shakespeare Library | CSV | Bibliographic dataset | 247.3 MB (1 table, 704,321 rows) |

Figure 1
The first diagram of the STEMMA deduplication pipeline, created by Jan Putzan (Ember Ltd.).

Figure 2
A screenshot of the Terminal application for reviewing “buckets” created by locality-sensitive hashing.

Figure 3
A diagram of the PRISM method, created by Jan Putzan (Ember Ltd.).
