Abstract
This paper introduces a novel AI-assisted pipeline developed to prepare data for the European Research Council-funded project “STEMMA: Systems of Transmitting Early Modern Manuscript Verse, 1475–1700.” Now approaching its midpoint, STEMMA develops and applies a data-driven approach to provide the first comprehensive study of the circulation of early modern English poetry in manuscript. The project began by aggregating and reconciling five of the largest and most authoritative existing datasets about early modern verse circulation. The sheer volume of data, along with the need to preserve early modern English spelling and scribal idiosyncrasies for later analyses, meant that off-the-shelf data-cleaning tools like OpenRefine were not fit for purpose. Instead, our software developer created a staged pipeline to aid the removal of duplicates, the creation of authorities, reconciliation, and the assignment of unique identifiers.
Because this process was developed and deployed rapidly and pragmatically, we did not benchmark it at the time, nor is it feasible to do so retrospectively. This discussion paper nevertheless records observations from the process and reflects on its challenges and bottlenecks as well as its opportunities. It points the way toward future benchmarks that are increasingly needed for novel applications of computational methods in the digital humanities. It also briefly considers the relationship between technical benchmarking and research project management.
