Are We There Yet? Notes Towards Benchmarking an Experimental AI-Assisted Workflow for Humanities Data Cleaning and Reconciliation

Open Access | Mar 2026

DOI: https://doi.org/10.5334/johd.490 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 27, 2025 | Accepted on: Feb 20, 2026 | Published on: Mar 18, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Erin A. McCarthy, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.