A Dataset of American Poetry by Poets from Historically Underrepresented Groups in the HathiTrust Digital Library

By: Gyuri Kang and Kahyun Choi
Open Access | Mar 2026

DOI: https://doi.org/10.5334/johd.508 | Journal eISSN: 2059-481X
Language: English
Submitted on: Jan 7, 2026 | Accepted on: Feb 14, 2026 | Published on: Mar 6, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Gyuri Kang, Kahyun Choi, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.