Abstract
This dataset provides a collection of American poetry, drawn from the HathiTrust Digital Library, by poets from historically underrepresented groups. It comprises 9,321 poems from 113 collections by 40 African American, 22 Asian American, 3 Pacific Islander, 17 Latin American, and 31 Native American poets. We identified and recorded the start and end page numbers for each poem and released the annotations in CSV files. The dataset also reveals imbalances in the representation of poets from historically underrepresented groups within the HathiTrust corpus. We expect this dataset to support large-scale poetry analysis, help uncover biases in natural language processing (NLP) models, enable assessment of those models' robustness when applied to culturally diverse poetic language, and promote the development of more inclusive models for diverse American poetry communities.
