A Dataset of American Poetry by Poets from Historically Underrepresented Groups in the HathiTrust Digital Library

Gyuri Kang; Kahyun Choi

doi:10.5334/johd.508


(1) Overview

Repository location

CSV files are available via Zenodo (https://doi.org/10.5281/zenodo.18512641); Python file and HTRC download instructions are available in the GitHub repository (https://github.com/krorange/poem-boundary/).

Context

Representing diversity has become increasingly important in digital humanities (DH) research. Beyond English-language literature, many DH studies have examined literary texts in a wide range of languages (Lehmann et al., 2023; Marco et al., 2021; Naaz & Singh, 2022; Saini & Kaur, 2020; Sprugnoli et al., 2023; Timofeeva, 2021; Wessler, 2020). However, even within a single language, literary texts can exhibit substantial variation in style, form, and expression across different ethnic and cultural communities. For example, English is the most widely used language in global book publishing, enabling diverse literary practices across authors and groups.

Despite this diversity, prior DH research on English-language corpora has largely focused on canonical texts by Anglo-American authors. While a small number of studies have begun to examine literary corpora by minority groups in the United States (e.g., Lucy et al., 2025; Parulian et al., 2023; Schug et al., 2025; So, 2020), no existing poetry corpus comprehensively represents multiple marginalized groups in the United States. Creating multicultural corpora can contribute to re-evaluating the English literary canon, which has historically been shaped by racial and gendered ideologies. To address this gap between canonical and non-canonical American literature, we introduce a dataset of American poetry written by poets from historically underrepresented racial and ethnic groups in the United States, including African American (AA), Asian American (APA-AA), Pacific Islander (APA-PA), Latin American (LA), and Native American (NA) poets, sourced from the HathiTrust Digital Library (HathiTrust).

HathiTrust contains approximately 19 million digitized volumes, of which about 8 million are categorized as books. Its holdings have been widely used in DH research spanning fiction, nonfiction, and monographs (Bagga & Piper, 2022; Hamilton & Piper, 2023; Jiang et al., 2021, 2022; Underwood et al., 2020). In contrast, poetry collections within HathiTrust have received comparatively little attention. To assess the coverage of poets from marginalized groups in HathiTrust, we compared the number of poets represented in HathiTrust with those listed on poets.org, a poetry website maintained by the nonprofit Academy of American Poets.

Because our analysis focuses on poet-level coverage, we searched HathiTrust for each poet listed on poets.org. Our results show that HathiTrust includes fewer than half of the poets listed on poets.org in each of the five groups, with coverage rates ranging from 12.00% to 44.01% (see Table 1). Three groups (AA, APA-AA, and NA) have coverage above 40.00%, whereas LA has lower coverage (28.77%) and APA-PA has much lower coverage (12.00%). These results indicate uneven coverage with distinct representation patterns: while APA-PA and NA are the least represented groups on poets.org, LA and APA-PA have the lowest coverage in HathiTrust.

Table 1

Coverage of poets in HathiTrust, compared with that in poets.org.

GROUP	# OF POETS FOUND IN HT	# OF POETS IN POETS.ORG	COVERAGE (%)
AA	136	309	44.01
APA-AA	80	195	41.03
APA-PA	3	25	12.00
LA	42	146	28.77
NA	36	83	43.37
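
The coverage column in Table 1 is the number of poets found in HathiTrust divided by the number listed on poets.org, expressed as a percentage. As a minimal sketch, the figures can be recomputed from the two count columns; the dictionary and function names below are illustrative and are not taken from the released code.

```python
# Counts copied from Table 1: (poets found in HathiTrust, poets listed on poets.org).
TABLE_1 = {
    "AA": (136, 309),
    "APA-AA": (80, 195),
    "APA-PA": (3, 25),
    "LA": (42, 146),
    "NA": (36, 83),
}


def coverage_percent(found: int, listed: int) -> float:
    """Share of poets.org poets also found in HathiTrust, in percent (2 decimals)."""
    return round(100 * found / listed, 2)


coverage = {group: coverage_percent(*counts) for group, counts in TABLE_1.items()}

for group, pct in coverage.items():
    print(f"{group}: {pct:.2f}%")
```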

(2) Method

Steps

Based on the list of poets from underrepresented groups compiled by Choi and Kang (2025), we searched for each poet’s name in HathiTrust to locate their poetry collections. Because some poets publish in multiple languages, we restricted our searches to English-language volumes using HathiTrust’s language filter to enable consistent comparisons across groups.

For each poet, we collected the volume ID of the most recent poetry collection, as newer volumes tend to provide more reliable metadata and higher OCR quality. Poetry collections were generally easy to identify because most volume titles explicitly include the term poems (see Figure 1).

Figure 1

Poetry collection search example in the HathiTrust Digital Library.

In cases where the title did not clearly indicate a poetry collection, we manually inspected the volume by using HathiTrust’s “search in the text” feature. Specifically, we searched for terms such as poems, poetry, and selected poem titles listed on poets.org to verify whether the volume contained poetic content.
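
The selection rule described above (restrict to English-language volumes, keep the most recent collection per poet, and manually inspect titles that do not signal poetry) can be sketched as follows. This is a simplified illustration, not the authors' procedure, which was carried out through the HathiTrust search interface: the `Volume` record, its field names, the language code, and all sample values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    poet: str
    volume_id: str  # placeholder, not a real HathiTrust identifier
    title: str
    year: int
    language: str  # hypothetical language code attached to a search result


def title_suggests_poetry(title: str) -> bool:
    """First-pass check; titles that fail it still need manual inspection."""
    lowered = title.lower()
    return "poems" in lowered or "poetry" in lowered


def latest_english_volume(volumes: list[Volume]) -> dict[str, Volume]:
    """Keep, for each poet, the most recently published English-language volume."""
    latest: dict[str, Volume] = {}
    for vol in volumes:
        if vol.language != "eng":
            continue
        best = latest.get(vol.poet)
        if best is None or vol.year > best.year:
            latest[vol.poet] = vol
    return latest


# Invented search results for two hypothetical poets.
results = [
    Volume("Poet A", "demo.001", "Collected Poems", 1998, "eng"),
    Volume("Poet A", "demo.002", "New and Selected Poems", 2011, "eng"),
    Volume("Poet A", "demo.003", "Poemas escogidos", 2015, "spa"),
    Volume("Poet B", "demo.004", "Salt Water Road", 2004, "eng"),
]

selected = latest_english_volume(results)

# Volumes whose titles do not mention poems or poetry are flagged for a manual look.
needs_manual_check = [
    vol.volume_id for vol in selected.values() if not title_suggests_poetry(vol.title)
]
```

In this toy input, the flagged volume would be the one to verify with the "search in the text" feature before it is accepted as a poetry collection.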

Because most volumes in our corpus are under copyright protection, direct access to the full texts is restricted. For research purposes, limited access is available through the HathiTrust Research Center (HTRC) Data Capsules, which provide “secure computing environments for performing researcher-driven text analysis on the HathiTrust corpus” (HathiTrust Research Center, n.d.). Using a Data Capsule, we downloaded the selected volumes and extracted them as individual page-level text files.

For each volume, we manually identified poems that appeared on a single page. Pages containing multiple poems were excluded to ensure consistent sectionalization across the dataset. In addition to the page numbers of the digitized copies in the Data Capsule, we include the page numbers of the OCR-scanned print books, when available, for users who may consult print copies.
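
To make the page-level workflow concrete, the sketch below keeps only pages annotated as holding exactly one poem and reads the matching page-level text file. The directory layout, the zero-padded file naming, and the column names (`volume_id`, `page`, `poems_on_page`) are assumptions made for illustration; they may not match the released CSV files or the files produced inside a Data Capsule. Poems that run across several pages are assumed to be absent from the annotation rows.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical annotation rows; the column names are assumptions, not the released schema.
ANNOTATIONS_CSV = """volume_id,page,poems_on_page
demo_volume,7,1
demo_volume,8,2
demo_volume,9,1
"""


def single_poem_pages(csv_text: str) -> list[tuple[str, int]]:
    """Return (volume_id, page) pairs for pages that hold exactly one poem."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["volume_id"], int(row["page"]))
        for row in rows
        if int(row["poems_on_page"]) == 1
    ]


def read_page(root: Path, volume_id: str, page: int) -> str:
    """Read one page-level text file, assuming <root>/<volume_id>/<8-digit page>.txt."""
    return (root / volume_id / f"{page:08d}.txt").read_text(encoding="utf-8")


# Build a throwaway directory of placeholder pages so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo_volume").mkdir()
    for page in (7, 8, 9):
        page_file = root / "demo_volume" / f"{page:08d}.txt"
        page_file.write_text(f"placeholder text for page {page}\n", encoding="utf-8")

    kept = single_poem_pages(ANNOTATIONS_CSV)
    poems = {key: read_page(root, *key) for key in kept}
```

Page 8 is dropped because it is marked as containing two poems, mirroring the exclusion rule stated above.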

Sampling strategy

Our selection of poets follows the groupings proposed by Choi and Kang (2025), who annotated and published race and ethnicity metadata for poets in an American poetry collection curated by poets.org. Based on descriptive tags used by poets.org, including designations associated with cultural and heritage observances such as Asian/Pacific American Heritage Month, Black History Month, and Native American Heritage Month, they identified the five groups. Multiracial poets are counted in multiple groups to reflect their affiliation with more than one ethnic community.

Due to variation in availability across groups within HathiTrust, the number of accessible volumes and poems differs by group. Within a defined time period, we randomly selected poetry volumes and manually annotated poem boundaries to identify poems within each volume. To improve representation among groups with smaller holdings, we prioritized annotation for Pacific Islander and Native American poets. Following boundary annotations, we extracted a diverse set of poems across the selected volumes.

To better understand the dataset, we provide summary statistics on the number of available volumes, boundary-annotated volumes, and identified poems for each group (see Table 2).

Table 2

Number of volumes and poems per group.

GROUP	# OF VOLUMES	# OF BOUNDARY-ANNOTATED VOLUMES	# OF IDENTIFIED POEMS
AA	136	40	3,380
APA-AA	80	22	1,660
APA-PA	3	3	298
LA	42	17	1,563
NA	36	31	2,420

African American poets constitute the largest group in the dataset: we annotated poem boundaries for 40 of the 136 available volumes and identified 3,380 poems. Pacific Islander poets form the smallest group, represented by three volumes in HathiTrust, resulting in a total of 298 poems.
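
Two ratios that follow directly from Table 2 may help when reading it: the share of available volumes that were boundary-annotated, and the average number of identified poems per annotated volume. The sketch below derives both from the table's counts; the names are illustrative and the ratios themselves are not reported in the table.

```python
# Counts copied from Table 2: (available volumes, boundary-annotated volumes, identified poems).
TABLE_2 = {
    "AA": (136, 40, 3380),
    "APA-AA": (80, 22, 1660),
    "APA-PA": (3, 3, 298),
    "LA": (42, 17, 1563),
    "NA": (36, 31, 2420),
}


def summarize(volumes: int, annotated: int, poems: int) -> dict[str, float]:
    """Derive the annotated share and the mean poems per annotated volume."""
    return {
        "annotated_share": annotated / volumes,
        "poems_per_annotated_volume": poems / annotated,
    }


summary = {group: summarize(*row) for group, row in TABLE_2.items()}

for group, stats in summary.items():
    print(
        f"{group}: {stats['annotated_share']:.1%} of volumes annotated, "
        f"{stats['poems_per_annotated_volume']:.1f} poems per annotated volume"
    )
```

With these counts, APA-PA and NA have the largest annotated shares (3 of 3 and 31 of 36 volumes), consistent with the prioritization described in the sampling strategy. Because multiracial poets are counted in more than one group, totals summed across groups may count some volumes or poems more than once.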

Quality control

The group assignments for underrepresented poets were manually verified using descriptive tags from poets.org, following the procedure outlined in Choi and Kang (2025), and supplemented with targeted web searches. During this verification process, we consulted representative websites, including poets’ official websites, Wikipedia, and the Poetry Foundation, to confirm that each poet was categorized into the appropriate racial and ethnic groups.

(3) Dataset Description

Repository name

A Dataset of American Poetry by Poets from Historically Underrepresented Groups in the HathiTrust Digital Library.

Object name

htrc_poetry_sections/aa_poets

htrc_poetry_sections/apa-aa_poets

htrc_poetry_sections/apa-pa_poets

htrc_poetry_sections/lxa_poets

htrc_poetry_sections/na_poets

poem-boundary.py

Format names and versions

CSV files and a Python file

Creation dates

Start date: 2024-06-01; End date: 2025-11-15

Dataset creators

Gyuri Kang (Indiana University Bloomington); Kahyun Choi (University of Illinois Urbana-Champaign).

Language

English

License

CC BY 4.0

Publication date

2026-02-11

(4) Reuse Potential

This dataset is designed to support computational analyses of American poetry and to address gaps in computational literary research. Prior DH efforts have sought to represent “historically under-resourced and marginalized textual communities” by constructing datasets focused on African American literature, Native American texts, Black fantastic writing, Latin American fiction, and African American health documents (HathiTrust Research Center, n.d.). Such datasets have been used to evaluate the effectiveness of text mining tools, such as Named Entity Recognition, on non-canonical literary texts (Parulian et al., 2023).

Beyond tool evaluation, the dataset enables the investigation of distinctive linguistic, stylistic, and thematic patterns across racial and ethnic groups of American poets. Future research may employ the dataset for large-scale comparative analyses across groups or for focused studies of specific communities using a range of computational methods. Existing computational poetry research has examined structural aspects of poetry, including rhythm, meter, and stanza (Marco et al., 2021; Naaz & Singh, 2022), syntactic and semantic features (Shang & Underwood, 2024; Timofeeva, 2021), and sentiment and emotion in poetic language (Saini & Kaur, 2020; Sprugnoli et al., 2023).

Several limitations should be considered when reusing this dataset. Firstly, some racial and ethnic groups of American poets are not included. Group selection was informed by comparisons between poets.org data and the 2020 U.S. Census (Choi & Kang, 2025), and therefore, the dataset does not represent all racial and ethnic communities in the United States. Incorporating additional groups, such as Alaskan American or Arab American poets, would improve the dataset’s coverage and representativeness. Secondly, variation in corpus size across groups may introduce bias in comparative analyses. We recommend balancing the number of poems per group or applying appropriate normalization techniques when conducting cross-group comparisons.
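
One simple way to follow the balancing recommendation is to downsample every group to the size of the smallest one before comparing groups. The sketch below is one possible approach under stated assumptions, not a prescribed procedure: the group sizes come from Table 2, while the poem identifiers are placeholders standing in for rows of the released CSV files, and the function name and seed are illustrative.

```python
import random

# Number of identified poems per group, copied from Table 2.
GROUP_SIZES = {"AA": 3380, "APA-AA": 1660, "APA-PA": 298, "LA": 1563, "NA": 2420}

# Placeholder poem identifiers; real analyses would load rows from the CSV files instead.
poems_by_group = {
    group: [f"{group}_{i:04d}" for i in range(size)]
    for group, size in GROUP_SIZES.items()
}


def downsample_to_smallest(groups: dict[str, list[str]], seed: int = 0) -> dict[str, list[str]]:
    """Randomly sample, without replacement, as many items per group as the smallest group has."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    target = min(len(items) for items in groups.values())
    return {group: rng.sample(items, target) for group, items in groups.items()}


balanced = downsample_to_smallest(poems_by_group)

for group, sample in balanced.items():
    print(group, len(sample))
```

Downsampling discards most poems from the larger groups, so repeating the analysis over several seeds, or normalizing measurements instead (for example, reporting rates per poem or per token), may be preferable depending on the research question.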

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Gyuri Kang: data curation, software, formal analysis, investigation, visualization, methodology, writing – original draft, review & editing.

Kahyun Choi: conceptualization, data curation, methodology, investigation, writing – original draft (partial), review & editing, supervision, project administration, funding acquisition.