HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

By: Sunyam Bagga and Andrew Piper
Open Access | Mar 2022

Figures & Tables

Table 1

The optimal feature-space and hyperparameters obtained using 10-fold cross-validation for our SVM classifier.

| Classifier | Feature space | Probability threshold | SVM hyperparameters |
|---|---|---|---|
| Fiction | top-1k word uni-, bi-, trigrams | 0.75 | C = 1 & Gaussian kernel |
| Non-Fiction | top-100k word uni-, bi-, trigrams | 0.70 | C = 1 & Linear kernel |
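
To make the feature space and probability threshold in Table 1 concrete, here is a minimal standard-library sketch. It covers only n-gram extraction, top-k vocabulary selection, and the thresholding step, not the SVM itself. The whitespace tokenizer, the function names, and the use of `>=` for the threshold comparison are all assumptions made for illustration; the listing does not specify them.

```python
from collections import Counter

# Thresholds as reported in Table 1.
FICTION_THRESHOLD = 0.75
NONFICTION_THRESHOLD = 0.70


def word_ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max (assumed whitespace tokenizer)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(pages, top_k):
    """Keep the top_k most frequent uni-, bi-, and trigrams across all pages."""
    counts = Counter()
    for page in pages:
        counts.update(word_ngrams(page))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(page, vocabulary):
    """Map a page to a count vector over the chosen vocabulary."""
    counts = Counter(word_ngrams(page))
    return [counts[gram] for gram in vocabulary]


def accept_page(probability, threshold):
    """Keep a page only if the classifier probability clears the threshold.

    Whether the original comparison is strict is not stated; >= is an assumption.
    """
    return probability >= threshold


pages = ["the cat sat", "the cat ran"]
vocabulary = build_vocabulary(pages, top_k=2)
vector = vectorize("the cat the", vocabulary)
```

With `top_k=1000` the same helpers would correspond to the fiction row of the table, and with `top_k=100000` to the non-fiction row. The resulting vectors would then be passed to a classifier that outputs a probability.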
Figure 1

The distribution of the number of fiction and non-fiction pages across the two hundred years of our dataset.

Table 2

A preview of the columns and rows in the metadata table released for Hathi-Fiction.

| HTID | Year | Title | Author | Page numbers |
|---|---|---|---|---|
| nyp.33433090062153 | 1947 | Rebel halfback | Archibald, Joe, 1898- | [50, 91, 96, 155, 159] |
| emu.010002632416 | 1852 | The soldier of fortune | Curling, Henry, 1803–1864. | [159, 91, 204, 155, 166] |
| emu.010002426066 | 1886 | Virginia the American | Edwardes, Charles. | [159, 204, 250, 166, 155] |
| emu.010002588974 | 1895 | Moths/by Ouida. | Ouida, 1839–1908 | [166, 155, 250, 204, 159] |
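
As a rough illustration of how rows like those in Table 2 might be read, the sketch below parses two of the preview rows from an in-memory CSV. The CSV layout, the lowercase column names (`htid`, `year`, `title`, `author`, `page_numbers`), and the `load_metadata` helper are hypothetical; the released file may be organized differently.

```python
import ast
import csv
import io

# Two rows copied from the Table 2 preview, in an assumed CSV layout.
SAMPLE = '''htid,year,title,author,page_numbers
nyp.33433090062153,1947,Rebel halfback,"Archibald, Joe, 1898-","[50, 91, 96, 155, 159]"
emu.010002632416,1852,The soldier of fortune,"Curling, Henry, 1803–1864.","[159, 91, 204, 155, 166]"
'''


def load_metadata(handle):
    """Read metadata rows, converting the year to int and the page list to a Python list."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "htid": record["htid"],
            "year": int(record["year"]),
            "title": record["title"],
            "author": record["author"],
            # The page list is stored as a bracketed string; literal_eval parses it safely.
            "pages": ast.literal_eval(record["page_numbers"]),
        })
    return rows


rows = load_metadata(io.StringIO(SAMPLE))
```

Each parsed row then pairs a HathiTrust volume identifier with the page numbers sampled from that volume.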
Figure 2

The distribution of three BookNLP supersense categories – artifacts, contact verbs, motion verbs – for pages sampled from 1800 to 1999. The left column corresponds to our fiction data and the right column is for our non-fiction data.

Figure 3

The distribution of four features from our Enriched Feature set – average sentence length, Tuldava score, NRC positive score, and VADER positive score – across our dataset of fiction pages (red) and non-fiction pages (blue) sampled from 1800 to 1999.

Figure 4

The distribution of % dialog and Tuldava scores for pages sampled from 1800 to 1999. The left column corresponds to the dataset derived from Underwood et al. (2020) and the right column corresponds to our Hathi1M fiction data.

DOI: https://doi.org/10.5334/johd.71 | Journal eISSN: 2059-481X
Language: English
Published on: Mar 11, 2022
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2022 Sunyam Bagga, Andrew Piper, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.