HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

Figures & Tables

The optimal feature space and hyperparameters for each of our SVM classifiers, obtained using 10-fold cross-validation.

CLASSIFIER	FEATURE SPACE	PROBABILITY THRESHOLD	SVM HYPERPARAMETERS
Fiction	top-1k word uni-, bi-, and trigrams	0.75	C = 1, Gaussian (RBF) kernel
Non-Fiction	top-100k word uni-, bi-, and trigrams	0.70	C = 1, linear kernel
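
To make the feature-space and threshold columns concrete, the following is a minimal sketch, not the authors' pipeline. It assumes whitespace tokenization, lowercasing, ranking of n-grams by raw corpus frequency, and an inclusive threshold comparison; the table specifies none of these. The names CONFIG, word_ngrams, build_vocabulary, and accept are hypothetical.

```python
from collections import Counter

# Values copied from the table; the dictionary layout is an assumption.
CONFIG = {
    "fiction": {"top_k": 1_000, "threshold": 0.75, "C": 1, "kernel": "gaussian"},
    "non_fiction": {"top_k": 100_000, "threshold": 0.70, "C": 1, "kernel": "linear"},
}


def word_ngrams(text, n_max=3):
    """Return the word uni-, bi-, and trigrams of a page as space-joined strings."""
    tokens = text.lower().split()  # assumed tokenization: lowercase, split on whitespace
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(pages, top_k):
    """Keep the top_k most frequent n-grams across all pages (assumed ranking rule)."""
    counts = Counter()
    for page in pages:
        counts.update(word_ngrams(page))
    return [gram for gram, _ in counts.most_common(top_k)]


def accept(probability, classifier):
    """Accept a page only if the classifier's probability reaches its threshold."""
    return probability >= CONFIG[classifier]["threshold"]
```

For example, a page scored at 0.72 would pass the Non-Fiction threshold (0.70) but not the Fiction threshold (0.75) under this reading of the table.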

A preview of the columns and rows in the metadata table released for Hathi-Fiction.

HTID	YEAR	TITLE	AUTHOR	PAGE NUMBERS
nyp.33433090062153	1947	Rebel halfback	Archibald, Joe, 1898-	[50, 91, 96, 155, 159]
emu.010002632416	1852	The soldier of fortune	Curling, Henry, 1803–1864.	[159, 91, 204, 155, 166]
emu.010002426066	1886	Virginia the American	Edwardes, Charles.	[159, 204, 250, 166, 155]
emu.010002588974	1895	Moths/by Ouida.	Ouida, 1839–1908	[166, 155, 250, 204, 159]
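
As an illustration of how rows shaped like this preview might be loaded, here is a minimal sketch. It assumes the metadata is distributed as a tab-separated file with the header shown above and that the PAGE NUMBERS column is serialized as a bracketed list; the actual release format may differ. SAMPLE and read_metadata are hypothetical names.

```python
import ast
import csv
import io

# Two rows copied from the preview; the tab-separated layout is an assumption.
SAMPLE = (
    "HTID\tYEAR\tTITLE\tAUTHOR\tPAGE NUMBERS\n"
    "nyp.33433090062153\t1947\tRebel halfback\tArchibald, Joe, 1898-\t[50, 91, 96, 155, 159]\n"
    "emu.010002632416\t1852\tThe soldier of fortune\tCurling, Henry, 1803–1864.\t[159, 91, 204, 155, 166]\n"
)


def read_metadata(handle):
    """Parse tab-separated metadata rows, converting YEAR and PAGE NUMBERS to Python types."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["YEAR"] = int(row["YEAR"])
        # literal_eval safely turns the "[50, 91, ...]" string into a list of ints.
        row["PAGE NUMBERS"] = ast.literal_eval(row["PAGE NUMBERS"])
        rows.append(row)
    return rows


rows = read_metadata(io.StringIO(SAMPLE))
```

Each parsed row pairs a HathiTrust volume identifier (HTID) with the list of sampled page numbers, so a volume's pages can be looked up by identifier.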