Table 1
The optimal feature space and hyperparameters for our fiction and non-fiction SVM classifiers, obtained using 10-fold cross-validation.
| CLASSIFIER | FEATURE SPACE | PROBABILITY THRESHOLD | SVM HYPERPARAMETERS |
|---|---|---|---|
| Fiction | top-1k word uni-, bi-, trigrams | 0.75 | C = 1 & Gaussian Kernel |
| Non-Fiction | top-100k word uni-, bi-, trigrams | 0.70 | C = 1 & Linear Kernel |
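
The feature space and probability threshold in Table 1 can be illustrated with a minimal sketch. It assumes that "top-k" means the k most frequent word n-grams in the training pages and that a page is accepted when the classifier's predicted probability meets the threshold; the table states neither explicitly. The function names and the toy pages are hypothetical, and the SVM itself is not reproduced here.

```python
from collections import Counter
import re


def word_ngrams(text, n_max=3):
    """Yield word uni-, bi-, and trigrams from a page of text."""
    # Assumption: a simple lowercase regex tokenizer stands in for
    # whatever tokenization the classifiers actually used.
    tokens = re.findall(r"[a-z']+", text.lower())
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def top_k_vocabulary(pages, k):
    """Return the k most frequent n-grams across all pages."""
    # Assumption: "top-k" is ranked by raw frequency.
    counts = Counter()
    for page in pages:
        counts.update(word_ngrams(page))
    return [gram for gram, _ in counts.most_common(k)]


def accept_page(probability, threshold=0.75):
    """Keep a page only if the predicted probability meets the threshold."""
    return probability >= threshold


# Toy pages, purely illustrative.
pages = ["the cat sat on the mat", "the cat ran"]
vocab = top_k_vocabulary(pages, k=5)
```

With the thresholds from Table 1, `accept_page(p, threshold=0.75)` would correspond to the fiction setting and `accept_page(p, threshold=0.70)` to the non-fiction setting.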

Figure 1
The distribution of the number of fiction and non-fiction pages across the two hundred years of our dataset.

Table 2
A preview of the columns and rows in the metadata table released for Hathi-Fiction.
| HTID | YEAR | TITLE | AUTHOR | PAGE NUMBERS |
|---|---|---|---|---|
| nyp.33433090062153 | 1947 | Rebel halfback | Archibald, Joe, 1898- | [50, 91, 96, 155, 159] |
| emu.010002632416 | 1852 | The soldier of fortune | Curling, Henry, 1803–1864. | [159, 91, 204, 155, 166] |
| emu.010002426066 | 1886 | Virginia the American | Edwardes, Charles. | [159, 204, 250, 166, 155] |
| emu.010002588974 | 1895 | Moths/by Ouida. | Ouida, 1839–1908 | [166, 155, 250, 204, 159] |
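
As a rough illustration of how a table shaped like Table 2 might be read, the sketch below parses two of the rows shown above. It assumes the metadata is distributed as a CSV file whose PAGE NUMBERS column holds a JSON-style list; the actual release format is not stated here, and `read_metadata` is a hypothetical helper.

```python
import csv
import io
import json

# Two rows copied from Table 2, laid out as CSV (the file format is an assumption).
SAMPLE = '''HTID,YEAR,TITLE,AUTHOR,PAGE NUMBERS
nyp.33433090062153,1947,Rebel halfback,"Archibald, Joe, 1898-","[50, 91, 96, 155, 159]"
emu.010002632416,1852,The soldier of fortune,"Curling, Henry, 1803–1864.","[159, 91, 204, 155, 166]"
'''


def read_metadata(handle):
    """Parse metadata rows, converting the year and page list to native types."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "htid": row["HTID"],
            "year": int(row["YEAR"]),
            "title": row["TITLE"],
            "author": row["AUTHOR"],
            # Assumption: the page list is serialized as a JSON array.
            "pages": json.loads(row["PAGE NUMBERS"]),
        })
    return records


records = read_metadata(io.StringIO(SAMPLE))
```

Each record then carries the volume identifier together with the page numbers listed for that volume, so the pages can be looked up per HTID.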

Figure 2
The distribution of three BookNLP supersense categories – artifacts, contact verbs, motion verbs – for pages sampled from 1800 to 1999. The left column corresponds to our fiction data and the right column is for our non-fiction data.

Figure 3
The distribution of four features from our Enriched Feature set – average sentence length, Tuldava score, NRC positive score, and VADER positive score – across our dataset of fiction pages (red) and non-fiction pages (blue) sampled from 1800 to 1999.

Figure 4
The distribution of the percentage of dialog and of Tuldava scores for pages sampled from 1800 to 1999. The left column corresponds to the dataset derived from Underwood et al. (2020) and the right column corresponds to our Hathi1M fiction data.
