HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

Sunyam Bagga; Andrew Piper

doi:10.5334/johd.71

Abstract

We present a new dataset, built on prior work, consisting of 1,671,370 randomly sampled pages of English-language prose published between 1800 and 2000 and divided roughly between fictional and non-fictional modes of writing. In addition to focusing on the “page” as the basic bibliographic unit, our work, in contrast to prior work, employs a single predictive model for the entire historical period under consideration. Besides publication metadata, we provide an enriched set of 107 features, including part-of-speech tags, sentiment scores, word supersenses, and more. Our data is designed to give researchers in the digital humanities large yet portable random samples of historical writing across two foundational modes of English prose. Using our enriched features, we present initial insights into how linguistic patterns change across this historical period, as possible pointers to future work. The data can be accessed at https://doi.org/10.7910/DVN/HAKKUA.

