HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

By: Sunyam Bagga and Andrew Piper
Open Access | Mar 2022

References

  1. Bamman, D., Underwood, T., & Smith, N. A. (2014). A Bayesian mixed effects model of literary character. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 370–379). DOI: 10.3115/v1/P14-1035
  2. Bode, K. (2020). Why you can’t model away bias. Modern Language Quarterly, 81(1), 95–124. DOI: 10.1215/00267929-7933102
  3. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., et al. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.747. DOI: 10.18653/v1/2020.acl-main.747
  4. Foucault, M. (2013). Archaeology of knowledge. London: Routledge. DOI: 10.4324/9780203604168
  5. Fyfe, P., & Ge, Q. (2018). Image analytics and the nineteenth-century illustrated newspaper. Journal of Cultural Analytics, 1(2), 11032. DOI: 10.22148/16.026
  6. Gil, A., & Ortega, É. (2016). Global outlooks in digital humanities: Multilingual practices and minimal computing. In Doing digital humanities (pp. 58–70). Routledge.
  7. Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Retrieved from https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109
  8. Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning (ECML) (pp. 137–142). Berlin: Springer. DOI: 10.1007/BFb0026683
  9. Koselleck, R. (2004). Futures past: On the semantics of historical time. Columbia University Press.
  10. Luhmann, N. (1995). Social systems. Stanford: Stanford University Press.
  11. Mak, B. (2011). How the page matters. Toronto: University of Toronto Press.
  12. Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436–465. DOI: 10.1111/j.1467-8640.2012.00460.x
  13. Organisciak, P., Schmidt, B. M., & Downie, J. S. (2022). Giving shape to large digital libraries through exploratory data analysis. Journal of the Association for Information Science and Technology, 73(2), 317–332. Retrieved from https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.24547. DOI: 10.1002/asi.24547
  14. Piper, A., Wellmon, C., & Cheriet, M. (2020). The page image: Towards a visual history of digital documents. Book History, 23(1), 365–397. DOI: 10.1353/bh.2020.0010
  15. Schmidt, B. (2018). Stable random projection: Lightweight, general-purpose dimensionality reduction for digitized libraries. Journal of Cultural Analytics, 3(1). DOI: 10.22148/16.025
  16. Underwood, T. (2019). Distant horizons: Digital evidence and literary change. Chicago: University of Chicago Press. DOI: 10.7208/chicago/9780226612973.001.0001
  17. Underwood, T., Kimutis, P., & Witte, J. (2020). NovelTM datasets for English-language fiction, 1700–2009. Journal of Cultural Analytics, 5(2). DOI: 10.22148/001c.13147
  18. Wilkens, M. (2021). Too isolated, too insular: American literature and the world. Journal of Cultural Analytics, 6(3). DOI: 10.22148/001c.25273
DOI: https://doi.org/10.5334/johd.71 | Journal eISSN: 2059-481X
Language: English
Published on: Mar 11, 2022
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2022 Sunyam Bagga, Andrew Piper, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.