
Neural Language Models for Nineteenth-Century English

Open Access | Sep 2021

Abstract

We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising ≈5.1 billion tokens. The language model architectures include word type embeddings (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, for the type embeddings we trained separate instances on text published before 1850, and for BERT we trained four instances on different time slices. Our models have already been used in various downstream tasks, where they consistently improved performance. In this paper, we describe how the models were created and outline their reuse potential.

DOI: https://doi.org/10.5334/johd.48 | Journal eISSN: 2059-481X
Language: English
Published on: Sep 27, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Kasra Hosseini, Kaspar Beelen, Giovanni Colavizza, Mariona Coll Ardanuy, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.