MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library

By: Sil Hamilton and Andrew Piper
Open Access | Feb 2023

Abstract

This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. We provide each work with metadata including the work’s genre at the level of fiction or non-fiction, length in pages, original language, and the year the work was published. With a total page count of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower resource languages. We hope our provisions will accelerate empirical research into non-English prose and literature.
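As a rough illustration of how the per-work metadata described above (fiction or non-fiction tag, page count, language, publication year) might be queried, the following minimal Python sketch selects non-English fiction and tallies it by language. The column names (`volume_id`, `language`, `year`, `pages`, `is_fiction`), the language codes, and the sample rows are hypothetical placeholders chosen for illustration; they are not the dataset's actual schema or contents, which should be checked against the published files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The column names and values are assumptions
# made for this sketch, not the real MultiHATHI schema or data.
SAMPLE = """volume_id,language,year,pages,is_fiction
v1,fre,1850,320,1
v2,ger,1901,210,0
v3,fre,1923,180,1
v4,eng,1888,400,1
"""


def non_english_fiction(rows):
    """Keep rows tagged as fiction, not in English, and dated after 1799."""
    return [
        r for r in rows
        if r["is_fiction"] == "1"
        and r["language"] != "eng"
        and int(r["year"]) > 1799
    ]


def summarize(rows):
    """Return a per-language volume count and the total page count."""
    by_language = Counter(r["language"] for r in rows)
    total_pages = sum(int(r["pages"]) for r in rows)
    return by_language, total_pages


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selected = non_english_fiction(rows)
by_language, total_pages = summarize(selected)
print(dict(by_language), total_pages)
```

With the real files, the in-memory sample would be replaced by an open file handle, and the field names and value encodings adjusted to whatever the dataset actually uses.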

DOI: https://doi.org/10.5334/johd.95 | Journal eISSN: 2059-481X
Language: English
Published on: Feb 8, 2023
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2023 Sil Hamilton, Andrew Piper, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.