Google Books Ngrams Recompressed and Searchable
Open Access
Dec 2012

Abstract

One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams. First, the textual format (compressed with Deflate) in which they are distributed is highly inefficient. Second, we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, is of seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams that also enables search for an arbitrary n-gram (i.e., retrieval of its associated statistics) in an average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.
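
To illustrate how Deflate-compressed n-gram data can remain searchable, the following is a minimal sketch of one generic approach: sorted records are packed into independently compressed blocks, and a small in-memory index of each block's first key is searched with binary search. This is an assumption made for illustration only and is not the preprocessing scheme described in the paper; the function names `build_index` and `lookup`, the block size, and the sample counts are all hypothetical.

```python
import bisect
import zlib

BLOCK_SIZE = 2  # records per block; deliberately tiny for illustration


def build_index(records, block_size=BLOCK_SIZE):
    """Sort (ngram, count) records and pack them into Deflate-compressed blocks.

    Returns (first_keys, blocks), where first_keys[i] is the smallest n-gram
    stored in blocks[i]. Hypothetical layout, not the paper's format.
    """
    records = sorted(records)
    first_keys, blocks = [], []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        text = "\n".join(f"{ngram}\t{count}" for ngram, count in chunk)
        first_keys.append(chunk[0][0])
        # zlib.compress emits a Deflate stream in a zlib wrapper.
        blocks.append(zlib.compress(text.encode("utf-8")))
    return first_keys, blocks


def lookup(ngram, first_keys, blocks):
    """Return the count stored for ngram, or None if it is absent."""
    # Find the last block whose first key is <= the query.
    pos = bisect.bisect_right(first_keys, ngram) - 1
    if pos < 0:
        return None
    # Only this one block is decompressed and scanned.
    text = zlib.decompress(blocks[pos]).decode("utf-8")
    for line in text.split("\n"):
        key, count = line.split("\t")
        if key == ngram:
            return int(count)
    return None


# Made-up toy data; real Google Books Ngrams records carry more fields.
sample = [
    ("big data", 120),
    ("computational linguistics", 45),
    ("open access", 7),
    ("a b", 3),
    ("zip file", 9),
]
first_keys, blocks = build_index(sample)
print(lookup("open access", first_keys, blocks))
print(lookup("missing ngram", first_keys, blocks))
```

In a sketch like this, larger blocks tend to compress better but make each lookup decompress more data, so the block size trades compression ratio against search time.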

DOI: https://doi.org/10.2478/v10209-011-0015-8 | Journal eISSN: 2300-3405 | Journal ISSN: 0867-6356
Language: English
Page range: 271 - 281
Published on: Dec 22, 2012
Published by: Poznan University of Technology
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2012 Szymon Grabowski, Jakub Swacha, published by Poznan University of Technology
This work is licensed under the Creative Commons License.