Google Books Ngrams Recompressed and Searchable

Szymon Grabowski; Jakub Swacha

doi:10.2478/v10209-011-0015-8

Foundations of Computing and Decision Sciences

Volume 37 (2012): Issue 4 (December 2012)

Open Access

Dec 2012

Abstract

One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available for several languages in July 2009. There are two problems with Google Books Ngrams. First, the textual format (compressed with Deflate) in which they are distributed is highly inefficient. Second, we are not aware of any tool facilitating search over those data apart from the Google viewer, which, as a Web tool, has seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams that also enables search for an arbitrary n-gram (i.e., retrieval of its associated statistics) in an average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.
