
The Tokenization Problem: Understanding Generative AI's Computational Language Bias

Open Access | Aug 2024

Abstract

The revolutionary potential of ChatGPT raises a critical question: revolutionary for whom? This paper examines potential inequalities in how ChatGPT "tokenizes" texts across different languages. Translated passages of 3,000-5,000 characters in 108 languages were analyzed using OpenAI's tokenizer. English was treated as the baseline, and a "token multiplier" was calculated for each language. The analysis revealed that English was over 13 times more efficient than some languages. Key findings showed that while pre-training dataset size and character counts play nuanced roles, the alphabet used and the prevalence of special characters significantly impact efficiency. These discrepancies have real-world implications for usage costs and model limitations across languages. Thus, consciously addressing the tokenization imbalance is critical for ensuring equitable access to AI systems across diverse languages.
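To illustrate the measurement the abstract describes, the following is a minimal sketch of how a "token multiplier" relative to English might be computed with OpenAI's tiktoken library. The encoding name, sample passages, and function name are illustrative assumptions, not the authors' actual setup or data.

    # Minimal sketch (not the authors' code): a per-language "token multiplier"
    # is the ratio of tokens a passage needs versus its parallel English text.
    import tiktoken

    # cl100k_base is the encoding used by GPT-3.5/GPT-4-era models; the exact
    # tokenizer version used in the paper is an assumption here.
    enc = tiktoken.get_encoding("cl100k_base")

    def token_multiplier(text: str, english_baseline: str) -> float:
        """Tokens needed for `text` divided by tokens for the parallel English text."""
        return len(enc.encode(text)) / len(enc.encode(english_baseline))

    # Hypothetical parallel passages (translations of the same sentence).
    english = "All human beings are born free and equal in dignity and rights."
    greek = "Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα."

    print(f"Token multiplier (Greek vs. English): {token_multiplier(greek, english):.2f}")

A multiplier above 1.0 means the language consumes more tokens than English for the same content, which translates directly into higher API costs and a smaller effective context window.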

DOI: https://doi.org/10.5334/uproc.123 | Journal eISSN: 2631-5602
Language: English
Published on: Aug 28, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services

© 2024 Marijana Asprovska, Nathan Hunter, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.