Abstract
The revolutionary potential of ChatGPT raises a critical question: Revolutionary for whom? This paper examines potential inequalities in how ChatGPT “tokenizes” texts across different languages. Translated passages of 3,000-5,000 characters in 108 languages were analyzed using OpenAI's tokenizer. English was treated as the baseline, and a “token multiplier” was calculated for each language. The analysis revealed that English was over 13 times more efficient than some languages. Key findings showed that, while pre-training dataset size and character counts play nuanced roles, the alphabet used and the prevalence of special characters significantly affect tokenization efficiency. These discrepancies have real-world implications for usage costs and model limitations across languages. Consciously addressing this tokenization imbalance is therefore critical for ensuring equitable access to AI systems across diverse languages.
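The token-multiplier measure can be illustrated with a minimal sketch, assuming OpenAI's tiktoken library and the cl100k_base encoding; the passages and language codes below are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of the token-multiplier computation: tokens needed for a
# translated passage divided by tokens needed for its English counterpart.
# Assumes the tiktoken library and the cl100k_base encoding.
import tiktoken


def token_multiplier(passage: str, english_passage: str,
                     encoding_name: str = "cl100k_base") -> float:
    """Ratio of token counts: translated passage vs. English baseline."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(passage)) / len(enc.encode(english_passage))


# Hypothetical usage with parallel passages keyed by language code.
passages = {
    "en": "Artificial intelligence is changing how people work and learn.",
    "hi": "कृत्रिम बुद्धिमत्ता लोगों के काम करने और सीखने के तरीके को बदल रही है।",
}
for lang, text in passages.items():
    print(lang, round(token_multiplier(text, passages["en"]), 2))
```

A multiplier of 1.0 indicates parity with English; larger values indicate that the same content consumes proportionally more tokens, and hence more cost and context-window space, in that language.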
