The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the “NLL paradox” from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.


💡 Research Summary

The paper introduces the notion of a “script tax,” a systematic penalty imposed by tokenizers on certain writing systems, and quantifies it across three dimensions: tokenization fragmentation (fertility), computational overhead (latency), and information cost (bits per character, BPC). Using paired sentences that convey identical linguistic content but are rendered in two orthographic variants (A and B), the authors evaluate two widely used multilingual masked language models, mBERT and XLM‑R, each with its pretrained subword tokenizer.

Methodology: For each sentence pair, the tokenizer produces a token sequence whose length Lₘ(x) and word count W(x) are measured. Fertility is defined as Fₘ(x) = Lₘ(x)/W(x); the average difference ΔFₘ between the two orthographies quantifies fragmentation. To avoid the “NLL paradox,” in which heavy fragmentation can artificially lower token‑level loss, the authors compute BPC = NLL / (C · ln 2), where NLL is the total negative log‑likelihood in nats (divided by ln 2 to convert to bits) and C is the number of characters (excluding spaces). Computational cost is measured as median inference latency on identical hardware, with a latency ratio ρ_lat = Lat_B/Lat_A. Robustness of the orthography conversion pipeline is checked via a round‑trip character error rate (CER_rt).
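The two core metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the token, word, and NLL counts in the example are hypothetical:

```python
import math

def fertility(num_tokens: int, num_words: int) -> float:
    """Tokens per word for one sentence: F_m(x) = L_m(x) / W(x)."""
    return num_tokens / num_words

def bits_per_character(nll_nats: float, num_chars: int) -> float:
    """Character-normalized information cost.

    Converts a total negative log-likelihood from nats to bits
    (divide by ln 2) and normalizes by the character count C
    (spaces excluded), so scores stay comparable across
    tokenizations with different degrees of fragmentation.
    """
    return nll_nats / (num_chars * math.log(2))

# Hypothetical sentence pair with identical content and 6 words each:
# orthography A tokenizes into 12 tokens, orthography B into 40.
f_a = fertility(12, 6)   # 2.0 tokens/word
f_b = fertility(40, 6)   # ~6.67 tokens/word
delta_f = f_b - f_a      # per-sentence fragmentation gap
```

Averaging `delta_f` over the evaluation set yields the ΔFₘ statistic used in the paper.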

Results: The higher‑fragmentation orthography (B) requires 6.73–6.85 tokens per word, compared to 2.10–2.35 tokens per word for the lower‑fragmentation orthography (A), a roughly 3.4× increase in sequence length. This translates into a dramatic slowdown: throughput drops from ~3.8 sentences/second for A to ~0.23 sentences/second for B, a 16.5× latency tax. When normalized by characters, BPC rises substantially—by +19.7 % for mBERT (8.06 → 9.65) and by +47.1 % for XLM‑R (12.19 → 17.94). The round‑trip CER of 0.31 indicates that while the conversion process introduces some noise, the primary source of disparity is tokenizer‑driven fragmentation rather than mapping errors.
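As a quick sanity check, the headline ratios follow directly from the reported numbers:

```python
# Reproducing the headline ratios from the figures reported above.
throughput_a = 3.8    # sentences/second, orthography A
throughput_b = 0.23   # sentences/second, orthography B
latency_tax = throughput_a / throughput_b        # ~16.5x slowdown

bpc_increase_mbert = (9.65 - 8.06) / 8.06        # ~+19.7%
bpc_increase_xlmr = (17.94 - 12.19) / 12.19      # ~+47.2%
```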

Discussion: The authors argue that fertility is the upstream driver of both compute and information inefficiencies; the quadratic scaling of Transformer attention with sequence length amplifies the impact of modest token‑level inflation. They highlight that token‑level negative log‑likelihood can be misleading under heavy fragmentation, and that BPC provides a more faithful measure of information efficiency. Practical implications include the need for script‑aware tokenization strategies—such as vocabulary augmentation, script‑specific tokenizers, or tokenizer‑free models—to mitigate the script tax. Reporting metrics like fertility, BPC, and latency alongside traditional perplexity is advocated to expose hidden inequities.
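The quadratic amplification can be made concrete with a back‑of‑the‑envelope sketch. The 3.4× figure is the paper's reported sequence‑length inflation; the cost model below is a simplification that counts only self‑attention compute:

```python
def attention_cost_ratio(seq_len_ratio: float) -> float:
    """Self-attention compute scales as O(L^2) in sequence length L,
    so a k-fold increase in length costs roughly k^2 more FLOPs."""
    return seq_len_ratio ** 2

k = 3.4                                   # reported token-length inflation (B vs. A)
amplification = attention_cost_ratio(k)   # ~11.6x attention compute
```

The measured 16.5× slowdown exceeds this attention‑only estimate, which is consistent with other length‑dependent costs (e.g., feed‑forward layers and output projections applied per token) also growing with the longer sequences.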

Limitations: The study is confined to two orthographic variants and two model families; other scripts (e.g., Chinese characters, Arabic) may exhibit different magnitudes of tax. The latency measurements depend on a specific hardware configuration, though the relative slowdown should generalize qualitatively. The round‑trip conversion pipeline, despite the CER check, may still leave residual artifacts.

Conclusion: By formalizing and empirically measuring the script tax, the paper demonstrates that pretrained tokenizers can create substantial, systematic disparities in both computational cost and modeling efficiency for languages that are tokenized more finely. The findings call for more equitable tokenizer design and for evaluation practices that incorporate character‑normalized loss and compute metrics, thereby promoting fairness and efficiency in multilingual NLP.

