Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service
Providers of LLM-as-a-service have predominantly adopted a simple pricing model: users pay a fixed price per token. Consequently, one might expect two different users to pay the same price for the same output string generated from the same input prompt. In our work, we show that, surprisingly, this is not (always) true. We find empirical evidence that, particularly for non-English outputs, both proprietary and open-weights LLMs often generate the same (output) string with multiple different tokenizations, even under the same input prompt, and this in turn leads to arbitrary price variation. To address the problem of tokenization multiplicity, we introduce canonical generation, a type of constrained generation that restricts LLMs to generating only canonical tokenizations – the unique tokenization into which each string is tokenized during an LLM's training process. Further, we introduce an efficient sampling algorithm for canonical generation based on the Gumbel-Max trick. Experiments on a variety of natural language tasks demonstrate that our sampling algorithm for canonical generation is comparable to standard sampling in terms of performance and runtime, and that it solves the problem of tokenization multiplicity.
💡 Research Summary
The paper “Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service” investigates a subtle but economically significant issue in the current pay‑per‑token pricing model used by most large‑language‑model (LLM) APIs. While it is natural to assume that two users who submit the same prompt and receive the same output string will be billed the same amount, the authors demonstrate that this assumption fails, especially for non‑English outputs. Because modern tokenizers (e.g., BPE, Unigram, WordPiece) can map a single string to multiple distinct token sequences, the same textual result can be counted as a different number of tokens, leading to price discrepancies that are arbitrary from the user’s perspective.
The authors first formalize deterministic tokenizers as a tuple (Σ, V, enc, dec) where enc maps a character string to a token sequence and dec reverses the process. Although the encoder is deterministic, multiple tokenizations may exist for a given string; the one selected during training is called the canonical tokenization. They then describe the standard autoregressive generation process, highlighting that the model’s probability distribution over tokens (d_s) may differ from the true distribution (p_s) due to finite training data, allowing non‑canonical token sequences to be generated.
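The gap between a deterministic encoder and a string with many valid tokenizations can be made concrete with a toy example. The sketch below is an illustration only, not the paper's tokenizer: the vocabulary, the greedy longest-match encoder (standing in for the canonical tokenization chosen during training), and the enumeration helper are all assumptions for demonstration.

```python
# Toy illustration of tokenization multiplicity (not the paper's formalism):
# dec maps token sequences to strings, enc is deterministic, yet several
# distinct token sequences decode to the same string.

VOCAB = {"a", "b", "c", "ab", "abc"}

def dec(tokens):
    """dec: token sequence -> string (simple concatenation)."""
    return "".join(tokens)

def enc(s):
    """Deterministic greedy longest-match encoder, standing in for the
    canonical tokenization selected during training."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary token covers position {i}")
    return tokens

def all_tokenizations(s):
    """Enumerate every token sequence t with dec(t) == s."""
    if s == "":
        return [[]]
    results = []
    for j in range(1, len(s) + 1):
        if s[:j] in VOCAB:
            for rest in all_tokenizations(s[j:]):
                results.append([s[:j]] + rest)
    return results

print(enc("abc"))               # canonical tokenization: ['abc']
print(all_tokenizations("abc")) # three token sequences decode to "abc"
```

Billing the three-token sequence `['a', 'b', 'c']` costs three times as much as the canonical single-token `['abc']`, even though the user receives the same string.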
Empirically, the paper examines three tasks—translation, spell‑checking, and re‑phrasing—across 100 prompts per task, each sampled 100 times with different random seeds. Models evaluated include proprietary APIs (GPT‑4, Gemini, Claude) and open‑weight models (Llama‑8B, Qwen‑7B). For German outputs, they find that many models produce identical strings with token counts that differ (e.g., 26 vs. 28 tokens for the same translation), causing up to a 7.7 % cost difference. The conditional probability that two identical strings have different token lengths ranges from near zero for some models to several percent for others, and the phenomenon appears across languages (see Appendix G.1).
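The reported cost gap follows directly from the token counts. A back-of-the-envelope check, using the 26- vs. 28-token example from the summary above and a placeholder per-token price:

```python
# Cost gap between two tokenizations of the same output string.
# Token counts are from the German-translation example above; the
# per-token price is a hypothetical placeholder.

def cost(n_tokens, price_per_token):
    return n_tokens * price_per_token

PRICE = 1e-5  # hypothetical $/output token

n_short, n_long = 26, 28  # two tokenizations of the same translation
gap = (cost(n_long, PRICE) - cost(n_short, PRICE)) / cost(n_short, PRICE)
print(f"{gap:.1%}")  # 7.7%
```

Note that the relative gap is independent of the per-token price, which is why the variation is arbitrary from the user's perspective: it depends only on which tokenization the model happened to sample.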
To eliminate this source of price variance, the authors introduce canonical generation, a constrained decoding method that forces the model to emit only the canonical tokenization of any string. They develop an efficient sampling algorithm based on the Gumbel‑Max trick: at each generation step, the algorithm draws Gumbel noise for each token, adds it to the log‑probabilities, and selects the token that yields a canonical continuation. The key theoretical insight is that generating a canonical tokenization requires the model to stay within the set of partial canonical tokenizations at every step.
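One step of this procedure can be sketched as follows. This is a schematic reading of the summary, not the paper's implementation: the `is_canonical_prefix` predicate is a hypothetical stand-in for the paper's test that a partial token sequence can still be extended to a canonical tokenization, and the log-probabilities are assumed given. The Gumbel-Max trick guarantees that taking the noisy argmax over the allowed tokens samples from the model's distribution renormalized over that set.

```python
import math
import random

def gumbel_max_canonical_step(logprobs, prefix, is_canonical_prefix):
    """One decoding step of canonical generation via the Gumbel-Max trick.

    logprobs: dict mapping token -> model log-probability at this step
    prefix: token sequence generated so far
    is_canonical_prefix: predicate (hypothetical stand-in) that tests
        whether a partial token sequence remains extendable to a
        canonical tokenization
    """
    best_token, best_score = None, -math.inf
    for tok, lp in logprobs.items():
        if not is_canonical_prefix(prefix + [tok]):
            continue  # mask out tokens that leave the canonical set
        g = -math.log(-math.log(random.random()))  # Gumbel(0, 1) noise
        score = lp + g  # argmax of (log p + Gumbel) samples from softmax
        if score > best_score:
            best_token, best_score = tok, score
    return best_token
```

Because non-canonical continuations are simply masked before the argmax, the sampled sequence is guaranteed to stay within the set of partial canonical tokenizations at every step, matching the theoretical insight described above.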
Experimental results show that canonical generation matches standard sampling in terms of downstream performance (BLEU, ROUGE, accuracy) across all three tasks and models, while runtime overhead is negligible. Moreover, after applying canonical generation, the token length discrepancy disappears entirely, guaranteeing that identical outputs incur identical charges under a pay‑per‑token scheme.
The paper also situates the problem within the economics of LLM‑as‑a‑service. Prior work suggested pay‑per‑character pricing to avoid token‑based arbitrage, but this approach reduces provider margins and introduces implementation challenges. Canonical generation preserves the simplicity of token‑based billing while removing the incentive for providers (or users) to exploit tokenization multiplicity.
In conclusion, the study provides the first systematic evidence that tokenization multiplicity can cause real‑world billing inconsistencies, proposes a theoretically grounded and practically efficient solution, and demonstrates that the solution does not sacrifice model quality. The work opens avenues for further research on multilingual tokenizers, integration with fine‑tuning pipelines, and broader economic analyses of fair pricing mechanisms for AI services.