Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations


Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.


💡 Research Summary

Modern large language models (LLMs) such as LLaMA‑3, OLMo‑2 and Qwen‑2.5 are typically trained with a deterministic sub‑word tokenizer, most often Byte‑Pair Encoding (BPE). By design, a given string is mapped to a single “canonical” token sequence, and this deterministic mapping has long been assumed to be a core part of the model’s understanding of text. The paper “Broken Tokens? Your Language Model can Secretly Handle Non‑Canonical Tokenizations” challenges this assumption by systematically probing how LLMs behave when presented with tokenizations that were never seen during training.

Key Contributions

  1. Definition of Non‑Canonical Tokenizations

    • Random tokenization: Starting from the canonical token sequence, each token is recursively split into a valid pair of sub‑tokens (as in BPE‑dropout). The split is chosen uniformly at random, producing a token sequence that is more granular than the canonical one.
    • Character‑level tokenization: The input string is broken down into individual characters (bytes), yielding the most fine‑grained possible tokenization for English text.
  2. Robustness Evaluation Across 20 Benchmarks
    The authors evaluate three instruction‑tuned models on a diverse suite of tasks (multiple‑choice, short‑answer, math, code understanding, etc.). When fed random tokenizations, Qwen‑2.5‑7B‑INSTRUCT retains 93.4 % of its original performance, LLaMA‑3‑8B‑INSTRUCT retains 87.7 %, and OLMo‑2‑7B‑INSTRUCT retains 73.1 %. With character‑level tokenization, retention drops further to 90.8 %, 79.4 %, and 62.0 %, respectively. Stronger models (higher base accuracy) consistently show higher retention, indicating that model capability correlates with tokenization robustness.

  3. Granularity vs. Performance
    By varying the BPE‑dropout probability from 0.0 to 0.9, the authors create tokenizations with different “length ratios” (the token count relative to the canonical sequence). A clear negative correlation emerges: the more granular the tokenization, the larger the performance drop. A Kendall’s τ test finds the trend statistically significant (p = 0.003). This suggests that while models can handle non‑canonical inputs, excessive fragmentation hampers their ability to maintain context.

  4. Performance Gains from Purpose‑Built Tokenizations
    The paper demonstrates that deliberately chosen non‑canonical tokenizations can improve performance on tasks that benefit from fine‑grained orthographic information:

    • Character‑level tokenization yields +6.99 % on a “count the most frequent letter” task, +7.74 % on acronym generation, and a striking +14.3 % on a code‑description multiple‑choice benchmark.
    • Right‑aligned digit grouping (segmenting numbers from right to left in groups of three) boosts large‑number arithmetic accuracy from 36.5 % to 70.2 % (+33.7 %), confirming prior observations that the default left‑to‑right grouping is sub‑optimal for arithmetic reasoning.
  5. Origin of Robustness: The Role of Instruction‑Tuning
    The authors compare base (pre‑training only) models with models that have undergone supervised fine‑tuning (SFT), direct preference optimization (DPO), and subsequent post‑training stages. Base models treat non‑canonical tokens as misspellings and attempt to reproduce them, leading to nonsensical continuations. In contrast, instruction‑tuned models have learned a clear separation between the “question” turn (input) and the “answer” turn (output). This turn‑based structure enables them to interpret non‑canonical tokens as noisy input while still generating fluent, correct responses. Ablation experiments pinpoint the SFT stage—particularly the explicit question‑answer formatting—as the critical factor that endows models with tokenization robustness.
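
Two of the interventions described above can be sketched in a few lines of Python. This is a toy illustration under assumed conventions: `VOCAB` is a made‑up miniature vocabulary (not any model’s actual tokenizer), `random_split` mimics the recursive BPE‑dropout‑style splitting from point 1, and `right_aligned_groups` implements the right‑to‑left digit grouping from point 4.

```python
import random

# Toy vocabulary (an illustrative assumption, not a real tokenizer's):
# every piece appearing in a tokenization must be a member of this set.
VOCAB = {"t", "o", "k", "e", "n", "to", "en", "tok", "token"}

def random_split(token, p=0.5, rng=random):
    """Recursively split a token into a valid pair of sub-tokens,
    BPE-dropout style: with probability p, replace the token by a
    randomly chosen in-vocabulary split, then recurse on both halves."""
    if len(token) <= 1 or rng.random() > p:
        return [token]
    splits = [(token[:i], token[i:]) for i in range(1, len(token))
              if token[:i] in VOCAB and token[i:] in VOCAB]
    if not splits:          # no valid split exists; keep the token whole
        return [token]
    left, right = rng.choice(splits)
    return random_split(left, p, rng) + random_split(right, p, rng)

def right_aligned_groups(digits, size=3):
    """Segment a digit string right-to-left into groups of `size`,
    e.g. '1234567' -> ['1', '234', '567']."""
    groups = []
    i = len(digits)
    while i > 0:
        groups.append(digits[max(0, i - size):i])
        i -= size
    return groups[::-1]

# With p=1.0 every splittable token is fully decomposed; in this toy
# vocabulary that bottoms out at character level:
print(random_split("token", p=1.0))   # ['t', 'o', 'k', 'e', 'n']
print(right_aligned_groups("1234567"))  # ['1', '234', '567']
```

Note that any output of `random_split` concatenates back to the original string, so the intervention changes only the segmentation the model sees, never the underlying text.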

Implications

  • Tokenizer as a Dynamic Inference‑Time Control: The findings overturn the belief that a model’s tokenizer is a static, immutable component. Instead, tokenization can be altered at inference without any additional training, opening a new axis for performance optimization.
  • Potential for Automated Tokenization Search: Future work could develop meta‑learning or reinforcement‑learning methods that automatically discover the optimal tokenization for a given downstream task, akin to prompt engineering but at the token level.
  • Broader Applicability: While the experiments focus on English text and numeric data, the principle likely extends to morphologically rich languages where sub‑word boundaries are more ambiguous. Investigating non‑canonical tokenizations for languages such as Korean, Arabic, or Hindi could reveal further gains.
  • Re‑thinking Model‑Tokenizer Coupling: The paper suggests that the coupling between model and tokenizer is looser than previously assumed. This may inspire new training paradigms where models are deliberately exposed to a distribution of tokenizations, or where tokenizers are learned jointly with the model in a more flexible manner.

Conclusion

“Broken Tokens?” provides compelling empirical evidence that instruction‑tuned LLMs are surprisingly resilient to unseen, non‑canonical tokenizations and can even benefit from carefully crafted alternative segmentations. The robustness originates from the instruction‑tuning phase, where the model learns to separate input noise from output generation. This work reframes tokenization from a fixed preprocessing step to a tunable inference‑time lever, paving the way for novel methods that adapt tokenization to the needs of each task, potentially unlocking further performance improvements without any model weight updates.

