Compressed code: the hidden effects of quantization and distillation on programming tokens
Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token representations, we characterize how programming languages are encoded in LLM tokenizers by analyzing their vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insights into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques (quantization, distillation, model scaling, and task-specific fine-tuning) affect token-level representations and code generation quality. Our experiments, supported by probability distribution analysis and evaluation metrics, reveal critical insights into token-level behavior and provide empirically validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both the theoretical understanding of LLM code generation and the practical deployment of optimized models in production environments.
💡 Research Summary
The paper “Compressed code: the hidden effects of quantization and distillation on programming tokens” investigates how large language models (LLMs) encode programming languages at the token level and how various model compression techniques—quantization, distillation, scaling, and task‑specific fine‑tuning—affect these token‑level representations and downstream code generation quality.
Model and Tokenizer Selection
The authors start by selecting seven open‑source LLMs that excel at coding tasks: Qwen2.5‑Coder, Qwen2.5, Athene‑V2‑Chat, DeepSeek‑V2.5, DeepSeek‑V3, DeepSeek‑R1, and Llama 3.1, together with several distilled variants of DeepSeek‑R1. By inspecting the tokenizers, they discover that only three distinct vocabularies are actually used: (1) the Qwen2.5 family (including Qwen2.5‑Coder and Athene‑V2‑Chat), (2) the DeepSeek‑R1/DeepSeek‑V3 family, and (3) the Llama 3.1/DeepSeek‑R1‑Llama family. This finding highlights that many “code‑specialized” models share the same token set as general‑purpose models.
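The vocabulary-identity check behind this finding can be sketched as follows. The miniature vocabularies below are invented stand-ins; in practice they would be loaded from each model's tokenizer files (e.g., `tokenizer.json`):

```python
# Sketch: checking whether two models actually share a tokenizer vocabulary.
# Two tokenizers are treated as identical if they map the same token strings
# to the same integer ids.

def same_vocabulary(vocab_a: dict, vocab_b: dict) -> bool:
    """Return True if both vocabularies are exactly the same mapping."""
    return vocab_a == vocab_b

# Hypothetical miniature vocabularies for illustration only.
qwen_vocab   = {"def": 750, "class": 810, "(": 7, ")": 8}
athene_vocab = {"def": 750, "class": 810, "(": 7, ")": 8}  # same family as Qwen
llama_vocab  = {"def": 1120, "class": 940, "(": 5, ")": 6}  # distinct token set

print(same_vocabulary(qwen_vocab, athene_vocab))  # shared vocabulary
print(same_vocabulary(qwen_vocab, llama_vocab))   # different vocabulary
```

Running this comparison over the full tokenizer files of the seven models is what collapses them into the three families listed above.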
Programming‑Keyword Coverage and Rank Analysis
Using the ten most popular GitHub languages (Python, Java, Go, JavaScript, C++, TypeScript, PHP, Ruby, C, C#) plus Rust and the React framework, the authors compile a list of 276 reserved keywords. They then measure (a) whether each keyword appears in a tokenizer’s vocabulary and (b) the rank of that token within the BPE vocabulary (lower rank = higher frequency). Results show high coverage (>90%) for modern languages such as Python, TypeScript, Rust, Go, and C#, but markedly lower coverage for C, C++, and especially React. DeepSeek‑R1’s tokenizer consistently assigns higher ranks (i.e., lower frequency) to keywords, suggesting that many programming tokens are split into sub‑words, which can increase sequence length and degrade syntactic understanding.
The authors argue that mere presence/absence is insufficient; token rank directly reflects how often a token was seen during pre‑training and therefore how efficiently the model can process it. They also note that a substantial fraction of each vocabulary consists of “structural” tokens (brackets, parentheses, indentation symbols) rather than semantic words: 14.6% for Llama 3.1, 12.1% for Qwen2.5, and 5.1% for DeepSeek‑V3. This allocation influences the model’s baseline ability to generate syntactically correct code.
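Both measurements above can be sketched on a toy vocabulary. Here the vocabulary is a `{token: rank}` dict where a lower rank means the BPE merge was learned earlier (i.e., the token is more frequent); the tokens, ranks, and keyword list are illustrative stand-ins:

```python
# Toy BPE vocabulary: {token_string: rank}, lower rank = more frequent.
toy_vocab = {"def": 750, "class": 810, "return": 620, "{": 4, "}": 5,
             "(": 2, ")": 3, "    ": 10, "the": 50, "and": 60}

python_keywords = ["def", "class", "return", "lambda", "yield"]

def keyword_coverage(keywords, vocab):
    """Fraction of keywords present as single tokens, plus their ranks."""
    ranks = {kw: vocab[kw] for kw in keywords if kw in vocab}
    return len(ranks) / len(keywords), ranks

# Structural tokens: punctuation/indentation rather than semantic words.
STRUCTURAL = set("()[]{}<>:;,.") | {"    "}

def structural_fraction(vocab):
    """Share of the vocabulary made up of structural tokens."""
    return sum(1 for tok in vocab if tok in STRUCTURAL) / len(vocab)

cov, ranks = keyword_coverage(python_keywords, toy_vocab)
print(cov)   # 0.6 -- "lambda" and "yield" are absent, so they would be split
print(structural_fraction(toy_vocab))  # 0.5 in this toy vocabulary
```

A keyword missing from the vocabulary is exactly the case where the tokenizer must fall back to sub-word pieces, lengthening sequences as described above.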
Cold‑Start Probability Metric
To probe model behavior without any prompt, the authors introduce a “cold‑start” analysis. They compute the probability that a model spontaneously emits (i) programming keywords (PKP), (ii) special programming symbols (STP), (iii) the average keyword probability (KAP), and (iv) a control set of natural‑language tokens (NLP). This is evaluated on several distilled variants of DeepSeek‑R1 (student sizes from 1.5 B to 32 B parameters) and on Qwen2.5‑Coder‑7B.
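Under the assumption that we can read the model's logits for the very first generated token given an empty context, the four metrics reduce to sums and means over groups of token ids. The logits and token groupings below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 1000
logits = rng.normal(size=vocab_size)  # stand-in for first-token model output

# Hypothetical token-id groups for the three categories.
keyword_ids = [10, 11, 12, 13]   # e.g., "def", "if", "for", "class"
symbol_ids  = [0, 1, 2, 3]       # e.g., "(", ")", "{", "}"
natural_ids = [100, 101, 102]    # e.g., "the", "and", "of"

# Softmax over the whole vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

pkp = probs[keyword_ids].sum()   # total mass on programming keywords
stp = probs[symbol_ids].sum()    # total mass on special programming symbols
kap = probs[keyword_ids].mean()  # average per-keyword probability
nlp = probs[natural_ids].sum()   # natural-language control set

print(f"PKP={pkp:.4f} STP={stp:.4f} KAP={kap:.5f} NLP={nlp:.4f}")
```

Comparing these four numbers across a model family is what produces the observations below.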
Key observations:
- STP dominates the probability mass, ranging from 0.10 to 0.72 across models, confirming that structural tokens are the most “ready” outputs in a blank context.
- Moderate quantization (e.g., 4‑bit) surprisingly raises PKP, indicating a non‑linear effect where reducing precision can actually increase the relative likelihood of core programming tokens.
- Distillation, especially into smaller student models, skews the distribution: certain keywords become overly probable while others are suppressed, reflecting a loss of the teacher’s token‑level calibration.
- The KAP metric reveals that some compressed models develop a bias toward a subset of frequently‑used constructs (e.g., “def”, “if”), potentially limiting diversity in generated code.
Surprising Findings and Practical Implications
One of the most unexpected results is that code‑focused models do not possess a distinct token set; they reuse the same BPE vocabulary as general‑purpose LLMs. Consequently, improvements in code generation stem more from training data and model architecture than from tokenizer specialization.
The authors also uncover that token‑level compression effects are highly non‑linear. Light quantization can improve the balance between keyword and special‑token probabilities, while aggressive quantization or aggressive distillation can degrade the model’s ability to emit rare but syntactically important tokens.
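One intuition for this non-linearity can be sketched with plain uniform symmetric quantization of a weight vector at decreasing bit widths. Real quantizers used in practice are more sophisticated; this toy round-trip only illustrates how reconstruction error grows sharply below roughly 4 bits:

```python
import numpy as np

def quantize_roundtrip(w, bits):
    """Quantize to a signed integer grid, then dequantize back."""
    levels = 2 ** (bits - 1) - 1          # e.g., 127 for 8-bit
    scale = np.abs(w).max() / levels      # per-tensor scale factor
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=4096)     # toy weight vector

for bits in (8, 4, 2):
    err = np.abs(w - quantize_roundtrip(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

The error does not scale linearly with bit width, which is consistent with the paper's observation that moderate quantization can leave (or even improve) token-level balance while aggressive low-bit regimes disproportionately damage rare tokens.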
Based on these insights, the paper offers actionable guidelines for practitioners:
- Tokenizer Choice – Prefer tokenizers with low‑rank (high‑frequency) representations of target language keywords; verify coverage for domain‑specific libraries (e.g., React) before deployment.
- Quantization Strategy – Target moderate precision (4‑6 bits) for production models to retain or even boost keyword probabilities; avoid extreme low‑bit regimes unless the downstream task tolerates reduced syntactic fidelity.
- Distillation Practices – When distilling, incorporate auxiliary losses that preserve the teacher’s token‑distribution (e.g., KL‑divergence on keyword logits) to mitigate bias.
- Vocabulary Augmentation – If a target language or framework is under‑represented, consider extending the existing BPE vocab with dedicated tokens rather than building a brand‑new tokenizer, thereby preserving compatibility with pretrained weights.
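The auxiliary loss suggested under "Distillation Practices" can be sketched as a KL divergence between teacher and student distributions restricted to keyword tokens. The logits, keyword ids, and temperature below are illustrative; a real pipeline would add this term to the usual distillation objective:

```python
import numpy as np

def softmax(x, temp=1.0):
    z = np.exp((x - x.max()) / temp)
    return z / z.sum()

def keyword_kl(teacher_logits, student_logits, keyword_ids, temp=2.0):
    """KL(teacher || student) over the keyword sub-distribution only."""
    p = softmax(teacher_logits[keyword_ids], temp)
    q = softmax(student_logits[keyword_ids], temp)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
teacher = rng.normal(size=1000)
aligned = teacher + rng.normal(scale=0.01, size=1000)  # well-calibrated student
skewed  = teacher + rng.normal(scale=2.0, size=1000)   # biased student

kw = list(range(10, 30))  # hypothetical keyword token ids
print(keyword_kl(teacher, aligned, kw))  # small: calibration preserved
print(keyword_kl(teacher, skewed, kw))   # larger: keyword bias introduced
```

Minimizing this term during distillation directly penalizes the keyword-probability skew the cold-start analysis detected in the smaller students.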
Conclusion
The study provides the first systematic, token‑level examination of how compression techniques reshape programming‑token distributions in LLMs. By coupling vocabulary‑rank analysis with a novel cold‑start probability framework, the authors reveal that (a) tokenizers across code‑specialized and general models are largely identical, (b) keyword coverage varies dramatically across languages and is tightly linked to token rank, and (c) compression can both help and hurt code generation depending on the degree of quantization or the specifics of the distillation pipeline. These findings bridge the gap between theoretical understanding of LLM token mechanics and the practical demands of deploying efficient, high‑quality code‑generation services in resource‑constrained environments.