D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

The predictive probability of the next token (P_token) in large language models (LLMs) is inextricably linked to the probability of relevance for the next piece of information, the purchase probability of the next product, and the execution probability of the next action, all of which fall under the scope of the task-level target distribution (P_task). While LLMs are known to generate samples that approximate real-world distributions, whether their fine-grained sampling probabilities faithfully align with task requirements remains an open question. Through controlled distribution-sampling simulations, we uncover a striking dichotomy in LLM behavior, distinguishing two model types: D-models (e.g., Qwen-2.5), whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models (e.g., Mistral-Small), whose P_token is more stable and better aligned with P_task. We further evaluate these two model types on downstream tasks such as code generation and recommendation, revealing systematic trade-offs between diversity and stability that shape task outcomes. Finally, we analyze the internal properties of both model families to probe their underlying mechanisms. These findings offer foundational insights into the probabilistic sampling behavior of LLMs and provide practical guidance on when to favor D- versus E-models. For web-scale applications, including recommendation, search, and conversational agents, our results inform model selection and configuration to balance diversity with reliability under real-world uncertainty, and improve the interpretability of sampling behavior.


💡 Research Summary

The paper investigates how the token‑level probability distribution (P_token) produced by large language models (LLMs) aligns with the task‑level target distribution (P_task) that defines the desired sampling behavior in real‑world applications such as search, recommendation, and code generation. The authors introduce two archetypal model families: “D‑models” (diversity‑oriented), exemplified by Qwen‑2.5, whose P_token shows large step‑to‑step fluctuations and often deviates strongly from P_task, and “E‑models” (stability‑oriented), exemplified by Mistral‑Small, whose P_token remains relatively stable and closely matches P_task.

To quantify these behaviors, the study defines two metrics: the e‑score and the Average Total Variation Distance (ATVD), together with a per‑step variant. The e‑score is the average of the maximum token probability at each generation step; a high e‑score indicates a peaked distribution (D‑type), while a low e‑score signals a flatter one (E‑type). The ATVD quantifies the overall divergence between two probability distributions, and ATVD‑step measures the per‑step divergence between P_token and P_task.
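The metrics above can be sketched directly from their descriptions. The following is a minimal illustration, not the paper's reference implementation; the exact definitions may differ in detail.

```python
import numpy as np

def e_score(step_dists):
    """Average of the per-step maximum token probability.
    High values indicate a peaked, D-type distribution."""
    return float(np.mean([max(p) for p in step_dists]))

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

def atvd_step(step_dists, p_task):
    """Average per-step TVD between P_token and the target P_task."""
    return float(np.mean([tvd(p, p_task) for p in step_dists]))

# Toy example: a peaked (D-like) model against the "extreme" task distribution.
p_task = [0.1, 0.7, 0.1, 0.1]
peaked_steps = [[0.02, 0.95, 0.02, 0.01], [0.01, 0.97, 0.01, 0.01]]
print(e_score(peaked_steps))            # ~0.96: strongly D-type
print(atvd_step(peaked_steps, p_task))  # ~0.26: far from P_task at each step
```

A flatter, E-type model would yield a lower e-score and an ATVD-step close to zero on the same target.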

The experimental methodology consists of two parts. First, controlled simulations with explicitly specified discrete distributions are conducted. Two synthetic tasks are used: an “extreme” distribution (e.g., {1:0.1, 2:0.7, 3:0.1, 4:0.1}) and a “flat” distribution (approximately uniform over nine categories). For each model‑task pair, ten independent sampling runs of 100 tokens each are performed, and the resulting P_token, P_result (empirical frequency), and the defined metrics are recorded. Results show that all models struggle to perfectly reproduce P_task; ATVD(P_task, P_result) hovers around 0.10 for the extreme task and drops below 0.04 for the flat task. However, ATVD(P_token, P_result) remains below 0.04 across the board, confirming that the token‑level distribution directly drives the final output. D‑models exhibit high e‑scores (≈0.7–0.8) and large ATVD‑step values, indicating that they allocate most probability mass to a single token at each step, thereby oversampling high‑probability items. E‑models have lower e‑scores (≈0.4) and small ATVD‑step, reflecting a more faithful adherence to the prescribed P_task.
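The simulation protocol (ten runs of 100 samples against an explicit target distribution, then comparing empirical frequencies to the target) can be mimicked as below. The stand-in P_token values are hypothetical, chosen only to model a slightly mis-calibrated sampler; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "extreme" task distribution from the paper's simulation setup.
p_task = {1: 0.1, 2: 0.7, 3: 0.1, 4: 0.1}
categories = list(p_task)
target = np.array(list(p_task.values()))

# Hypothetical P_token of a model that mildly oversamples the mode.
p_token = [0.07, 0.79, 0.07, 0.07]

run_tvds = []
for _ in range(10):                       # ten independent runs
    samples = rng.choice(categories, size=100, p=p_token)
    counts = np.array([(samples == c).sum() for c in categories])
    p_result = counts / counts.sum()      # empirical frequency P_result
    run_tvds.append(0.5 * np.abs(target - p_result).sum())

atvd = float(np.mean(run_tvds))           # ATVD(P_task, P_result) over runs
print(atvd)
```

With only 100 samples per run, some gap between P_result and P_task is expected from sampling noise alone, which is why the paper averages over multiple runs before comparing models.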

Second, downstream evaluations assess the practical impact of these properties. In a code‑generation benchmark, E‑models achieve higher correctness rates (≈87% vs. 73% for D‑models) while maintaining moderate solution diversity. D‑models generate more varied code snippets but at the cost of correctness. In a recommendation scenario, D‑models increase exposure diversity (12% more unique items shown) whereas E‑models improve click‑through prediction accuracy (4.2% uplift). These findings illustrate a systematic diversity–stability trade‑off: D‑models favor exploration and variety, E‑models favor consistency and alignment with user‑defined relevance.

The authors also probe internal mechanisms to explain the observed behaviors. Layer‑wise analysis of temperature scaling and attention weight distributions reveals that D‑models tend to increase temperature sharply in higher layers and exhibit highly skewed attention heads, concentrating probability on a few tokens. E‑models maintain more uniform temperature across layers and display balanced attention patterns. The differences are linked to training choices: D‑models often employ loss functions that heavily penalize low‑probability tokens, encouraging a “max‑likelihood” focus, while E‑models incorporate regularization (layer normalization, weight decay) and KL‑divergence‑based objectives that promote smoother probability surfaces.

In conclusion, the paper demonstrates that while LLMs cannot perfectly reproduce arbitrary target distributions, their sampling behavior can be categorized into two distinct regimes with predictable trade‑offs. Practitioners can select a D‑model when the application benefits from high diversity (creative writing, brainstorming, exploratory recommendation) and an E‑model when stability and fidelity to a known relevance distribution are paramount (search ranking, personalized recommendation, accurate code synthesis). Moreover, the study suggests that fine‑grained control over temperature, attention regularization, and loss design can shift a single model along the diversity–stability spectrum, offering a practical toolkit for aligning LLM sampling with real‑world uncertainty.
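The suggestion that temperature can shift a model along the diversity-stability spectrum is easy to see in isolation: dividing logits by a temperature before the softmax sharpens the distribution (D-like, higher e-score) for t < 1 and flattens it (E-like, lower e-score) for t > 1. The logits below are arbitrary illustrative values.

```python
import numpy as np

def softmax_with_temperature(logits, t):
    """Softmax over logits scaled by temperature t.
    t < 1 sharpens the distribution; t > 1 flattens it."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.2]    # arbitrary example logits
for t in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, t)
    # The max probability (a one-step e-score) falls as t rises.
    print(t, round(float(p.max()), 3))
```

This is only the sampling-time knob; the paper's point is that training-time choices (loss design, attention regularization) set where a model sits on the spectrum before any temperature adjustment.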

