The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Temperature scaling is a simple method for controlling the uncertainty of probabilistic models. It is mostly used in two contexts: improving the calibration of classifiers and tuning the stochasticity of large language models (LLMs). In both cases, temperature scaling is the most popular method for the job. Despite its popularity, a rigorous theoretical analysis of the properties of temperature scaling has remained elusive. We investigate some of these properties here. For classification, we show that increasing the temperature increases the uncertainty of the model in a very general sense (and in particular increases its entropy). For LLMs, however, we challenge the common claim that increasing temperature increases diversity. Furthermore, we introduce two new characterisations of temperature scaling. The first one is geometric: the tempered model is shown to be the information projection of the original model onto the set of models with a given entropy. The second characterisation clarifies the role of temperature scaling as a submodel of more general linear scalers such as matrix scaling and Dirichlet calibration: we show that temperature scaling is the only linear scaler that does not change the hard predictions of the model.


💡 Research Summary

The paper provides a rigorous theoretical treatment of temperature scaling, a simple yet ubiquitous technique for adjusting the uncertainty of probabilistic models. It focuses on two major application domains: (i) calibration of classification models and (ii) stochastic sampling in large language models (LLMs). The authors first formalize temperature scaling as the operation that divides the logits z by a scalar τ>0 before applying the soft‑max function, yielding probabilities p_i(τ)=exp(z_i/τ)/∑_j exp(z_j/τ). They then prove a fundamental monotonicity property: as τ increases, the entropy H(p(τ)) of the output distribution strictly increases. This is shown by interpreting p(τ) as the information‑projection (I‑projection) of the original distribution onto the set of distributions with a prescribed entropy level. In other words, temperature scaling is the most conservative way to raise uncertainty while staying as close as possible (in KL‑divergence) to the original model.
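The tempered soft-max and the monotonicity of entropy in τ can be illustrated with a short numeric sketch (the logits below are invented for illustration and are not from the paper):

```python
import numpy as np

def tempered_softmax(z, tau):
    """Temperature-scaled soft-max: p_i(tau) = exp(z_i/tau) / sum_j exp(z_j/tau)."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()  # subtract the max for numerical stability (does not change the result)
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

logits = np.array([3.0, 1.0, 0.2, -1.0])  # illustrative logits
for tau in (0.5, 1.0, 2.0, 5.0):
    p = tempered_softmax(logits, tau)
    print(f"tau={tau:>3}: entropy={entropy(p):.3f}")
```

Running this shows the entropy growing with τ while the argmax class stays the same, consistent with the monotonicity property stated above.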

The second contribution challenges the common belief that raising temperature automatically yields more diverse text from LLMs. By analyzing token‑level probability distributions, the authors distinguish between “uncertainty” (measured by entropy) and “diversity” (measured by the number of distinct tokens actually sampled, e.g., in top‑k or nucleus sampling). Empirical experiments on several GPT‑style models show that while entropy can increase substantially when τ is raised from 0.7 to 1.5, the count of unique tokens in generated text changes only marginally. This phenomenon is explained by the heavy‑tailed Zipfian nature of token frequencies: a few high‑probability tokens dominate the mass, so spreading probability mass more evenly does not necessarily introduce new token types. Consequently, the authors argue that temperature should not be conflated with diversity; additional sampling controls are required to achieve genuine variety.
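The entropy-versus-diversity distinction can be probed with a toy experiment on a synthetic Zipfian vocabulary. This sketch is purely illustrative and is not the paper's experimental setup: the vocabulary size, Zipf exponent, and sequence length are arbitrary choices, and sampling is done i.i.d. from a fixed token distribution rather than from an actual LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic Zipfian token distribution over a vocabulary (illustrative only).
vocab_size = 10_000
ranks = np.arange(1, vocab_size + 1)
zipf_logits = -np.log(ranks)  # logits of a Zipf(s=1) distribution

def tempered_probs(logits, tau):
    """Apply temperature scaling to logits and normalise via soft-max."""
    z = logits / tau
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

n_samples = 200  # length of a "generated" sequence
for tau in (0.7, 1.5):
    p = tempered_probs(zipf_logits, tau)
    draws = rng.choice(vocab_size, size=n_samples, p=p)
    print(f"tau={tau}: entropy={entropy(p):.2f} nats, "
          f"unique tokens={len(np.unique(draws))}/{n_samples}")
```

Comparing the two printed lines shows how entropy and the count of distinct sampled tokens can be measured separately, which is the distinction the authors draw.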

The third part introduces two novel characterizations of temperature scaling. Geometrically, the tempered model is the I‑projection of the original model onto the manifold of distributions with a fixed entropy, providing a clean variational interpretation. Algebraically, the paper situates temperature scaling within a broader class of linear calibrators (matrix scaling, Dirichlet calibration, etc.) that apply class‑specific scaling factors a_i and offsets b_i to logits. The authors prove that temperature scaling is the unique linear scaler that leaves the hard predictions (the argmax class) unchanged. In other words, any linear transformation of the logits that preserves the decision rule for every input must apply the same positive scaling to all classes, with no class‑specific offset; this is precisely temperature scaling.
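The uniqueness claim can be illustrated numerically: a general per-class linear scaler z_i ↦ a_i·z_i + b_i can flip the argmax, whereas dividing all logits by a common τ > 0 never can. The logits and coefficients below are invented for illustration:

```python
import numpy as np

def temperature_scale(z, tau):
    """Temperature scaling: divide all logits by a common tau > 0."""
    return np.asarray(z, dtype=float) / tau

def linear_scale(z, a, b):
    """Per-class linear scaler z_i -> a_i * z_i + b_i (matrix/Dirichlet-style)."""
    return a * np.asarray(z, dtype=float) + b

logits = np.array([2.0, 1.5, -0.5])  # predicted class: 0
a = np.array([0.5, 2.0, 1.0])        # class-specific slopes (illustrative)
b = np.array([0.0, 0.5, 0.0])        # class-specific offsets (illustrative)

print(np.argmax(logits))                          # 0
print(np.argmax(temperature_scale(logits, 3.0)))  # 0: hard prediction preserved
print(np.argmax(linear_scale(logits, a, b)))      # 1: a general linear scaler can flip it
```

Dividing by a common positive τ is order-preserving on the logits, so the argmax can never move; class-specific slopes or offsets break this order preservation, which is why they can change hard predictions.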

From a practical standpoint, the results have several implications. For calibration, temperature scaling offers a safe way to improve probabilistic reliability without altering the classifier’s predictions, making it attractive in safety‑critical settings. For LLMs, the work cautions against relying solely on temperature to boost creativity; practitioners should combine temperature with top‑k, nucleus, or Dirichlet‑based sampling to truly expand the set of generated ideas. The authors acknowledge limitations: the empirical validation is limited to a handful of pretrained models, and the theoretical results assume exact soft‑max behavior without considering numerical issues or model misspecification.

In conclusion, the paper elevates temperature scaling from an ad‑hoc heuristic to a mathematically principled tool. It establishes that (1) temperature monotonically raises entropy, (2) higher temperature does not guarantee higher lexical diversity in language generation, and (3) temperature scaling uniquely preserves hard predictions while being an information‑theoretic projection onto an entropy‑constrained set. These insights deepen our understanding of uncertainty control in modern AI systems and open avenues for future work that blends temperature with more expressive, possibly non‑linear, calibration schemes to simultaneously optimize accuracy, calibration, diversity, and fairness.

