Temperature scaling is a simple method for controlling the uncertainty of probabilistic models. It is mostly used in two contexts: improving the calibration of classifiers and tuning the stochasticity of large language models (LLMs). In both cases, temperature scaling is the most popular method for the job. Despite this popularity, a rigorous theoretical analysis of its properties has remained elusive. We investigate some of these properties here. For classification, we show that increasing the temperature increases the uncertainty of the model in a very general sense (and in particular increases its entropy). For LLMs, however, we challenge the common claim that increasing the temperature increases diversity. Furthermore, we introduce two new characterisations of temperature scaling. The first is geometric: the tempered model is shown to be the information projection of the original model onto the set of models with a given entropy. The second clarifies the role of temperature scaling as a submodel of more general linear scalers such as matrix scaling and Dirichlet calibration: we show that temperature scaling is the only linear scaler that does not change the hard predictions of the model.
Reliable uncertainty quantification in deep learning remains an open problem, poised between strong theoretical limits (Foygel Barber et al., 2021) and a flurry of proposed solutions (see e.g. Silva Filho et al., 2023; Angelopoulos et al., 2023; Ulmer et al., 2023; Papamarkou et al., 2024 for presentations of the different paradigms at play).
In a seminal paper, Guo et al. (2017) advocated the use of an old and very simple uncertainty quantification method called temperature scaling. This technique uses a single scalar parameter (the temperature) to tune the confidence of a trained neural network. In the complex landscape of uncertainty quantification, temperature scaling has established itself as an immensely popular method. Its uses range from calibrating models in industrial machine learning (it is implemented in the Scikit-learn library; Pedregosa et al., 2011) to controlling inference in large language models (LLMs): it is pervasive in virtually all LLMs, and is explicitly mentioned in the technical reports of GPT-4 (Achiam et al., 2023), Gemini (Gemini Team, 2025), DeepSeek (Liu et al., 2024), and Mistral (Rastogi et al., 2025). Other uncertainty quantification techniques often use temperature scaling as a building block (e.g. Berta et al., 2025a; Gibbs et al., 2025).
In spite of this popularity, there is a surprising lack of theoretical investigations of temperature scaling. Exceptions include Clarté et al. (2023a; b), who studied in particular its asymptotic behaviour under model misspecification, Dabah and Tirer (2025), who looked at the interplay between temperature scaling and conformal prediction, and Berta et al. (2025b), who studied in which cases it can (and cannot) be optimal. This relative absence may be due to the fact that the model is quite simple and resembles many well-studied models in statistics, machine learning, and statistical physics. Nevertheless, we believe a thorough yet elementary inspection of temperature scaling to be both timely and interesting. After reviewing and revisiting temperature scaling in Section 2 (both for classification and LLMs), our main contributions are:
• In Section 3, we show that for classification, increasing the temperature has the expected effect of increasing the uncertainty of the model. However, we highlight that this may not be the case for LLMs.
• We provide in Section 4 a geometric interpretation of temperature scaling as an information projection: a tempered model is the model closest to the original one that has the required level of entropy.
• An important property of temperature scaling is that it is accuracy-preserving: it does not change the ordering of the classes. We show in Section 5 that this is actually a defining property, in the sense that it is the only accuracy-preserving linear scaler.
Notation. Vectors are denoted by boldfaces. We denote by $\Delta_K = \{\boldsymbol{\pi} \in [0, 1]^K \mid \pi_1 + \ldots + \pi_K = 1\}$ the $K$-simplex and by $H$ the Gibbs-Shannon entropy of a discrete distribution.
2 What is temperature scaling?
We start with a pretrained model $f : \mathcal{X} \to \mathbb{R}^K$, where $K$ is the number of classes (for instance, a neural network trained on a classification problem). The outputs of this model are the logits, i.e. the softmax preactivations $\mathbf{z} = f(\mathbf{x})$. Standard predictive probabilities are then obtained by applying a softmax layer:
$$p(y|\mathbf{z}) = \text{Categorical}(y|\boldsymbol{\pi}), \quad \text{with} \quad \boldsymbol{\pi} = \text{Softmax}(\mathbf{z}),$$
where $y \in \{1, \ldots, K\}$ is the class label and
$$\text{Softmax}(\mathbf{z}) = \left( \frac{e^{z_1}}{e^{z_1} + \ldots + e^{z_K}}, \, \ldots, \, \frac{e^{z_K}}{e^{z_1} + \ldots + e^{z_K}} \right). \tag{3}$$
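For concreteness, here is a minimal NumPy sketch of these predictive probabilities (the function name and the example logits are ours, purely for illustration); it implements the softmax of equation (3) with the usual max-subtraction trick for numerical stability, which leaves the result unchanged.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map logits z in R^K to probabilities on the simplex Delta_K (eq. 3)."""
    z = z - z.max()            # subtracting the max does not change the result but avoids overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for a K = 3 class problem
z = np.array([2.0, 1.0, 0.1])
pi = softmax(z)
print(pi, pi.sum())            # e.g. [0.659 0.242 0.099], summing to 1
```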
Here, we mean “model” in a very broad sense: while f could be a standard classifier (with features x and label y), it could also be, for instance, an autoregressive language model (in that case, x would be the previous tokens, y the next token(s), and K the size of the vocabulary). Each $\pi_k \in [0, 1]$ corresponds to the probability that the model assigns to class k. Sometimes, one may not be too happy with these probabilities. In particular, in the context of classification tasks with neural networks, a widely reported issue is that these probabilities are generally overconfident, leading to poor calibration (Guo et al., 2017; Minderer et al., 2021) and difficult distillation (Hinton et al., 2015). For LLMs, it is desirable to let users sharpen or smooth these probabilities, giving them control over the amount of stochasticity of the generations, also referred to as “diversity” in this context. Temperature scaling is a simple and efficient method that allows us to slightly alter the probabilities in order to meet one of these goals.
The idea is to multiply all logits by a learnable parameter $\beta > 0$, called the inverse temperature in analogy to Boltzmann-Gibbs distributions in statistical physics. This leads to a new conditional distribution
$$p_\beta(y|\mathbf{z}) = \text{Categorical}(y|\boldsymbol{\pi}_\beta), \quad \text{with} \quad \boldsymbol{\pi}_\beta = \text{Softmax}(\beta \mathbf{z}). \tag{4}$$
Since $\beta$ is constrained to be positive, the coefficients of $\beta \mathbf{z}$ are ordered in the same way as those of $\mathbf{z}$. Thus, the hard predictions of $p_\beta$ will be identical to those of the original model. In particular, this implies that temperature scaling is accuracy-preserving.
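To illustrate this, the following small self-contained NumPy sketch (under our own naming, not code from any particular library) applies equation (4) for several inverse temperatures and checks that the argmax of the tempered probabilities never changes, while their entropy H grows as β decreases (i.e. as the temperature 1/β increases), in line with the classification result announced in Section 3.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def entropy(p):
    """Gibbs-Shannon entropy H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p))

z = np.array([2.0, 1.0, 0.1])            # logits of the original model (hypothetical values)
for beta in [2.0, 1.0, 0.5, 0.1]:        # inverse temperatures; the temperature is 1/beta
    pi_beta = softmax(beta * z)          # eq. (4): tempered probabilities
    print(f"beta={beta:4.1f}  argmax={pi_beta.argmax()}  H={entropy(pi_beta):.3f}")

# The argmax stays at class 0 for every beta > 0 (hard predictions are preserved),
# while H increases from ~0.45 nats (beta = 2) towards log(3) ~ 1.10 nats as beta -> 0.
```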