Temperature Scaling

Temperature scaling controls how "confident" a model is when making predictions by adjusting the sharpness of probability distributions produced by the Softmax Function.

Softmax is a function that converts a neural network's raw outputs (logits) into probabilities that sum to 1. For example, in a dog breed classifier, the model might output logits representing its confidence for different breeds, and the Softmax function would convert those into probability-like values:

Breed             Logit   Softmax
Golden Retriever   5.23   0.975007
Labrador           1.54   0.024348
Husky             -2.37   0.000488
German Shepherd   -3.50   0.000158

The basic Softmax formula is:

Softmax(logits) = \frac{\exp(logits)}{\sum \exp(logits)}
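As a quick check, the probabilities in the table above can be reproduced with a few lines of NumPy (the breed names and logits are the ones from the example; everything else is just printing):

import numpy as np

breeds = ["Golden Retriever", "Labrador", "Husky", "German Shepherd"]
logits = np.array([5.23, 1.54, -2.37, -3.50])
probs = np.exp(logits) / np.exp(logits).sum()  # exponentiate, then normalize to sum to 1
for breed, p in zip(breeds, probs):
    print(f"{breed:<16} {p:.6f}")  # 0.975007, 0.024348, 0.000488, 0.000158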

By introducing a temperature parameter T, we can control how "confident" the model is in its predictions:

Softmax(logits, T) = \frac{\exp(logits/T)}{\sum \exp(logits/T)}

In practice, we apply the temperature to the logits before the Softmax (dividing the logits rather than rescaling the probabilities afterwards is also more numerically stable):

import numpy as np
from scipy.special import softmax  # SciPy's numerically stable softmax

def scaled_softmax(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply the standard Softmax
    scaled_logits = np.asarray(logits) / temperature
    return softmax(scaled_logits)

When T = 1, we have plain old Softmax, which maintains the original relative differences between probabilities.

T < 1 creates a sharper distribution, making the model more confident: the highest probability becomes even higher, and the lower probabilities become even lower. As T approaches 0, the distribution becomes effectively deterministic (all of the probability mass lands on the top class).

When T > 1, the distribution becomes flatter, making the model less confident: the differences between probabilities shrink and the predictions become more evenly distributed.

Try it for yourself to see how adjusting the temperature setting affects the Softmax probabilities in a dog breed classification problem:
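For example, here is a small sketch that reuses the scaled_softmax function from the snippet above on the same dog-breed logits (the temperatures 0.5, 1, 2 and 5 are arbitrary values chosen to show the trend):

breeds = ["Golden Retriever", "Labrador", "Husky", "German Shepherd"]
logits = [5.23, 1.54, -2.37, -3.50]
for T in (0.5, 1.0, 2.0, 5.0):
    probs = scaled_softmax(logits, temperature=T)
    print(f"T = {T}: " + ", ".join(f"{b} {p:.4f}" for b, p in zip(breeds, probs)))

At T = 0.5 nearly all of the probability mass sits on Golden Retriever, while at T = 5 it is spread much more evenly across the breeds.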

Temperature Scaling in Language Models

In a Language Model, which predicts one token at a time based on the previous tokens in the sequence, each token is chosen by forming a Softmax probability distribution over the vocabulary and then randomly sampling from that distribution.

The temperature parameter, therefore, affects how much randomness is injected at inference time.
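As a sketch of that sampling step, assuming the scaled_softmax function from earlier and a made-up four-token vocabulary with hypothetical logits:

import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the example is reproducible
vocab = ["the", "cat", "sat", "mat"]       # toy vocabulary (illustrative only)
next_token_logits = [2.0, 1.1, 0.3, -0.8]  # hypothetical logits for the next token

def sample_next_token(logits, temperature=1.0):
    probs = scaled_softmax(logits, temperature=temperature)  # temperature-scaled distribution
    return vocab[rng.choice(len(vocab), p=probs)]            # draw one token at random

for T in (0.2, 1.0, 2.0):
    print(f"T = {T}:", [sample_next_token(next_token_logits, T) for _ in range(8)])

At T = 0.2 the samples are dominated by "the", the highest-logit token; at T = 2.0 the other tokens appear far more often.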

T = 0: Deterministic, always selects the highest-probability token. Dividing by zero is undefined, so in practice T = 0 is usually implemented as greedy decoding (take the argmax directly) or approximated with a very small temperature. Good for math, coding, and fact-based responses.

T ≈ 0.7: Balanced between coherence and creativity. Common default for chat models. Maintains context while allowing natural variation.

T > 1: Increases randomness. Can generate more creative/diverse outputs, at the risk of incoherent or off-topic responses.

Temperature is typically applied during inference only. During training, models use T = 1 to learn the true probability distribution of the data. However, in the paper Distilling the Knowledge in a Neural Network, Hinton et al. use a higher temperature during training so that a smaller student model can learn from the teacher's softened probabilities, which carry information about how similar the teacher considers different classes to be.