Temperature Scaling

Temperature scaling controls how "confident" a model is when making predictions by adjusting the sharpness of probability distributions produced by the Softmax Function.

Softmax is a function that converts a neural network's raw outputs (logits) into probabilities that sum to 1. For example, in a dog breed classifier, the model might output logits representing its confidence for different breeds, and the Softmax function would convert those into probability-like values:

Breed             Logit   Softmax
Golden Retriever   5.23   0.975007
Labrador           1.54   0.024348
Husky             -2.37   0.000488
German Shepherd   -3.50   0.000158

The basic Softmax formula is:

Softmax(logits) = \frac{\exp(logits)}{\sum \exp(logits)}
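As a quick check, the probabilities in the table above can be reproduced with a few lines of NumPy (the breed names and logits are the ones from the example; everything else is just printing):

import numpy as np

breeds = ["Golden Retriever", "Labrador", "Husky", "German Shepherd"]
logits = np.array([5.23, 1.54, -2.37, -3.50])
probs = np.exp(logits) / np.exp(logits).sum()  # exponentiate, then normalize to sum to 1
for breed, p in zip(breeds, probs):
    print(f"{breed:<16} {p:.6f}")  # 0.975007, 0.024348, 0.000488, 0.000158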

By introducing a temperature parameter T, we can control how "confident" the model is in its predictions:

Softmax(logits, T) = \frac{\exp(logits/T)}{\sum \exp(logits/T)}

In practice, we apply the temperature to the logits before the Softmax (dividing the logits rather than rescaling the probabilities afterwards is also more numerically stable):

import numpy as np
from scipy.special import softmax  # SciPy's numerically stable softmax

def scaled_softmax(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply the standard Softmax
    scaled_logits = np.asarray(logits) / temperature
    return softmax(scaled_logits)

When T = 1, we have plain old Softmax, which maintains the original relative differences between probabilities.

T < 1 creates a sharper distribution, making the model more confident: the highest probability becomes even higher, and the lower probabilities become even lower. As T approaches 0, the distribution becomes effectively deterministic (all of the probability mass lands on the top class).

When T > 1, the distribution becomes flatter, making the model less confident: the differences between probabilities shrink and the predictions become more evenly distributed.

Try it for yourself to see how adjusting the temperature setting affects the Softmax probabilities in a dog breed classification problem:
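For example, here is a small sketch that reuses the scaled_softmax function from the snippet above on the same dog-breed logits (the temperatures 0.5, 1, 2 and 5 are arbitrary values chosen to show the trend):

breeds = ["Golden Retriever", "Labrador", "Husky", "German Shepherd"]
logits = [5.23, 1.54, -2.37, -3.50]
for T in (0.5, 1.0, 2.0, 5.0):
    probs = scaled_softmax(logits, temperature=T)
    print(f"T = {T}: " + ", ".join(f"{b} {p:.4f}" for b, p in zip(breeds, probs)))

At T = 0.5 nearly all of the probability mass sits on Golden Retriever, while at T = 5 it is spread much more evenly across the breeds.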

Temperature Scaling in Language Models

In a Language Model, which predicts one token at a time based on the previous tokens in the sequence, each token is chosen by forming a Softmax probability distribution over the vocabulary and then randomly sampling from that distribution.

The temperature parameter, therefore, affects how much randomness is injected at inference time.
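As a sketch of that sampling step, assuming the scaled_softmax function from earlier and a made-up four-token vocabulary with hypothetical logits:

import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the example is reproducible
vocab = ["the", "cat", "sat", "mat"]       # toy vocabulary (illustrative only)
next_token_logits = [2.0, 1.1, 0.3, -0.8]  # hypothetical logits for the next token

def sample_next_token(logits, temperature=1.0):
    probs = scaled_softmax(logits, temperature=temperature)  # temperature-scaled distribution
    return vocab[rng.choice(len(vocab), p=probs)]            # draw one token at random

for T in (0.2, 1.0, 2.0):
    print(f"T = {T}:", [sample_next_token(next_token_logits, T) for _ in range(8)])

At T = 0.2 the samples are dominated by "the", the highest-logit token; at T = 2.0 the other tokens appear far more often.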

T = 0: Deterministic, always selects the highest-probability token. Dividing by zero is undefined, so in practice T = 0 is usually implemented as greedy decoding (take the argmax directly) or approximated with a very small temperature. Good for math, coding, and fact-based responses.

T ≈ 0.7: Balanced between coherence and creativity. Common default for chat models. Maintains context while allowing natural variation.

T > 1: Increases randomness. Can generate more creative/diverse outputs, at the risk of incoherent or off-topic responses.

Temperature is typically applied during inference only. During training, models use T = 1 to learn the true probability distribution of the data. However, in the paper Distilling the Knowledge in a Neural Network, Hinton et al. use a higher temperature during training so that a smaller student model can learn from the teacher's softened probabilities, which carry information about how similar the teacher considers different classes to be.