Temperature Scaling
Temperature scaling controls how "confident" a model is when making predictions by adjusting the sharpness of probability distributions produced by the Softmax Function.
Softmax is a function that converts a neural network's raw outputs (logits) into probabilities that sum to 1. For example, in a dog breed classifier, the model might output logits representing its confidence for different breeds, and the Softmax function would convert those into probability-like values:
| Breed | Logit | Softmax |
|---|---|---|
| Golden Retriever | 5.23 | 0.975007 |
| Labrador | 1.54 | 0.024348 |
| Husky | -2.37 | 0.000488 |
| German Shepherd | -3.50 | 0.000158 |
The basic Softmax formula is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
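As a quick sanity check, here is a small NumPy snippet (the breed names and logits come from the table above) that reproduces the softmax column:

```python
import numpy as np

# Logits from the dog breed example above
breeds = ["Golden Retriever", "Labrador", "Husky", "German Shepherd"]
logits = np.array([5.23, 1.54, -2.37, -3.50])

# Exponentiate and normalize; subtracting the max logit first avoids overflow
exp_logits = np.exp(logits - logits.max())
probs = exp_logits / exp_logits.sum()

for breed, p in zip(breeds, probs):
    print(f"{breed}: {p:.6f}")  # matches the softmax column, e.g. 0.975007
```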
By introducing a temperature parameter $T$, we can control how "confident" the model is in its predictions:

$$\text{softmax}(z; T)_i = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}$$
For numerical stability, we apply the temperature scaling to the logits before the Softmax (rather than re-scaling the probabilities afterwards):

```python
import numpy as np
from scipy.special import softmax  # numerically stable softmax

def scaled_softmax(logits, temperature=1.0):
    scaled_logits = np.asarray(logits) / temperature  # divide the logits by T
    return softmax(scaled_logits)
```
When $T = 1$, we have plain old Softmax, which maintains the original relative differences between probabilities.
$T < 1$ creates a sharper distribution, making the model more confident: the highest probability becomes even higher, and the lower probabilities become even lower. As $T$ approaches 0, the output becomes deterministic (100% confident in the top class).
When $T > 1$, the distribution becomes flatter and the model less confident: the differences between probabilities shrink and the predictions become more evenly distributed.
Try it for yourself to see how adjusting the temperature setting affects the Softmax probabilities in a dog breed classification problem:
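If you'd rather experiment in code, this short sketch (reusing the `scaled_softmax` function and the example logits from above) sweeps a few temperature values and prints the resulting distributions:

```python
import numpy as np

breeds = ["Golden Retriever", "Labrador", "Husky", "German Shepherd"]
logits = np.array([5.23, 1.54, -2.37, -3.50])

for T in (0.1, 0.5, 1.0, 2.0, 5.0):
    probs = scaled_softmax(logits, temperature=T)
    row = ", ".join(f"{b}={p:.3f}" for b, p in zip(breeds, probs))
    print(f"T={T}: {row}")

# At T=0.1 nearly all probability mass sits on Golden Retriever,
# while at T=5.0 it is spread much more evenly across the four breeds.
```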
Temperature Scaling in Language Models
A Language Model predicts one token at a time based on the previous tokens in the sequence: at each step it produces a Softmax probability distribution over the vocabulary and then randomly samples from that distribution.
The temperature parameter, therefore, affects how much randomness is injected at inference time.
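To make that concrete, here is a toy sketch of temperature sampling for a single next token; the five-word vocabulary and its logits are made up purely for illustration:

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(seed=42)

# Made-up vocabulary and next-token logits for illustration
vocab = ["the", "cat", "sat", "on", "mat"]
next_token_logits = np.array([2.0, 1.2, 0.3, -0.5, -1.0])

def sample_next_token(logits, temperature=1.0):
    # Scale the logits by 1/T, convert to probabilities, then sample
    probs = softmax(logits / temperature)
    return rng.choice(len(logits), p=probs)

token_id = sample_next_token(next_token_logits, temperature=0.7)
print(vocab[token_id])  # lower temperatures favor the top token more strongly
```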
$T = 0$: Deterministic; always selects the highest-probability token. In practice you can only approximate this with a very small temperature (since dividing by zero is undefined). Good for math, coding, and fact-based responses.
$T \approx 1$: Balanced between coherence and creativity. A common default for chat models. Maintains context while allowing natural variation.
$T > 1$: Increases randomness. Can generate more creative and diverse outputs, with a greater risk of incoherent or off-topic responses. (A sketch of setting the temperature at generation time follows this list.)
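As an illustration of how this looks in practice, here is a minimal sketch using the Hugging Face `transformers` generation API; the model name, prompt, and temperature value are just placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal language model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,   # sample from the softmax distribution instead of greedy decoding
    temperature=0.7,  # temperature applied to the logits before softmax
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```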
Temperature is typically applied during inference only; during training, models use $T = 1$ to learn the true probability distribution of the data. However, in the paper *Distilling the Knowledge in a Neural Network*, the authors experiment with higher temperatures during training so that a smaller "student" model can learn how a larger "teacher" model relates similar classes of items.
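For intuition, here is a minimal PyTorch sketch of a temperature-softened distillation loss in the spirit of that paper; the logits are assumed to come from a larger "teacher" and a smaller "student" model, and the $T^2$ factor follows the paper's scaling argument:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both distributions with the same temperature T
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the teacher's soft targets and the student's predictions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature**2
```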