Softmax Function

Converts a vector of numbers into probabilities that sum to 1.

The Softmax Activation Function converts a vector of numbers into a vector of probabilities that sum to 1. It's applied to a model's outputs (or Logits) in Multi-class Classification.

It is the multi-class extension of the Sigmoid Activation Function.

The equation is:

\sigma(\vec{z})_{i} = \frac{e^{z_i}}{\sum\limits_{j=1}^{K}e^{z_j}}
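Since it is the multi-class extension of the sigmoid, the two-class case reduces to the sigmoid of the difference between the two logits:

\sigma(\vec{z})_{1} = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \operatorname{sigmoid}(z_1 - z_2)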

The intuition is that e^{z_i} is always positive and grows quickly, so larger logits get amplified and most of the probability mass concentrates on a single class. This makes softmax less useful for problems where an input might not contain any of the labels at all; for those, use multiple binary output columns with the Sigmoid Activation Function.

Howard et al. (2020), pp. 223-227
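A quick numpy sketch of that difference (the logits here are made-up values for illustration):

import numpy as np

logits = np.array([1.0, 2.0, 5.0])

# Softmax: the classes compete, the probabilities sum to 1, and the largest
# logit takes almost all of the mass (roughly 0.02, 0.05, 0.94 here).
print(np.exp(logits) / np.exp(logits).sum())

# Independent sigmoids: each class is judged on its own, so several labels
# (or none) can be likely at once (roughly 0.73, 0.88, 0.99 here).
print(1 / (1 + np.exp(-logits)))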

Code example:

import numpy as np
import pandas as pd

def softmax(x):
    # Exponentiate every element, then normalise so the outputs sum to 1.
    return np.exp(x) / np.exp(x).sum()

logits = np.array([-3.5, -2.37, 1.54, 5.23])  # some arbitrary numbers I made up that could have come out of a neural network
probs = softmax(logits)

df = pd.DataFrame({'logit': logits, 'prob': probs}, index=['woman', 'man', 'camera', 'tv'])
df
        logit      prob
woman   -3.50  0.000158
man     -2.37  0.000488
camera   1.54  0.024348
tv       5.23  0.975007

Softmax is part of the Categorical Cross-Entropy Loss: it is applied to the logits before passing the result to the Negative Log-Likelihood function.
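A minimal numpy sketch of that composition for a single sample, reusing the example logits from above (the stability shift by the max is just a common implementation detail):

import numpy as np

def log_softmax(x):
    # Shift by the max so the exponentials cannot overflow.
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([-3.5, -2.37, 1.54, 5.23])
target = 3  # index of the true class ('tv' in the table above)

# Categorical cross-entropy for one sample is the negative log-likelihood
# of the softmax probability assigned to the true class.
loss = -log_softmax(logits)[target]
print(loss)  # roughly 0.0253, i.e. -log(0.975007)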

Temperature Scaling can be applied to the logits to adjust the sharpness of the distribution: a low temperature makes it more confident about the highest values, while a high temperature makes it flatter and more diverse.
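A small sketch of how that could look, dividing the logits by a temperature before the softmax (the helper name and the example temperatures are just illustrative):

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # T < 1 sharpens the distribution, T > 1 flattens it; T = 1 is plain softmax.
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # stability shift
    return np.exp(scaled) / np.exp(scaled).sum()

logits = np.array([-3.5, -2.37, 1.54, 5.23])
print(softmax_with_temperature(logits, 0.5))  # sharper: 'tv' gets ~0.999
print(softmax_with_temperature(logits, 1.0))  # the original probabilities
print(softmax_with_temperature(logits, 5.0))  # flatter: roughly [0.09, 0.12, 0.26, 0.53]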

References

Jeremy Howard, Sylvain Gugger, and Soumith Chintala. Deep Learning for Coders with Fastai and PyTorch: AI Applications without a PhD. O'Reilly Media, Inc., Sebastopol, California, 2020. ISBN 978-1-4920-4552-6.