# Softmax function

In mathematics, the softmax function, also known as softargmax[1] or the normalized exponential function,[2]:198 is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities. That is, prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1; after applying softmax, each component lies in the interval ${\displaystyle (0,1)}$ and the components add up to 1, so that they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities. Softmax is often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes.

The standard (unit) softmax function ${\displaystyle \sigma :\mathbb {R} ^{K}\to \mathbb {R} ^{K}}$ is defined by the formula

${\displaystyle \sigma (\mathbf {z} )_{j}={\frac {e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}}}}$    for j = 1, …, K and ${\displaystyle {\mathbf {z}}=(z_{1},\ldots ,z_{K})\in \mathbb {R} ^{K}}$

In words: we apply the standard exponential function to each element ${\displaystyle z_{j}}$ of the input vector ${\displaystyle {\mathbf {z}}}$ and normalize these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector ${\displaystyle \sigma ({\mathbf {z}})}$ is 1.

Instead of e, a different base b>0 can be used; choosing a larger value of b will create a probability distribution that is more concentrated around the positions of the largest input values. Writing ${\displaystyle b=e^{\beta }}$ or ${\displaystyle b=e^{-\beta }}$[a] (for real β)[b] yields the expressions:[c]

${\displaystyle \sigma (\mathbf {z} )_{j}={\frac {e^{\beta z_{j}}}{\sum _{k=1}^{K}e^{\beta z_{k}}}}}$    or    ${\displaystyle \sigma (\mathbf {z} )_{j}={\frac {e^{-\beta z_{j}}}{\sum _{k=1}^{K}e^{-\beta z_{k}}}}}$    for j = 1, …, K.

In some fields, the base is fixed, corresponding to a fixed scale,[d] while in others the parameter β is varied.
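The β-parameterized form above can be sketched in a few lines of Python (the helper name `softmax_beta` is illustrative, not standard):

```python
import math

def softmax_beta(z, beta=1.0):
    """Softmax with inverse temperature beta; beta = 1 recovers the standard softmax."""
    exps = [math.exp(beta * v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# A larger beta (a larger base b = e^beta) concentrates the distribution
# around the position of the largest input value.
print(softmax_beta([1.0, 2.0, 3.0], beta=1.0))   # moderately peaked
print(softmax_beta([1.0, 2.0, 3.0], beta=10.0))  # nearly one-hot at the last entry
```

With β = 0 every output equals 1/K, the uniform distribution of note [c].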

## Interpretations

### Smooth arg max

The name "softmax" is misleading; the function is not a smooth maximum (a smooth approximation to the maximum function), but is rather a smooth approximation to the arg max function: the function whose value is the index of the maximum. In fact, the term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", but the term "softmax" is conventional in machine learning.[3][4] This section uses the term "softargmax" to emphasize this interpretation.

Formally, instead of considering the arg max as a function with categorical output ${\displaystyle 1,\dots ,n}$ (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique max arg):

${\displaystyle \operatorname {arg\,max} (z_{1},\dots ,z_{n})=(y_{1},\dots ,y_{n})=(0,\dots ,0,1,0,\dots ,0),}$

where the output coordinate ${\displaystyle y_{i}=1}$ if and only if ${\displaystyle i}$ is the arg max of ${\displaystyle (z_{1},\dots ,z_{n})}$, meaning ${\displaystyle z_{i}}$ is the unique maximum value of ${\displaystyle (z_{1},\dots ,z_{n})}$. For example, in this encoding ${\displaystyle \operatorname {arg\,max} (1,5,10)=(0,0,1),}$ since the third argument is the maximum.

This can be generalized to multiple arg max values (multiple equal ${\displaystyle z_{i}}$ being the maximum) by dividing the 1 between all max args; formally 1/k where k is the number of arguments assuming the maximum. For example, ${\displaystyle \operatorname {arg\,max} (1,5,5)=(0,1/2,1/2),}$ since the second and third argument are both the maximum. In case all arguments are equal, this is simply ${\displaystyle \operatorname {arg\,max} (z,\dots ,z)=(1/n,\dots ,1/n).}$ Points z with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points.
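The tie-splitting one-hot encoding above can be written directly (a minimal Python sketch; the function name is ours):

```python
def onehot_argmax(z):
    """One-hot arg max; ties split the mass 1/k over the k maximal coordinates."""
    m = max(z)
    winners = [i for i, v in enumerate(z) if v == m]
    return [1.0 / len(winners) if i in winners else 0.0 for i in range(len(z))]

print(onehot_argmax([1, 5, 10]))  # [0.0, 0.0, 1.0]
print(onehot_argmax([1, 5, 5]))   # [0.0, 0.5, 0.5]
```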

With this representation, softargmax is now a smooth approximation of arg max: as ${\displaystyle \beta \to \infty }$, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z as ${\displaystyle \beta \to \infty }$, ${\displaystyle \sigma _{\beta }(\mathbf {z} )\to \operatorname {arg\,max} (\mathbf {z} ).}$ However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The failure to converge uniformly is because for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, ${\displaystyle \sigma _{\beta }(1,1.0001)\to (0,1),}$ but ${\displaystyle \sigma _{\beta }(1,0.9999)\to (1,0),}$ and ${\displaystyle \sigma _{\beta }(1,1)=1/2}$ for all inputs: the closer the points are to the singular set ${\displaystyle (x,x)}$, the slower they converge. However, softargmax does converge compactly on the non-singular set.
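The slow convergence near the singular set can be observed numerically. The sketch below (names ours) shifts the inputs by their maximum before exponentiating so that large β does not overflow; this is legitimate by the translation invariance discussed under Properties.

```python
import math

def softargmax(z, beta):
    m = max(z)  # shift by the max; the result is unchanged and exp() cannot overflow
    exps = [math.exp(beta * (v - m)) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Near the singular set (x, x) convergence is slow: the closer the two
# coordinates, the larger beta must be before the output is nearly one-hot.
for beta in (1, 1000, 100000):
    print(beta, softargmax([1.0, 1.0001], beta))
```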

Conversely, as ${\displaystyle \beta \to -\infty }$, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".

It is also the case that, for any fixed β, if one input ${\displaystyle z_{i}}$ is much larger than the others relative to the temperature, ${\displaystyle T=1/\beta }$, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

${\displaystyle \sigma (0,10):=\sigma _{1}(0,10)=\left(1/(1+e^{10}),e^{10}/(1+e^{10})\right)\approx (0.00005,0.99995)}$

However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

${\displaystyle \sigma _{1/100}(0,10)=\left(1/(1+e^{1/10}),e^{1/10}/(1+e^{1/10})\right)\approx (0.475,0.525).}$

As ${\displaystyle \beta \to \infty }$, temperature goes to zero, ${\displaystyle T=1/\beta \to 0}$, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.

### Probability theory

In probability theory, the output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.

### Statistical mechanics

In statistical mechanics, the softmax function is known as the Boltzmann distribution (or Gibbs distribution):[5] the index set ${\displaystyle \{1,\dots ,K\}}$ labels the microstates of the system; the inputs ${\displaystyle z_{i}}$ are the energies of the corresponding states; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).
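A minimal Python sketch of this correspondence, using the thermodynamic (minimum) sign convention ${\displaystyle e^{-\beta E}}$ from note [a] (function name ours):

```python
import math

def boltzmann(energies, beta):
    """Boltzmann weights exp(-beta * E_i), normalized by the partition function Z."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)  # the partition function
    return [w / Z for w in weights]

# Lower energy -> higher probability; larger beta (colder) sharpens the distribution.
print(boltzmann([0.0, 1.0, 2.0], beta=1.0))
print(boltzmann([0.0, 1.0, 2.0], beta=5.0))
```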

## Applications

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[2]:206–209[1] multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[6] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

${\displaystyle P(y=j\mid \mathbf {x} )={\frac {e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{j}}}{\sum _{k=1}^{K}e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{k}}}}}$

This can be seen as the composition of K linear functions ${\displaystyle \mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{1},\ldots ,\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{K}}$ and the softmax function (where ${\displaystyle \mathbf {x} ^{\mathsf {T}}\mathbf {w} }$ denotes the inner product of ${\displaystyle \mathbf {x} }$ and ${\displaystyle \mathbf {w} }$). The operation is equivalent to applying a linear operator defined by ${\displaystyle \mathbf {w} }$ to vectors ${\displaystyle \mathbf {x} }$, thus transforming the original, possibly high-dimensional, input into vectors in a K-dimensional space ${\displaystyle \mathbb {R} ^{K}}$.
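A short NumPy sketch of this composition, with one weight column per class (the weights below are made-up illustrative numbers):

```python
import numpy as np

def predict_proba(x, W):
    """P(y=j|x): softmax over the K scores x^T w_j; W holds one column per class."""
    scores = x @ W          # K distinct linear functions of x
    scores -= scores.max()  # shift for numerical stability (softmax is unchanged)
    exps = np.exp(scores)
    return exps / exps.sum()

# Illustrative weights for K=3 classes on a 2-dimensional input.
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
x = np.array([2.0, 1.0])
p = predict_proba(x, W)
print(p, p.sum())
```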

### Neural networks

The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

${\displaystyle {\frac {\partial }{\partial q_{k}}}\sigma ({\textbf {q}},i)=\cdots =\sigma ({\textbf {q}},i)(\delta _{ik}-\sigma ({\textbf {q}},k))}$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
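The derivative formula can be checked numerically. The following Python sketch builds the full Jacobian ${\displaystyle J_{ik}=\sigma _{i}(\delta _{ik}-\sigma _{k})}$ and compares it with central finite differences:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = sigma_i * (delta_ik - sigma_k), from the formula above."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(q)

# Central finite differences: column k approximates d(sigma)/d(q_k).
eps = 1e-6
num = np.array([(softmax(q + eps * np.eye(3)[k]) - softmax(q - eps * np.eye(3)[k]))
                / (2 * eps) for k in range(3)]).T
print(np.allclose(J, num, atol=1e-6))  # True
```

Each column of the Jacobian sums to zero, reflecting the constraint that the outputs always sum to 1.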

See Multinomial logit for a probability model which uses the softmax activation function.

### Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[7]

${\displaystyle P_{t}(a)={\frac {\exp(q_{t}(a)/\tau )}{\sum _{i=1}^{n}\exp(q_{t}(i)/\tau )}}{\text{,}}}$

where the action value ${\displaystyle q_{t}(a)}$ corresponds to the expected reward of following action a and ${\displaystyle \tau }$ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (${\displaystyle \tau \to \infty }$), all actions have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature (${\displaystyle \tau \to 0^{+}}$), the probability of the action with the highest expected reward tends to 1.
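A minimal Python sketch of this action-selection rule (names ours):

```python
import math

def action_probs(q_values, tau):
    """Softmax action selection with temperature tau over estimated action values."""
    exps = [math.exp(q / tau) for q in q_values]
    s = sum(exps)
    return [e / s for e in exps]

q = [1.0, 2.0, 3.0]
print(action_probs(q, tau=100.0))  # high temperature: nearly uniform
print(action_probs(q, tau=0.1))    # low temperature: nearly greedy
```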

## Properties

Geometrically the softmax function maps the vector space ${\displaystyle \mathbb {R} ^{K}}$ to the interior of the standard ${\displaystyle (K-1)}$-simplex, cutting the dimension by one (the range is a ${\displaystyle (K-1)}$-dimensional simplex in ${\displaystyle K}$-dimensional space), because of the linear constraint that all outputs sum to 1: the range lies on a hyperplane.

Along the main diagonal ${\displaystyle (x,x,\dots ,x),}$ softmax is just the uniform distribution on outputs, ${\displaystyle (1/n,\dots ,1/n)}$: equal scores yield equal probabilities.

More generally, softmax is invariant under translation by the same value in each coordinate: adding ${\displaystyle \mathbf {c} =(c,\dots ,c)}$ to the inputs ${\displaystyle \mathbf {z} }$ yields ${\displaystyle \sigma (\mathbf {z} +\mathbf {c} )=\sigma (\mathbf {z} )}$, because it multiplies each exponent by the same factor, ${\displaystyle e^{c}}$ (because ${\displaystyle e^{z_{i}+c}=e^{z_{i}}\cdot e^{c}}$), so the ratios do not change:

${\displaystyle \sigma (\mathbf {z} +\mathbf {c} )_{j}={\frac {e^{z_{j}+c}}{\sum _{k=1}^{K}e^{z_{k}+c}}}={\frac {e^{z_{j}}\cdot e^{c}}{\sum _{k=1}^{K}e^{z_{k}}\cdot e^{c}}}=\sigma (\mathbf {z} )_{j}.}$
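This translation invariance is easy to verify numerically, and it is also the standard trick for computing softmax stably: subtracting the maximum input changes nothing mathematically but keeps the exponentials from overflowing. A Python sketch:

```python
import math

def softmax(z):
    s = sum(math.exp(v) for v in z)
    return [math.exp(v) / s for v in z]

def stable_softmax(z):
    m = max(z)  # shifting every input by -max(z) leaves the result unchanged
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [1.0, 2.0, 3.0]
print(softmax(z))
print(softmax([v + 5.0 for v in z]))       # same output, up to rounding
print(stable_softmax([1000.0, 1001.0]))    # the naive version would overflow here
```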

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: ${\displaystyle \mathbf {c} }$ where ${\textstyle c={\frac {1}{n}}\sum z_{i}}$); then the softmax takes the hyperplane of points that sum to zero, ${\textstyle \sum z_{i}=0}$, to the open simplex of positive values that sum to 1, ${\textstyle \sum \sigma (\mathbf {z} )_{i}=1}$, analogously to how the exponential function takes 0 to 1 (${\displaystyle e^{0}=1}$) and is positive.

By contrast, softmax is not invariant under scaling. For instance, ${\displaystyle \sigma {\bigl (}(0,1){\bigr )}={\bigl (}1/(1+e),e/(1+e){\bigr )}}$ but ${\displaystyle \sigma {\bigl (}(0,2){\bigr )}={\bigl (}1/(1+e^{2}),e^{2}/(1+e^{2}){\bigr )}.}$

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say ${\displaystyle z_{2}=0}$), so ${\displaystyle e^{0}=1}$, and the other variable can vary, denote it ${\displaystyle z_{1}=x}$, so ${\textstyle e^{z_{1}}/\sum _{k=1}^{2}e^{z_{k}}=e^{x}/(e^{x}+1),}$ the standard logistic function, and ${\textstyle e^{z_{2}}/\sum _{k=1}^{2}e^{z_{k}}=1/(e^{x}+1),}$ its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line ${\displaystyle (x/2,-x/2)}$, with outputs ${\displaystyle e^{x/2}/(e^{x/2}+e^{-x/2})=e^{x}/(e^{x}+1)}$ and ${\displaystyle e^{-x/2}/(e^{x/2}+e^{-x/2})=1/(e^{x}+1).}$
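A quick numerical check that softmax on ${\displaystyle (x,0)}$ reduces to the standard logistic function and its complement (helper names ours):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(z1, z2):
    e1, e2 = math.exp(z1), math.exp(z2)
    return e1 / (e1 + e2), e2 / (e1 + e2)

# softmax on (x, 0) recovers the logistic function and its complement.
x = 1.7
p, q = softmax2(x, 0.0)
print(p, logistic(x))       # equal
print(q, 1 - logistic(x))   # equal
```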

The softmax function is also the gradient of the LogSumExp function, a smooth maximum; defining:

${\displaystyle \mathrm {LSE} (z_{1},\dots ,z_{n})=\log \left(\exp(z_{1})+\cdots +\exp(z_{n})\right),}$

the partial derivatives are:

${\textstyle \partial _{j}\mathrm {LSE} (\mathbf {z} )=\exp(z_{j}){\Big /}\sum _{k}\exp(z_{k}).}$

Collecting these partial derivatives into the gradient vector yields the softmax.
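This gradient relationship can be verified with central finite differences (a Python sketch):

```python
import math

def lse(z):
    return math.log(sum(math.exp(v) for v in z))

def softmax(z):
    s = sum(math.exp(v) for v in z)
    return [math.exp(v) / s for v in z]

# Numerical gradient of LSE via central differences.
z = [0.5, 1.0, -2.0]
eps = 1e-6
grad = []
for i in range(len(z)):
    zp = [v + (eps if i == j else 0.0) for j, v in enumerate(z)]
    zm = [v - (eps if i == j else 0.0) for j, v in enumerate(z)]
    grad.append((lse(zp) - lse(zm)) / (2 * eps))

print(all(abs(g - s) < 1e-6 for g, s in zip(grad, softmax(z))))  # True
```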

## History

The softmax function was used in statistical mechanics as the Boltzmann distribution in the foundational paper Boltzmann (1868), formalized and popularized in the influential textbook Gibbs (1902).

The use of the softmax in decision theory is credited to Luce (1959),[8] who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.

In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, Bridle (1990a)[8] and Bridle (1990b):[3]

We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.[9]

For any input, the outputs must all be positive and they must sum to unity. ...

Given a set of unconstrained values, ${\displaystyle V_{j}(x)}$, we can ensure both conditions by using a Normalised Exponential transformation:

${\displaystyle Q_{j}(x)=e^{V_{j}(x)}/\sum _{k}e^{V_{k}(x)}}$

This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the ‘winner-take-all’ operation of picking the maximum value. For this reason we like to refer to it as softmax.[10]

## Example

If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the '4' was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum. But note: softmax is not scale invariant, so if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6), the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1, softmax in fact de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6 = 0.25).

Computation of this example using simple Python code:

>>> import math
>>> z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
>>> z_exp = [math.exp(i) for i in z]
>>> print([round(i, 2) for i in z_exp])
[2.72, 7.39, 20.09, 54.6, 2.72, 7.39, 20.09]
>>> sum_z_exp = sum(z_exp)
>>> print(round(sum_z_exp, 2))
114.98
>>> softmax = [i / sum_z_exp for i in z_exp]
>>> print([round(i, 3) for i in softmax])
[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]


Another example in Python, using NumPy:

>>> import numpy as np
>>> z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
>>> softmax = lambda z: np.exp(z) / np.sum(np.exp(z))
>>> softmax(z)
array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054,
0.06426166, 0.1746813 ])


Here is an example of Julia code:

julia> A = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
7-element Array{Float64,1}:
1.0
2.0
3.0
4.0
1.0
2.0
3.0

julia> exp.(A) ./ sum(exp.(A))
7-element Array{Float64,1}:
0.0236405
0.0642617
0.174681
0.474833
0.0236405
0.0642617
0.174681


Here is an example of R code:

> z <- c(1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0)
> softmax <- exp(z)/sum(exp(z))
> softmax
[1] 0.02364054 0.06426166 0.17468130 0.47483300 0.02364054 0.06426166 0.17468130


## Notes

1. ^ Positive β corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative β corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the Gibbs distribution, interpreting β as coldness.
2. ^ The notation β is for the thermodynamic beta, which is inverse temperature: ${\displaystyle \beta =1/T}$, ${\displaystyle T=1/\beta .}$
3. ^ For ${\displaystyle \beta =0}$ (coldness zero, infinite temperature), ${\displaystyle b=e^{\beta }=e^{0}=1}$, and this becomes the constant function ${\displaystyle (1/n,\dots ,1/n)}$, corresponding to the discrete uniform distribution.
4. ^ In statistical mechanics, fixing β is interpreted as having coldness and temperature of 1.

## References

1. ^ Goodfellow, Bengio & Courville 2016, p. 184.
2. ^ a b Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
3. ^ a b Sako, Yusaku (2018-06-02). "Is the term "softmax" driving you nuts?". Medium.
4. ^ Goodfellow, Bengio & Courville 2016, pp. 183–184: The name “softmax” can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the arg max. The corresponding soft version of the maximum function is ${\displaystyle \operatorname {softmax} (\mathbf {z} )^{\top }\mathbf {z} }$. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.
5. ^ LeCun et al., p. 7.
6. ^
7. ^ Sutton, R. S. and Barto A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. Softmax Action Selection
8. ^ a b Gao & Pavel 2017, p. 1.
9. ^ Bridle 1990a, p. 227.
10. ^ Bridle 1990b, p. 213.