# Softmax function


In mathematics, the softmax function, or normalized exponential function,[1]:198 is a generalization of the logistic function that "squashes" a K-dimensional vector ${\displaystyle \mathbf {z} }$ of arbitrary real values to a K-dimensional vector ${\displaystyle \sigma (\mathbf {z} )}$ of real values, where each entry is in the range (0, 1),[a] and all the entries add up to 1.

${\displaystyle \sigma :\mathbb {R} ^{K}\to \left\{\sigma \in \mathbb {R} ^{K}|\sigma _{i}>0,\sum _{i=1}^{K}\sigma _{i}=1\right\}}$
${\displaystyle \sigma (\mathbf {z} )_{j}={\frac {e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}}}}$    for j = 1, …, K.

In probability theory, the output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[1]:206–209 multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[2] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

${\displaystyle P(y=j\mid \mathbf {x} )={\frac {e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{j}}}{\sum _{k=1}^{K}e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{k}}}}}$

This can be seen as the composition of K linear functions ${\displaystyle \mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{1},\ldots ,\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{K}}$ and the softmax function (where ${\displaystyle \mathbf {x} ^{\mathsf {T}}\mathbf {w} }$ denotes the inner product of ${\displaystyle \mathbf {x} }$ and ${\displaystyle \mathbf {w} }$). The operation is equivalent to applying a linear operator defined by ${\displaystyle \mathbf {w} }$ to vectors ${\displaystyle \mathbf {x} }$, thus transforming the original, probably high-dimensional, input to vectors in a K-dimensional space ${\displaystyle \mathbb {R} ^{K}}$.
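The composition described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `class_probabilities`, the sample vector and the weight values are all hypothetical, chosen only to show K = 2 linear functions followed by softmax.

```python
import math

def softmax(z):
    """Normalized exponentials of the scores in z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(x, weights):
    """P(y = j | x) for each class j, given one weight vector w_j per class."""
    # K linear functions: the inner product x^T w_j for each class j.
    scores = [sum(xi * wi for xi, wi in zip(x, w_j)) for w_j in weights]
    return softmax(scores)

# Hypothetical sample with 3 features and K = 2 classes (values are made up).
x = [1.0, 0.5, -1.0]
weights = [[0.2, -0.1, 0.4], [-0.3, 0.8, 0.1]]
probs = class_probabilities(x, weights)
print(probs)
```

The two scores here are −0.25 and 0.0, so the second class receives the larger probability, and the two probabilities sum to 1.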

## Properties

Geometrically, the softmax function maps the vector space ${\displaystyle \mathbb {R} ^{K}}$ to the interior of the standard ${\displaystyle (K-1)}$-simplex, cutting the dimension by one (the range is a ${\displaystyle (K-1)}$-dimensional simplex in ${\displaystyle K}$-dimensional space). This is due to the linear constraint that all outputs sum to 1, which means the output lies on a hyperplane.

Along the main diagonal ${\displaystyle (x,x,\dots ,x),}$ softmax is just the uniform distribution on outputs, ${\displaystyle (1/n,\dots ,1/n)}$: equal scores yield equal probabilities.

More generally, softmax is invariant under translation by the same value in each coordinate: adding ${\displaystyle \mathbf {c} =(c,\dots ,c)}$ to the inputs ${\displaystyle \mathbf {z} }$ yields ${\displaystyle \sigma (\mathbf {z} +\mathbf {c} )=\sigma (\mathbf {z} )}$, because it multiplies each exponential by the same factor, ${\displaystyle e^{c}}$ (since ${\displaystyle e^{z_{i}+c}=e^{z_{i}}\cdot e^{c}}$), so the ratios do not change:

${\displaystyle \sigma (\mathbf {z} +\mathbf {c} )_{j}={\frac {e^{z_{j}+c}}{\sum _{k=1}^{K}e^{z_{k}+c}}}={\frac {e^{z_{j}}\cdot e^{c}}{\sum _{k=1}^{K}e^{z_{k}}\cdot e^{c}}}=\sigma (\mathbf {z} )_{j}.}$
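Translation invariance also has a practical use that is common in numerical code, though not described in this article: choosing c = −max(z) leaves the result unchanged while keeping every exponent at or below zero, which avoids floating-point overflow. The sketch below assumes plain Python floats; `softmax_stable` is an illustrative name, not a library function.

```python
import math

def softmax(z):
    """Direct implementation of the definition."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Softmax computed after subtracting max(z) from every input."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # the largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
shifted = [v + 100.0 for v in z]

# Adding the same constant to every input does not change the output.
same = all(math.isclose(a, b, rel_tol=1e-9)
           for a, b in zip(softmax(z), softmax(shifted)))

# math.exp(1000.0) raises OverflowError, so the direct version would fail
# on these inputs, while the shifted version does not.
large = softmax_stable([1000.0, 1001.0, 1002.0])
print(same, large)
```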

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: ${\displaystyle \mathbf {c} }$ where ${\textstyle c={\frac {1}{n}}\sum z_{i}}$), and then the softmax takes the hyperplane of points that sum to zero, ${\textstyle \sum z_{i}=0}$, to the open simplex of positive values that sum to 1, ${\textstyle \sum \sigma (\mathbf {z} )_{i}=1}$, analogously to how the exponential takes 0 to 1, ${\displaystyle e^{0}=1}$, and is positive.

By contrast, softmax is not invariant under scaling. For instance, ${\displaystyle \sigma {\bigl (}(0,1){\bigr )}={\bigl (}1/(1+e),e/(1+e){\bigr )}}$ but ${\displaystyle \sigma {\bigl (}(0,2){\bigr )}={\bigl (}1/(1+e^{2}),e^{2}/(1+e^{2}){\bigr )}.}$

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say ${\displaystyle z_{2}=0}$), so ${\displaystyle e^{0}=1}$, and the other variable can vary, denote it ${\displaystyle z_{1}=x}$, so ${\textstyle e^{z_{1}}/\sum _{k=1}^{2}e^{z_{k}}=e^{x}/(e^{x}+1),}$ the standard logistic function, and ${\textstyle e^{z_{2}}/\sum _{k=1}^{2}e^{z_{k}}=1/(e^{x}+1),}$ its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line ${\displaystyle (x/2,-x/2)}$, with outputs ${\displaystyle e^{x/2}/(e^{x/2}+e^{-x/2})=e^{x}/(e^{x}+1)}$ and ${\displaystyle e^{-x/2}/(e^{x/2}+e^{-x/2})=1/(e^{x}+1).}$
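The reduction to the logistic function can be checked numerically. The snippet below is a small sanity check under the stated setup (two inputs, one fixed at 0); the test points are arbitrary.

```python
import math

def softmax(z):
    """Normalized exponentials of the scores in z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(x):
    """Standard logistic function e^x / (e^x + 1), written as 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    first, second = softmax([x, 0.0])
    assert math.isclose(first, logistic(x))
    assert math.isclose(second, 1.0 - logistic(x))
    # The symmetric parametrization (x/2, -x/2) gives the same first output.
    sym_first, _ = softmax([x / 2, -x / 2])
    assert math.isclose(sym_first, logistic(x))
```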

The softmax function is also the gradient of the LogSumExp function, a smooth maximum; defining:

${\displaystyle \mathrm {LSE} (z_{1},\dots ,z_{n})=\log \left(\exp(z_{1})+\cdots +\exp(z_{n})\right),}$

the partial derivatives are:

${\textstyle \partial _{i}\mathrm {LSE} (\mathbf {z} )=\exp z_{i}/{\bigl (}\sum _{j}\exp z_{j}{\bigr )}.}$

Expressing the partial derivatives as a vector with the gradient yields the softmax.
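One way to convince oneself of this identity is to compare a finite-difference gradient of LogSumExp with the softmax output. The following is a rough numerical check, not a proof; the step size `h` and the helper `numerical_gradient` are choices made for this illustration.

```python
import math

def softmax(z):
    """Normalized exponentials of the scores in z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(z):
    """LSE(z) = log(exp(z_1) + ... + exp(z_n))."""
    return math.log(sum(math.exp(v) for v in z))

def numerical_gradient(f, z, h=1e-6):
    """Central finite-difference approximation of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + h] + z[i + 1:]
        down = z[:i] + [z[i] - h] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

z = [1.0, 2.0, 3.0, 4.0]
approx = numerical_gradient(logsumexp, z)
exact = softmax(z)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
print(max_error)  # expected to be tiny, limited by finite-difference error
```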

## Example

If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the '4' was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: softmax is not scale invariant, so if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6) the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1 softmax, in fact, de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6=0.25).

Computation of this example using simple Python code:

>>> import math
>>> z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
>>> z_exp = [math.exp(i) for i in z]
>>> print([round(i, 2) for i in z_exp])
[2.72, 7.39, 20.09, 54.6, 2.72, 7.39, 20.09]
>>> sum_z_exp = sum(z_exp)
>>> print(round(sum_z_exp, 2))
114.98
>>> softmax = [i / sum_z_exp for i in z_exp]
>>> print([round(i, 3) for i in softmax])
[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]
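The scaled-down input from the example text can be checked in the same way. This is the same computation applied to [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3]; the variable names are illustrative.

```python
import math

z_small = [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3]
z_exp = [math.exp(v) for v in z_small]
sum_z_exp = sum(z_exp)
softmax_small = [round(e / sum_z_exp, 3) for e in z_exp]
print(softmax_small)
# [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]
```

The largest entry is now only 0.169, much closer to the uniform value 1/7 ≈ 0.143 than the 0.475 obtained for the unscaled input.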


Another example in Python using NumPy:

>>> import numpy as np
>>> z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
>>> softmax = lambda z:np.exp(z)/np.sum(np.exp(z))
>>> softmax(z)
array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054,
0.06426166, 0.1746813 ])


Here is an example of Julia code:

julia> A = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
7-element Array{Float64,1}:
1.0
2.0
3.0
4.0
1.0
2.0
3.0

julia> exp.(A) ./ sum(exp.(A))
7-element Array{Float64,1}:
0.0236405
0.0642617
0.174681
0.474833
0.0236405
0.0642617
0.174681


Here is an example of R code:

> z <- c(1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0)
> softmax <- exp(z)/sum(exp(z))
> softmax
[1] 0.02364054 0.06426166 0.17468130 0.47483300 0.02364054 0.06426166 0.17468130


## Neural networks

The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

${\displaystyle {\frac {\partial }{\partial q_{k}}}\sigma ({\textbf {q}},i)=\cdots =\sigma ({\textbf {q}},i)(\delta _{ik}-\sigma ({\textbf {q}},k))}$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
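The derivative formula can be written out as a full Jacobian matrix and compared against finite differences. This is a sketch for illustration: `softmax_jacobian` is a hypothetical helper name, and the input vector and step size are arbitrary.

```python
import math

def softmax(q):
    """Normalized exponentials of the scores in q."""
    exps = [math.exp(v) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(q):
    """Matrix J with J[i][k] = d softmax(q)_i / d q_k = s_i * (delta_ik - s_k)."""
    s = softmax(q)
    n = len(s)
    return [[s[i] * ((1.0 if i == k else 0.0) - s[k]) for k in range(n)]
            for i in range(n)]

q = [1.0, 2.0, 3.0]
J = softmax_jacobian(q)

# Finite-difference check of one column (derivatives with respect to q_k).
h = 1e-6
k = 1
up = softmax([v + (h if j == k else 0.0) for j, v in enumerate(q)])
down = softmax([v - (h if j == k else 0.0) for j, v in enumerate(q)])
column_error = max(abs((u - d) / (2 * h) - J[i][k])
                   for i, (u, d) in enumerate(zip(up, down)))
print(column_error)
```

Because the outputs always sum to 1, each row and each column of the Jacobian sums to zero.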

See Multinomial logit for a probability model which uses the softmax activation function.

## Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[3]

${\displaystyle P_{t}(a)={\frac {\exp(q_{t}(a)/\tau )}{\sum _{i=1}^{n}\exp(q_{t}(i)/\tau )}}{\text{,}}}$

where the action value ${\displaystyle q_{t}(a)}$ corresponds to the expected reward of following action a and ${\displaystyle \tau }$ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (${\displaystyle \tau \to \infty }$), all actions have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature (${\displaystyle \tau \to 0^{+}}$), the probability of the action with the highest expected reward tends to 1.
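The effect of the temperature can be seen in a short sketch. The function name `softmax_policy` and the action values are hypothetical; subtracting the maximum before exponentiating is an implementation choice made here to avoid overflow at small τ and does not change the result.

```python
import math

def softmax_policy(action_values, tau):
    """Action probabilities exp(q/tau) / sum_i exp(q_i/tau), for tau > 0."""
    scaled = [q / tau for q in action_values]
    m = max(scaled)  # shift by the maximum; the probabilities are unchanged
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0, 3.0]                  # hypothetical action values
hot = softmax_policy(q, tau=100.0)   # high temperature: close to uniform
cold = softmax_policy(q, tau=0.01)   # low temperature: nearly greedy
print(hot)
print(cold)
```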

## Softmax normalization

Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset. It is useful given outlier data, which we wish to include in the dataset while still preserving the significance of data within a standard deviation of the mean. The data are nonlinearly transformed using one of the sigmoidal functions.

The logistic sigmoid function:[4]

${\displaystyle x_{i}'\equiv {\frac {1}{1+e^{-(x_{i}-\mu _{i})/\sigma _{i}}}}}$

The hyperbolic tangent function, tanh:[4]

${\displaystyle x_{i}'\equiv {\frac {1-e^{-(x_{i}-\mu _{i})/\sigma _{i}}}{1+e^{-(x_{i}-\mu _{i})/\sigma _{i}}}}}$

The sigmoid function limits the range of the normalized data to values between 0 and 1. The sigmoid function is almost linear near the mean and has smooth nonlinearity at both extremes, ensuring that all data points are within a limited range. This maintains the resolution of most values within a standard deviation of the mean.

The hyperbolic tangent function, tanh, limits the range of the normalized data to values between −1 and 1. The hyperbolic tangent function is almost linear near the mean; in the form given above it equals twice the sigmoid minus one, so its slope is twice that of the sigmoid function. Like sigmoid, it has smooth, monotonic nonlinearity at both extremes. Also, like the sigmoid function, it remains differentiable everywhere and the sign of the derivative (slope) is unaffected by the normalization. This ensures that optimization and numerical integration algorithms can continue to rely on the derivative to estimate changes to the output (normalized value) that will be produced by changes to the input in the region near any linearisation point.
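Both normalizations can be sketched with the standard library. The example assumes a single variable, uses the sample standard deviation from `statistics.stdev` for σ (the text does not say which estimator is intended), and applies it to made-up data containing one outlier.

```python
import math
import statistics

def sigmoid_normalize(xs):
    """Map each x to 1 / (1 + exp(-(x - mu) / sigma)), a value in (0, 1)."""
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)  # assumption: sample standard deviation
    return [1.0 / (1.0 + math.exp(-(x - mu) / sigma)) for x in xs]

def tanh_normalize(xs):
    """Map each x to (1 - exp(-u)) / (1 + exp(-u)) with u = (x - mu) / sigma."""
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)  # assumption: sample standard deviation
    out = []
    for x in xs:
        e = math.exp(-(x - mu) / sigma)
        out.append((1.0 - e) / (1.0 + e))  # a value in (-1, 1)
    return out

data = [2.0, 3.0, 4.0, 5.0, 100.0]  # hypothetical data with one outlier
sig = sigmoid_normalize(data)
th = tanh_normalize(data)
print(sig)
print(th)
```

The outlier stays in the dataset but is mapped into the same bounded range as the other points, and the ordering of the values is preserved.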

## Relation with the Boltzmann distribution

The softmax function also happens to be the probability of an atom being found in a quantum state of energy ${\displaystyle \varepsilon _{i}}$ when the atom is part of an ensemble that has reached thermal equilibrium at temperature ${\displaystyle T}$. This is known as the Boltzmann distribution. The expected relative occupancy of each state is ${\displaystyle e^{-\varepsilon _{i}/k_{B}T}}$, and this is normalised so that the occupancies sum to 1 over all energy levels. In this analogy, the input to the softmax function is the negative energy of each quantum state divided by ${\displaystyle k_{B}T}$.
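The analogy can be made concrete by computing occupation probabilities as a softmax of −ε_i / (k_B T). The three energy levels below are hypothetical values chosen for illustration, and `boltzmann_probabilities` is not a library function.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def boltzmann_probabilities(energies, temperature):
    """Occupation probabilities: softmax of -E_i / (k_B * T)."""
    z = [-e / (K_B * temperature) for e in energies]
    m = max(z)  # shift by the maximum for numerical safety
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical three-level system, energies in joules, at T = 300 K.
energies = [0.0, 1.0e-21, 2.0e-21]
p = boltzmann_probabilities(energies, 300.0)
print(p)
```

Lower-energy states receive higher probability, and the ratio of any two probabilities equals the ratio of their Boltzmann factors.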

a. ^ Except if ${\displaystyle K=1}$, in which case the range is just 1.