# Softmax function


In mathematics, in particular probability theory and related fields, the softmax function is a generalization of the logistic function that maps a K-dimensional vector $\mathbf{z}$ of arbitrary real values to a K-dimensional vector $\sigma(\mathbf{z})$ of real values in the range (0, 1) that sum to 1. It is defined component-wise, for $j = 1, \dots, K$, as:

$\sigma(\mathbf{z})_j = \frac{e^{\mathbf{z}_j}}{\sum_{k=1}^K e^{\mathbf{z}_k}}$
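A direct NumPy sketch of this definition (the subtraction of the maximum is a standard numerical-stability device, not part of the formula itself; the shift cancels between numerator and denominator):

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k) for a 1-D array z."""
    shifted = z - np.max(z)        # stability shift; leaves the ratio unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

z = np.array([1.0, 2.0, 3.0])
print(softmax(z))                  # [0.09003057 0.24472847 0.66524096]
print(softmax(z).sum())            # 1.0
```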

Since the components of the vector $\sigma(\mathbf{z})$ are strictly between zero and one and sum to one, they can be interpreted as a categorical probability distribution. For this reason, the softmax function is used in various probabilistic multiclass classification methods, including multinomial logistic regression,[1] multiclass linear discriminant analysis, naive Bayes classifiers and artificial neural networks.[2] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability of the j-th class given a sample vector x is:

$P(y=j|\mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}$
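As an illustration of this setup, the sketch below forms the K linear scores $\mathbf{x}^\mathsf{T}\mathbf{w}_j$ from a hypothetical weight matrix whose rows are the $\mathbf{w}_j$ and passes them through the softmax function; the matrix, the sample, and the dimensions are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4                      # hypothetical number of classes and features
W = rng.normal(size=(K, d))      # row j holds the weight vector w_j
x = rng.normal(size=d)           # a sample vector

scores = W @ x                   # the K distinct linear functions x^T w_j
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs, probs.sum())        # P(y = j | x) for each class; sums to 1
```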

## Artificial neural networks

In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
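A minimal sketch of that training objective for a single sample, assuming hypothetical final-layer scores (logits) and a known true class; the closed-form gradient of the loss with respect to the scores (the softmax output minus the one-hot target) is a standard identity included here only as an illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # hypothetical final-layer scores
true_class = 0

p = softmax(logits)
loss = -np.log(p[true_class])          # log loss (cross-entropy) for one sample
grad = p.copy()
grad[true_class] -= 1.0                # d loss / d logits = softmax(logits) - one_hot
print(loss, grad)
```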

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

$\frac{\partial}{\partial q_k}\sigma(\mathbf{q})_i = \sigma(\mathbf{q})_i\left(\delta_{ik} - \sigma(\mathbf{q})_k\right)$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
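The identity above can be checked numerically; a small sketch that assembles the full Jacobian from the formula and compares one entry against a central finite difference:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = softmax(q)_i * (delta_ik - softmax(q)_k)."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([0.3, -1.2, 2.0])
J = softmax_jacobian(q)

eps = 1e-6
e2 = np.array([0.0, 0.0, 1.0])          # perturb q_2, watch component 0
numeric = (softmax(q + eps * e2)[0] - softmax(q - eps * e2)[0]) / (2 * eps)
print(J[0, 2], numeric)                 # the two values agree to high precision
```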

See Multinomial logit for a probability model which uses the softmax activation function.

## Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[3]

$P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^n\exp(q_t(i)/\tau)} \text{,}$

where the action value $q_t(a)$ corresponds to the expected reward of taking action a and $\tau$ is called a temperature parameter (in allusion to chemical kinetics). For high temperatures ($\tau\to \infty$), all actions have nearly the same probability; the lower the temperature, the more the expected rewards affect the probability. For a low temperature ($\tau\to 0^+$), the probability of the action with the highest expected reward tends to 1.
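A small sketch of this action-selection rule with invented action values, showing how the temperature interpolates between a nearly uniform choice and a nearly greedy one:

```python
import numpy as np

def action_probabilities(q, tau):
    """P(a) proportional to exp(q(a) / tau)."""
    z = q / tau
    e = np.exp(z - z.max())            # stability shift
    return e / e.sum()

q = np.array([1.0, 2.0, 3.0])          # hypothetical action-value estimates q_t(a)
for tau in (10.0, 1.0, 0.1):
    print(tau, action_probabilities(q, tau))
# High tau: probabilities are nearly uniform.
# Low tau: nearly all probability mass falls on the highest-valued action.
```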

## Smooth approximation of maximum

When parameterized by a real constant $\alpha$, the following formulation becomes a smooth, differentiable approximation of the maximum function for large positive $\alpha$:

$\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{\sum_{i=1}^{n}x_i e^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$

$\mathcal{S}_{\alpha}$ has the following properties:

1. $\mathcal{S}_{\alpha}\to \max$ as $\alpha\to\infty$
2. $\mathcal{S}_{0}$ is the average of its inputs
3. $\mathcal{S}_{\alpha}\to \min$ as $\alpha\to -\infty$
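These limiting behaviours are easy to observe numerically; a short sketch of the definition above with invented inputs:

```python
import numpy as np

def smooth_max(x, alpha):
    """S_alpha(x) = sum_i x_i * exp(alpha * x_i) / sum_i exp(alpha * x_i)."""
    t = alpha * x
    w = np.exp(t - t.max())            # stability shift; cancels in the ratio
    return np.sum(x * w) / np.sum(w)

x = np.array([1.0, 3.0, 2.0])
print(smooth_max(x, 0.0))              # 2.0  -> the average of the inputs
print(smooth_max(x, 50.0))             # ~3.0 -> approaches max(x)
print(smooth_max(x, -50.0))            # ~1.0 -> approaches min(x)
```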

The gradient of $\mathcal{S}_{\alpha}$ is given by:

$\nabla_{x_i}\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n}e^{\alpha x_j}}\left[1 + \alpha\left(x_i - \mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right)\right)\right] \text{,}$

which makes the softmax function useful for optimization techniques that use gradient descent.
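A sketch that evaluates this gradient expression and checks one component against a central finite difference (the inputs and $\alpha$ are arbitrary):

```python
import numpy as np

def smooth_max(x, alpha):
    t = alpha * x
    w = np.exp(t - t.max())
    return np.sum(x * w) / np.sum(w)

def smooth_max_grad(x, alpha):
    """Component i: softmax weight_i * (1 + alpha * (x_i - S_alpha(x)))."""
    t = alpha * x
    w = np.exp(t - t.max())
    w /= w.sum()
    return w * (1.0 + alpha * (x - smooth_max(x, alpha)))

x = np.array([0.5, -1.0, 2.0])
alpha = 1.5
g = smooth_max_grad(x, alpha)

eps = 1e-6
e0 = np.array([1.0, 0.0, 0.0])
numeric = (smooth_max(x + eps * e0, alpha) - smooth_max(x - eps * e0, alpha)) / (2 * eps)
print(g[0], numeric)                   # should agree closely
```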

## Softmax transformation

The softmax function is also used to standardize data that is positively skewed and includes many values near zero.[citation needed] It takes a variable such as revenue or age and transforms the values to a scale from zero to one.[4] This type of data transformation is especially useful when the data spans several orders of magnitude.
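The exact recipe varies by author; one common reading, sketched below under that assumption, is to standardize the values and then squash them through the logistic function so that the bulk of the data lands in (0, 1) while extreme values are compressed toward the ends of the scale:

```python
import numpy as np

def softmax_scale(x):
    """Sketch of one interpretation of the softmax transformation:
    z-score the data, then apply the logistic function. Typical values map
    near the middle of (0, 1); outliers are compressed toward 0 or 1 rather
    than dominating the scale.
    """
    z = (x - x.mean()) / x.std()
    return 1.0 / (1.0 + np.exp(-z))

revenue = np.array([10.0, 12.0, 9.0, 11.0, 14.0, 13.0, 500.0])  # skewed toy data
print(softmax_scale(revenue))
```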

## References

1. ^ Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. pp. 206–209.
2. ^
3. ^ Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press. Softmax Action Selection.
4. ^ Pyle, Dorian (1999). Data Preparation for Data Mining. pp. 271–274, 355–359.