# Softmax function

In mathematics, in particular probability theory and related fields, the softmax function, or normalized exponential,[1]:198 is a generalization of the logistic function that "squashes" a $K$-dimensional vector $\mathbf{z}$ of arbitrary real values to a $K$-dimensional vector $\sigma(\mathbf{z})$ of real values in the range (0, 1) that add up to 1. The function is given by

$\sigma(\mathbf{z})_j = \frac{e^{\mathbf{z}_j}}{\sum_{k=1}^K e^{\mathbf{z}_k}}$    for j=1,...,K.
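The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `softmax` and the sample input are chosen here for the example, and subtracting the maximum before exponentiating is a common numerical safeguard that is not part of the definition itself (the common factor cancels in the ratio).

```python
import math


def softmax(z):
    """Return the softmax of a sequence of real numbers as a list."""
    # Subtracting the maximum does not change the result, because the
    # factor exp(-m) cancels between numerator and denominator, but it
    # keeps math.exp from overflowing on large inputs.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)       # three values in (0, 1), increasing with the input
print(sum(probs))  # approximately 1
```

Larger inputs receive larger outputs, and the outputs always sum to one up to floating-point rounding.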

Since the components of the vector $\sigma(\mathbf{z})$ sum to one and are all strictly between zero and one, they represent a categorical probability distribution. For this reason, the softmax function is used in various probabilistic multiclass classification methods including multinomial logistic regression,[1]:206–209 multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[2] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of $K$ distinct linear functions, and the predicted probability for the $j$-th class given a sample vector $\mathbf{x}$ is:

$P(y=j|\mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}$

This can be seen as the composition of K linear functions $\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K$ and the softmax function.
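As a rough sketch of this composition, the snippet below first evaluates the $K$ linear functions $\mathbf{x}^\mathsf{T}\mathbf{w}_k$ and then applies the softmax to the resulting scores. The weight vectors `W`, the sample `x`, and the function name `class_probabilities` are hypothetical values made up for illustration; in practice the weights would be fitted to data.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(x, weights):
    """P(y = j | x) for each class j, given one weight vector w_j per class."""
    # Step 1: K linear functions, one dot product x^T w_k per class.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]
    # Step 2: softmax turns the K scores into a categorical distribution.
    return softmax(scores)


# Hypothetical weights for K = 3 classes and 2 features (not fitted to data).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]
p = class_probabilities(x, W)
predicted_class = max(range(len(p)), key=lambda j: p[j])
print(p, predicted_class)
```

The class with the largest linear score always receives the largest probability, since the exponential is monotonic.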

## Artificial neural networks

In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
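The log loss for a single training example can be sketched as follows, assuming the network's final-layer outputs (often called logits) are given as a plain list `z` and the correct class as an integer index `target`; both names are illustrative. The sketch computes the logarithm of the softmax directly, which avoids taking the log of a very small probability. The gradient shown, softmax output minus the one-hot target, is the standard result for this pairing of softmax and cross-entropy.

```python
import math


def log_softmax(z):
    """Logarithm of each softmax component, computed stably."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]


def cross_entropy(z, target):
    """Log loss of the softmax output for the true class index `target`."""
    return -log_softmax(z)[target]


def cross_entropy_gradient(z, target):
    """Gradient of the loss with respect to z: softmax(z) minus one-hot."""
    grad = [math.exp(v) for v in log_softmax(z)]
    grad[target] -= 1.0
    return grad


logits = [2.0, 0.5, -1.0]
loss = cross_entropy(logits, target=0)
grad = cross_entropy_gradient(logits, target=0)
print(loss, grad)
```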

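Since the function maps a vector and a specific index $i$ to a real value, the derivative needs to take the index into account. Writing $\sigma(\textbf{q}, i)$ for the $i$-th component of the softmax of the input vector $\textbf{q}$: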

$\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \dots = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
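The derivative formula can be checked numerically. The sketch below builds the full matrix of partial derivatives from the closed form $\sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))$ and compares it with a central finite-difference estimate. The function names, the step size `h`, and the test vector are choices made for this example only.

```python
import math


def softmax(q):
    """Numerically stable softmax of a list of floats."""
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(q):
    """Entry [i][k] is sigma_i * (delta_ik - sigma_k)."""
    s = softmax(q)
    n = len(s)
    return [[s[i] * ((1.0 if i == k else 0.0) - s[k]) for k in range(n)]
            for i in range(n)]


def finite_difference_jacobian(q, h=1e-6):
    """Central-difference estimate of d sigma_i / d q_k, for comparison."""
    n = len(q)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        plus = list(q)
        minus = list(q)
        plus[k] += h
        minus[k] -= h
        s_plus, s_minus = softmax(plus), softmax(minus)
        for i in range(n):
            jac[i][k] = (s_plus[i] - s_minus[i]) / (2 * h)
    return jac


q = [0.5, -1.0, 2.0]
analytic = softmax_jacobian(q)
numeric = finite_difference_jacobian(q)
max_error = max(abs(analytic[i][k] - numeric[i][k])
                for i in range(3) for k in range(3))
print(max_error)  # small; limited by the finite-difference step
```

Each row of the matrix sums to zero, which reflects the fact that the softmax outputs always sum to one.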

See Multinomial logit for a probability model which uses the softmax activation function.

## Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[3]

$P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^n\exp(q_t(i)/\tau)} \text{,}$

where the action value $q_t(a)$ corresponds to the expected reward of following action $a$ and $\tau$ is called a temperature parameter (in allusion to chemical kinetics). For high temperatures ($\tau\to \infty$), all actions have nearly the same probability. The lower the temperature, the more the expected rewards affect the probability. For a low temperature ($\tau\to 0^+$), the probability of the action with the highest expected reward tends to 1.
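The effect of the temperature can be illustrated with a short sketch. The action values in `q`, the two temperatures, and the function names are hypothetical and chosen only to show the two limiting behaviours; sampling uses `random.choices` from the Python standard library.

```python
import math
import random


def action_probabilities(q_values, tau):
    """Softmax over action values q_t(a) with temperature tau > 0."""
    if tau <= 0:
        raise ValueError("temperature must be positive")
    scaled = [q / tau for q in q_values]
    m = max(scaled)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = action_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 1.5]                     # hypothetical action values
hot = action_probabilities(q, 100.0)    # high temperature: nearly uniform
cold = action_probabilities(q, 0.01)    # low temperature: nearly greedy
print(hot)
print(cold)
print(select_action(q, 1.0))
```

With the high temperature all three probabilities are close to one third, while with the low temperature almost all probability mass falls on the action with the highest value.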

## Smooth approximation of maximum

When parameterized by some constant $\alpha > 0$, the following formulation becomes a smooth, differentiable approximation of the maximum function (for $\alpha < 0$ it approximates the minimum instead):

$\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{\sum_{i=1}^{n}x_i e^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$

$\mathcal{S}_{\alpha}$ has the following properties:

1. $\mathcal{S}_{\alpha}\to \max$ as $\alpha\to\infty$
2. $\mathcal{S}_{0}$ is the average of its inputs
3. $\mathcal{S}_{\alpha}\to \min$ as $\alpha\to -\infty$

The gradient of $\mathcal{S}_{\alpha}$ is closely related to softmax and is given by:

$\nabla_{x_i}\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n}e^{\alpha x_j}}\left[1 + \alpha\left(x_i - \mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right)\right)\right] \text{,}$

which makes the softmax function useful for optimization techniques that use gradient descent.
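The properties of $\mathcal{S}_{\alpha}$ and its gradient can be explored with a small sketch. The helper names and the sample inputs below are illustrative assumptions; the weights are shifted by their maximum exponent before exponentiating, which leaves the ratio unchanged and avoids overflow for large $|\alpha|$.

```python
import math


def _weights(xs, alpha):
    """Softmax of alpha * x_i, i.e. e^{alpha x_i} / sum_j e^{alpha x_j}."""
    scaled = [alpha * x for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_max(xs, alpha):
    """S_alpha: the softmax-weighted average of the inputs."""
    return sum(x * w for x, w in zip(xs, _weights(xs, alpha)))


def smooth_max_gradient(xs, alpha):
    """Partial derivatives w_i * (1 + alpha * (x_i - S_alpha))."""
    w = _weights(xs, alpha)
    s = sum(x * wi for x, wi in zip(xs, w))
    return [wi * (1.0 + alpha * (x - s)) for x, wi in zip(xs, w)]


xs = [1.0, 2.0, 3.0]
print(smooth_max(xs, 50.0))    # close to max(xs)
print(smooth_max(xs, 0.0))     # the arithmetic mean of xs
print(smooth_max(xs, -50.0))   # close to min(xs)
print(smooth_max_gradient(xs, 2.0))
```

The partial derivatives always sum to one, because adding the same constant to every input shifts $\mathcal{S}_{\alpha}$ by exactly that constant.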

## Softmax normalization

Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset. It is useful given outlier data, which we wish to include in the dataset while still preserving the significance of data within a standard deviation of the mean. The data are nonlinearly transformed using a sigmoidal function, either the logistic sigmoid function or the hyperbolic tangent function:[4]

$x_i' \equiv \frac{1}{1+e^{-(\frac{x_i - \mu_i}{\sigma_i})}}$

or

$x_i' \equiv \frac{1-e^{-(\frac{x_i - \mu_i}{\sigma_i})}}{1+e^{-(\frac{x_i - \mu_i}{\sigma_i})}}$

The logistic form puts the normalized data in the range 0 to 1; the hyperbolic tangent form puts it in the range −1 to 1. In both cases the transformation is almost linear near the mean and has smooth nonlinearity at both extremes, ensuring that all data points are within a limited range. This maintains the resolution of most values within a standard deviation of the mean.
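Both transformations can be sketched for a single variable as follows. The function name `softmax_normalize`, the `use_tanh` flag, and the sample data are illustrative. The use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator of $\sigma_i$ is intended. The second formula is computed as $\tanh(t/2)$, which is algebraically equal to $(1 - e^{-t})/(1 + e^{-t})$.

```python
import math
import statistics


def softmax_normalize(data, use_tanh=False):
    """Sigmoidal normalization of one variable around its mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # assumption: population std deviation
    if sigma == 0:
        raise ValueError("data must not be constant")
    result = []
    for x in data:
        t = (x - mu) / sigma
        if use_tanh:
            # Equal to (1 - e^{-t}) / (1 + e^{-t}); output lies in (-1, 1).
            result.append(math.tanh(t / 2.0))
        else:
            # Logistic sigmoid; output lies in (0, 1).
            result.append(1.0 / (1.0 + math.exp(-t)))
    return result


data = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical sample with one outlier
logistic = softmax_normalize(data)
tanh_form = softmax_normalize(data, use_tanh=True)
print(logistic)
print(tanh_form)
```

The outlier stays in the dataset and keeps its rank, but its normalized value is bounded instead of dominating the scale.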

## References

1. ^ a b Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
2. ^
3. ^ Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. Softmax Action Selection.
4. ^ Artificial Neural Networks: An Introduction. 2005. pp. 16–17.