Softmax function

From Wikipedia, the free encyclopedia

In mathematics, in particular probability theory and related fields, the softmax function is a generalization of the logistic function that maps a length-K vector \mathbf{z} of real values to a length-K vector \sigma(\mathbf{z}) of real values in the range (0, 1), defined for j = 1, \dots, K as:

\sigma(\mathbf{z})_j = \frac{e^{\mathbf{z}_j}}{\sum_{k=1}^K e^{\mathbf{z}_k}}
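
The definition can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `softmax` and the example input are chosen here for illustration. Subtracting the largest input before exponentiating is a common precaution against overflow and does not change the result, because the common factor cancels in the ratio.

```python
import math

def softmax(z):
    """Return the softmax of a sequence of real numbers as a list."""
    # Shift by the maximum so that exp() never receives a large positive
    # argument; the shift cancels between numerator and denominator.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # three values in (0, 1), largest for the input 3.0
print(sum(probs))  # 1.0 up to floating-point rounding
```

With K = 2 and inputs (z, 0), the first output equals the logistic function 1 / (1 + e^{-z}), which is the sense in which softmax generalizes it.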

Since the vector \sigma(\mathbf{z}) sums to one and all its elements are strictly between zero and one, they represent a categorical probability distribution. For this reason, the softmax function is used in various probabilistic multiclass classification methods including multinomial logistic regression,[1] multiclass linear discriminant analysis, naive Bayes classifiers and neural networks.[2] Specifically, in multinomial LR and LDA, the input to the function is the result of K distinct linear functions, and the predicted probability for the j'th class given a sample vector x is:

P(y=j|\mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}
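
As a rough sketch of this formula, the snippet below computes the class probabilities from one weight vector per class. The function name `class_probabilities`, the feature vector, and the weights are all hypothetical values chosen for illustration; in practice the weights would come from fitting a model.

```python
import math

def class_probabilities(x, weights):
    """Return P(y = j | x) for every class j, given one weight vector per class."""
    # The score of class j is the dot product x^T w_j.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]
    m = max(scores)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: 2 features, K = 3 classes.
x = [1.0, 2.0]
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
p = class_probabilities(x, weights)
print(p)  # the third class has the largest linear score, hence the largest probability
```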

Artificial neural networks

In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
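
A minimal sketch of the log loss for a single training example follows. It assumes the network's final-layer outputs are available as a list (called `logits` here, a name not used in the text) and that the true class is given as an index. The gradient of this loss with respect to the softmax inputs is the softmax output minus the one-hot target, a standard result that is used here without derivation.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Log loss of the softmax output for the true class index `target`."""
    return -math.log(softmax(z)[target])

def cross_entropy_grad(z, target):
    """Gradient of the loss w.r.t. z: softmax(z) minus the one-hot target."""
    p = softmax(z)
    return [pj - (1.0 if j == target else 0.0) for j, pj in enumerate(p)]

logits = [2.0, 1.0, 0.1]   # hypothetical final-layer outputs
loss = cross_entropy(logits, 0)
grad = cross_entropy_grad(logits, 0)
print(loss, grad)  # grad is negative for the true class, positive elsewhere
```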

Write \sigma(\mathbf{q}, i) for the i-th component of the softmax of an input vector \mathbf{q}. Since the function then maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

\frac{\partial}{\partial q_k}\sigma(\mathbf{q}, i) = \dots = \sigma(\mathbf{q}, i)(\delta_{ik} - \sigma(\mathbf{q}, k))

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
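
The derivative formula can be checked numerically. The sketch below builds the full matrix of partial derivatives from the expression above and compares one entry with a central finite difference; the helper name `softmax_jacobian`, the test vector, and the step size are illustrative choices.

```python
import math

def softmax(q):
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(q):
    """J[i][k] = d softmax(q)_i / d q_k = s_i * (delta_ik - s_k)."""
    s = softmax(q)
    n = len(s)
    return [[s[i] * ((1.0 if i == k else 0.0) - s[k]) for k in range(n)]
            for i in range(n)]

q = [0.5, -1.0, 2.0]
J = softmax_jacobian(q)

# Central finite difference for the entry (i, k) = (0, 2).
eps = 1e-6
i, k = 0, 2
q_plus = list(q)
q_plus[k] += eps
q_minus = list(q)
q_minus[k] -= eps
numeric = (softmax(q_plus)[i] - softmax(q_minus)[i]) / (2 * eps)
print(J[i][k], numeric)  # the two values should agree to several decimal places
```

Each row of the matrix sums to zero, which reflects the fact that the softmax outputs always sum to one.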

See Multinomial logit for a probability model which uses the softmax activation function.

Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[3]

P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^n\exp(q_t(i)/\tau)} \text{,}

where the action value q_t(a) corresponds to the expected reward of following action a, and \tau is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (\tau\to \infty), all actions have nearly the same probability. The lower the temperature, the more the expected rewards affect the probabilities. For a low temperature (\tau\to 0^+), the probability of the action with the highest expected reward tends to 1.
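
The effect of the temperature can be illustrated with a short sketch. The function names `action_probabilities` and `select_action`, and the action values, are hypothetical; the sampling step uses Python's standard `random.choices`.

```python
import math
import random

def action_probabilities(q_values, tau):
    """Softmax over action values with temperature tau > 0."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [q / tau for q in q_values]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = action_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [1.0, 2.0, 1.5]                       # hypothetical action values
hot = action_probabilities(q, tau=100.0)  # close to uniform
cold = action_probabilities(q, tau=0.01)  # almost all mass on action 1
print(hot, cold, select_action(q, tau=0.5))
```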

Smooth approximation of maximum

When parameterized by a real constant \alpha, the following formulation is a smooth, differentiable function of its inputs that approximates the maximum function for large positive \alpha:

\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{\sum_{i=1}^{n}x_i e^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}

\mathcal{S}_{\alpha} has the following properties:

  1. \mathcal{S}_{\alpha}\to \max as \alpha\to\infty
  2. \mathcal{S}_{0} is the average of its inputs
  3. \mathcal{S}_{\alpha}\to \min as \alpha\to -\infty
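
These three properties can be observed numerically. The sketch below is one possible implementation of \mathcal{S}_{\alpha}; the name `smooth_max` and the sample inputs are chosen here for illustration, and finite values of \alpha stand in for the limits.

```python
import math

def smooth_max(xs, alpha):
    """S_alpha: average of xs weighted by exp(alpha * x)."""
    # Shift the exponents by their maximum; the factor cancels in the ratio.
    m = max(alpha * x for x in xs)
    weights = [math.exp(alpha * x - m) for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)

xs = [1.0, 2.0, 5.0]
print(smooth_max(xs, 50.0))   # close to max(xs) = 5.0
print(smooth_max(xs, 0.0))    # arithmetic mean of xs
print(smooth_max(xs, -50.0))  # close to min(xs) = 1.0
```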

The partial derivative of \mathcal{S}_{\alpha} with respect to each input x_i is given by:

\nabla_{x_i}\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n}e^{\alpha x_j}}\left[1 + \alpha\left(x_i - \mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right)\right)\right] \text{,}

which makes the softmax function useful for optimization techniques that use gradient descent.
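
As a sanity check, the gradient expression can be compared with a finite-difference estimate. The sketch below assumes the definition of \mathcal{S}_{\alpha} given above; the names `smooth_max` and `smooth_max_grad`, the inputs, and the step size are illustrative.

```python
import math

def smooth_max(xs, alpha):
    m = max(alpha * x for x in xs)
    weights = [math.exp(alpha * x - m) for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)

def smooth_max_grad(xs, alpha):
    """Partial derivatives of S_alpha with respect to each x_i."""
    s = smooth_max(xs, alpha)
    m = max(alpha * x for x in xs)
    weights = [math.exp(alpha * x - m) for x in xs]
    total = sum(weights)
    # (e^{alpha x_i} / sum_j e^{alpha x_j}) * [1 + alpha * (x_i - S_alpha)]
    return [(w / total) * (1.0 + alpha * (x - s)) for x, w in zip(xs, weights)]

xs = [0.3, -1.2, 0.9]
alpha = 2.0
grad = smooth_max_grad(xs, alpha)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = []
for i in range(len(xs)):
    up = list(xs)
    up[i] += eps
    down = list(xs)
    down[i] -= eps
    numeric.append((smooth_max(up, alpha) - smooth_max(down, alpha)) / (2 * eps))
print(grad)
print(numeric)  # should agree with grad to several decimal places
```

The components of the gradient sum to one, since the \alpha-dependent terms cancel when weighted by the normalized exponentials.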

Softmax transformation

The softmax function is also used to standardize data that is positively skewed and includes many values around zero.[citation needed] It takes a variable such as revenue or age and transforms the values to a scale from zero to one.[4] This type of data transformation is especially useful when the data spans several orders of magnitude.


References

  1. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. pp. 206–209.
  2. ai-faq: What is a softmax activation function?
  3. Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA. Softmax Action Selection.
  4. Pyle (1999). Data Preparation for Data Mining. pp. 271–274, 355–359.

See also