# Softmax activation function

The softmax activation function is a neural transfer function. In neural networks, transfer functions calculate a layer's output from its net input. Softmax is a biologically plausible approximation to the maximum operation.[1] It has been used to model the invariance operation of complex cells,[2] where it is defined as

$y=g \left( \frac{\sum_{j=1}^n x_j^{q+1}} {k+\left( \sum_{j=1}^n x_j^q \right)} \right) \text{,}$

where g is a sigmoid function, the $x_j$ are the values of the input nodes, k is a small constant that avoids division by zero, and the exponent q is a parameter controlling the non-linearity.
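The complex-cell formulation above can be sketched in Python; the logistic sigmoid chosen for g and the default values of k and q here are illustrative assumptions, not fixed by the model:

```python
import math

def complex_cell_softmax(x, q=2.0, k=1e-6):
    """Softmax response of a model complex cell (illustrative sketch).

    x : list of input node values (assumed non-negative)
    q : exponent controlling the non-linearity (illustrative default)
    k : small constant avoiding division by zero
    """
    num = sum(v ** (q + 1) for v in x)
    den = k + sum(v ** q for v in x)
    # g: logistic sigmoid, one common choice of sigmoid function
    return 1.0 / (1.0 + math.exp(-(num / den)))
```

As q grows, the ratio inside g is dominated by the largest input, which is the sense in which the function approximates the maximum operation.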

## Artificial neural networks

In neural network simulations, the term softmax activation function refers to a similar function defined by[3]

$\sigma \colon \mathbb{R}^n\times\mathbb{N} \to \mathbb{R}$
$\sigma(\textbf{q}, i) = \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} \text{,}$

where the vector q is the net input to a softmax node, and n is the number of nodes in the softmax layer. The function ensures that every output value lies between 0 and 1 and that the outputs sum to 1. It is a generalization of the logistic function to multiple variables.
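The definition translates directly into code. The sketch below subtracts the maximum entry before exponentiating, a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(q):
    """Softmax over the net-input vector q (one component per node)."""
    # Subtracting max(q) prevents exp() overflow for large inputs;
    # the shift cancels in the ratio, so the result is identical.
    e = np.exp(q - np.max(q))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# every entry of p lies in (0, 1) and the entries sum to 1
```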

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

$\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \dots = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))$

Here, the Kronecker delta $\delta_{ik}$ is used for brevity (cf. the derivative of a sigmoid function, which can likewise be expressed in terms of the function itself).
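The derivative can be verified numerically. This sketch builds the full Jacobian $J_{ik} = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))$ and checks it against central finite differences:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = d sigma(q, i) / d q_k = sigma_i * (delta_ik - sigma_k)."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

# Finite-difference check of the analytic Jacobian
q = np.array([0.5, -1.0, 2.0])
eps = 1e-6
num = np.empty((3, 3))
for k in range(3):
    dq = np.zeros(3)
    dq[k] = eps
    num[:, k] = (softmax(q + dq) - softmax(q - dq)) / (2 * eps)
assert np.allclose(softmax_jacobian(q), num, atol=1e-8)
```

Note that each column of the Jacobian sums to zero, reflecting the constraint that the outputs always sum to 1.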

See Multinomial logit for a probability model which uses the softmax activation function.

## Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[4]

$P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^n\exp(q_t(i)/\tau)} \text{,}$

where the action value $q_t(a)$ is the expected reward for taking action a, and $\tau$ is called a temperature parameter (in allusion to chemical kinetics). For high temperatures ($\tau\to \infty$), all actions have nearly the same probability; the lower the temperature, the more the expected rewards affect the probabilities. For a low temperature ($\tau\to 0^+$), the probability of the action with the highest expected reward tends to 1.
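The effect of the temperature can be illustrated with a short sketch (the action values below are made up for illustration):

```python
import numpy as np

def action_probabilities(q_values, tau):
    """Softmax (Boltzmann) action selection at temperature tau."""
    z = np.asarray(q_values, dtype=float) / tau
    e = np.exp(z - z.max())   # numerically stable softmax
    return e / e.sum()

q = [1.0, 2.0, 4.0]                        # hypothetical action values
hot = action_probabilities(q, tau=100.0)   # near-uniform exploration
cold = action_probabilities(q, tau=0.01)   # near-greedy exploitation
```

At high temperature the probabilities are almost uniform; at low temperature almost all the mass sits on the action with the highest value.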

## Smooth approximation of maximum

When parameterized by a constant $\alpha > 0$, the following formulation is a smooth, differentiable approximation of the maximum function:

$\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{\sum_{i=1}^{n}x_i e^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$

$\mathcal{S}_{\alpha}$ has the following properties:

1. $\mathcal{S}_{\alpha}\to \max$ as $\alpha\to\infty$
2. $\mathcal{S}_{0}$ is the average of its inputs
3. $\mathcal{S}_{\alpha}\to \min$ as $\alpha\to -\infty$
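These three properties can be checked numerically. The sketch below shifts the exponents for numerical stability; the shift cancels in the ratio:

```python
import numpy as np

def smooth_max(x, alpha):
    """S_alpha: mean of x weighted by exp(alpha * x_i)."""
    x = np.asarray(x, dtype=float)
    # Subtract max(alpha * x) to avoid exp() overflow; cancels in the ratio.
    w = np.exp(alpha * x - np.max(alpha * x))
    return float((x * w).sum() / w.sum())

x = [1.0, 3.0, 2.0]
# large positive alpha -> max, alpha = 0 -> mean, large negative alpha -> min
```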

The gradient of $\mathcal{S}_{\alpha}$ is given by:

$\nabla_{x_i}\mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n}e^{\alpha x_j}}\left[1 + \alpha\left(x_i - \mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right)\right)\right] \text{,}$

which makes the softmax function useful for optimization techniques that use gradient descent.
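The closed-form gradient can be validated against central finite differences:

```python
import numpy as np

def smooth_max(x, alpha):
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * x - np.max(alpha * x))
    return float((x * w).sum() / w.sum())

def smooth_max_grad(x, alpha):
    """Closed form: softmax weight of x_i times 1 + alpha * (x_i - S_alpha)."""
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * x - np.max(alpha * x))
    p = w / w.sum()              # softmax weights e^{alpha x_i} / sum_j e^{alpha x_j}
    s = float((x * p).sum())     # S_alpha itself
    return p * (1.0 + alpha * (x - s))

# Finite-difference check of the closed form
x = np.array([0.5, -1.0, 2.0])
eps = 1e-6
num = np.array([(smooth_max(x + eps * np.eye(3)[i], 1.5)
                 - smooth_max(x - eps * np.eye(3)[i], 1.5)) / (2 * eps)
                for i in range(3)])
assert np.allclose(smooth_max_grad(x, 1.5), num, atol=1e-7)
```

At $\alpha = 0$, $\mathcal{S}_0$ is the mean, so the gradient reduces to $1/n$ in every component, a quick sanity check on the formula.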

## Softmax transformation

The softmax function is also used to standardize data that is positively skewed and has many values clustered around zero.[citation needed] It takes a variable such as revenue or age and transforms the values to a scale from zero to one.[5] This type of transformation is especially needed when the data span many orders of magnitude.

For example, revenues for customers could span anywhere from 0 to 300,000. Suppose we have revenue figures ranging from 3 to 300,000. Expressed in powers of 10, 3 becomes $3 \times 10^0$ and 300,000 becomes $3 \times 10^5$; since $10^0 = 1$, these two numbers span 5 orders of magnitude. Range scaling is a typical reason to use a function such as softmax.[citation needed]
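There is no single canonical softmax transformation; one common variant standardizes each value and squashes it through a logistic function. The sketch below follows that pattern; the scale parameter `lam` is an illustrative assumption, not a fixed convention:

```python
import math

def softmax_scale(values, lam=1.0):
    """Squash skewed values into (0, 1) via a logistic of the
    standardized value (a sketch of the softmax transformation;
    lam controls the width of the near-linear central region).
    Assumes the values are not all identical (std > 0)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [1.0 / (1.0 + math.exp(-(v - mean) / (lam * std)))
            for v in values]

revenues = [3, 40, 500, 7_000, 90_000, 300_000]  # spans 5 orders of magnitude
scaled = softmax_scale(revenues)  # every value now lies strictly in (0, 1)
```

Because the logistic is monotone, the transformation preserves the ordering of the original values while compressing the extremes.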

## References

1. ^ Cadieu C, Kouh M, Pasupathy A, Connor CE, Riesenhuber M, and Poggio T. A Model of V4 Shape Selectivity and Invariance. J Neurophysiol 98: 1733–1750, 2007.
2. ^ Serre T, Kouh M, Cadieu C, Knoblich U, Kreiman G, and Poggio T. A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex. CBCL Paper 259/AI Memo 2005-036. Cambridge, MA: MIT, 2005.
3. ^
4. ^ Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. See the section "Softmax Action Selection".
5. ^ Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann. pp. 271–274, 355–359.