Softmax activation function
The softmax activation function is a neural transfer function. In neural networks, a transfer function computes a layer's output from its net input. Softmax is a biologically plausible approximation to the maximum operation. It is used to simulate an invariance operation of complex cells, where it is defined as
$$y = g\left(\frac{\sum_{j=1}^{n} x_j^{q+1}}{k + \sum_{j=1}^{n} x_j^{q}}\right)$$
where $g$ is a sigmoid function, $x_j$ is the value of input node $j$, $n$ is the number of input nodes, $k$ is a small constant that avoids division by zero, and the exponent $q$ is a parameter that controls the non-linearity.
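The behaviour of the exponent can be checked numerically. The following is a minimal sketch, assuming the form $y = g\left(\sum_j x_j^{q+1} / (k + \sum_j x_j^{q})\right)$ with non-negative inputs; the names `softmax_ratio` and `soft_max_unit`, the default value of `k`, and the choice of the logistic function for `g` are illustrative assumptions, not taken from the cited model.

```python
import math

def logistic(z):
    """Logistic sigmoid; an assumed choice for the sigmoid g."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax_ratio(x, q, k=1e-12):
    """Return sum(x_j^(q+1)) / (k + sum(x_j^q)) for non-negative inputs x."""
    numerator = sum(v ** (q + 1) for v in x)
    denominator = k + sum(v ** q for v in x)
    return numerator / denominator

def soft_max_unit(x, q, k=1e-12, g=logistic):
    """Apply the sigmoid g to the ratio, giving the unit's output y."""
    return g(softmax_ratio(x, q, k))

inputs = [0.2, 0.5, 0.9]
mean_like = softmax_ratio(inputs, q=0)   # q = 0: close to the mean of the inputs
max_like = softmax_ratio(inputs, q=50)   # large q: close to max(inputs)
```

With `q=0` the ratio reduces to (almost) the arithmetic mean, and as `q` grows the largest input dominates both sums, so the ratio approaches the maximum.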
Artificial neural networks
In neural network simulations, the term softmax activation function refers to a similar function defined by
$$p_i = \frac{\exp(q_i)}{\sum_{j=1}^{n} \exp(q_j)}$$
where the vector $q$ is the net input to the softmax layer, $p_i$ is the output of node $i$, and $n$ is the number of nodes in the softmax layer. It ensures that all of the output values are between 0 and 1 and that their sum is 1. This is a generalization of the logistic function to multiple variables.
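As a minimal sketch, the function $p_i = \exp(q_i) / \sum_j \exp(q_j)$ can be computed as follows. Subtracting the largest input before exponentiating is a common numerical safeguard that is assumed here, not part of the definition; it leaves the result unchanged because the common factor cancels between numerator and denominator. The name `softmax` is illustrative.

```python
import math

def softmax(q):
    """Map a list of net inputs q to outputs in (0, 1) that sum to 1."""
    shift = max(q)                            # guards against overflow in exp
    exps = [math.exp(v - shift) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])

# With two nodes and the second net input fixed at 0, the first output
# equals the logistic function of the first net input.
two_node = softmax([2.0, 0.0])[0]
logistic_value = 1.0 / (1.0 + math.exp(-2.0))
```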
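Since the function maps a vector and a specific index $i$ to a real value, the derivative needs to take the index into account:
$$\frac{\partial p_i}{\partial q_k} = p_i \left(\delta_{ik} - p_k\right)$$
where $p_i$ is the $i$-th output, $q_k$ is the $k$-th net input, and $\delta_{ik}$ is the Kronecker delta (1 if $i = k$, 0 otherwise).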
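The derivative $\partial p_i / \partial q_k = p_i(\delta_{ik} - p_k)$ can be checked against a finite-difference estimate. The sketch below is illustrative only; the helper names `softmax_jacobian` and `numerical_jacobian` and the step size `eps` are assumptions made for this example.

```python
import math

def softmax(q):
    """Softmax of a list of net inputs (max subtracted for stability)."""
    shift = max(q)
    exps = [math.exp(v - shift) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(q):
    """Entry [i][k] is p_i * (delta_ik - p_k)."""
    p = softmax(q)
    n = len(p)
    return [[p[i] * ((1.0 if i == k else 0.0) - p[k]) for k in range(n)]
            for i in range(n)]

def numerical_jacobian(q, eps=1e-6):
    """Central-difference estimate of d p_i / d q_k."""
    n = len(q)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        up, down = list(q), list(q)
        up[k] += eps
        down[k] -= eps
        p_up, p_down = softmax(up), softmax(down)
        for i in range(n):
            jac[i][k] = (p_up[i] - p_down[i]) / (2.0 * eps)
    return jac

q = [0.5, -1.0, 2.0]
analytic = softmax_jacobian(q)
numeric = numerical_jacobian(q)
max_error = max(abs(analytic[i][k] - numeric[i][k])
                for i in range(len(q)) for k in range(len(q)))
```

Because the outputs always sum to 1, each column of the Jacobian sums to zero, which gives a second quick sanity check.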
See Multinomial logit for a probability model which uses the softmax activation function.
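Reinforcement learning
In reinforcement learning, a softmax function can be used to convert action values into action probabilities. A commonly used form is
$$P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^{n} \exp(q_t(i)/\tau)}$$
where the action value $q_t(a)$ corresponds to the expected reward of following action $a$ and $\tau$ is called a temperature parameter (in allusion to chemical kinetics). For high temperatures ($\tau \to \infty$), all actions have nearly the same probability; the lower the temperature, the more the expected rewards affect the probability. For a low temperature ($\tau \to 0^+$), the probability of the action with the highest expected reward tends to 1.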
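The effect of the temperature can be illustrated with a short sketch of softmax action selection, assuming action probabilities proportional to $\exp(q(a)/\tau)$. The function names `action_probabilities` and `select_action` and the example action values are hypothetical and chosen only for illustration.

```python
import math
import random

def action_probabilities(action_values, temperature):
    """Probabilities proportional to exp(value / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in action_values]
    shift = max(scaled)                       # guards against overflow in exp
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def select_action(action_values, temperature, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = action_probabilities(action_values, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

values = [1.0, 2.0, 3.0]
hot = action_probabilities(values, temperature=1000.0)   # nearly uniform
cold = action_probabilities(values, temperature=0.01)    # nearly greedy
```

At the high temperature the three probabilities are all close to 1/3, while at the low temperature almost all of the probability mass sits on the action with the highest value.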
Smooth approximation of maximum
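When parameterized by some constant $\alpha$, the following formulation becomes a smooth, differentiable approximation of the maximum function:
$$S_\alpha(x_1, \dots, x_n) = \frac{\sum_{i=1}^{n} x_i e^{\alpha x_i}}{\sum_{i=1}^{n} e^{\alpha x_i}}$$
$S_\alpha$ has the following properties:
- $S_\alpha \to \max_i x_i$ as $\alpha \to \infty$
- $S_0$ is the average of its inputs
- $S_\alpha \to \min_i x_i$ as $\alpha \to -\infty$
The gradient of softmax is given by:
$$\nabla_{x_i} S_\alpha = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}} \left[1 + \alpha \left(x_i - S_\alpha\right)\right]$$
which makes the softmax function useful for optimization techniques that use gradient descent.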
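As a hedged illustration, the smooth maximum $S_\alpha = \sum_i x_i e^{\alpha x_i} / \sum_i e^{\alpha x_i}$ and its gradient $\frac{e^{\alpha x_i}}{\sum_j e^{\alpha x_j}}\left[1 + \alpha(x_i - S_\alpha)\right]$ can be sketched as below. The function names and the sample data are assumptions for this example; the shift by the largest exponent is a numerical safeguard that does not change the value.

```python
import math

def _weights(x, alpha):
    """Normalized weights exp(alpha * x_i) / sum_j exp(alpha * x_j)."""
    shift = max(alpha * v for v in x)         # guards against overflow in exp
    raw = [math.exp(alpha * v - shift) for v in x]
    total = sum(raw)
    return [r / total for r in raw]

def smooth_max(x, alpha):
    """Weighted average of x with weights proportional to exp(alpha * x_i)."""
    return sum(v * w for v, w in zip(x, _weights(x, alpha)))

def smooth_max_gradient(x, alpha):
    """Partial derivatives of smooth_max with respect to each x_i."""
    s = smooth_max(x, alpha)
    return [w * (1.0 + alpha * (v - s)) for v, w in zip(x, _weights(x, alpha))]

data = [1.0, 2.0, 5.0]
average = smooth_max(data, 0.0)       # alpha = 0: arithmetic mean
near_max = smooth_max(data, 50.0)     # large positive alpha: close to max(data)
near_min = smooth_max(data, -50.0)    # large negative alpha: close to min(data)
gradient = smooth_max_gradient(data, 0.7)
```

Because the gradient has a closed form, the function can be dropped into a gradient-descent objective where a hard `max` would give a gradient that is zero for all but one input.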
The softmax function is also used to standardize data that is positively skewed and includes many values around zero. It takes a variable such as revenue or age and transforms the values to a scale from zero to one. This type of data transformation is needed especially when the data spans many orders of magnitude.
For example, revenues for customers could span anywhere from 0 to 300,000. Suppose the revenue numbers range between 3 and 300,000. Expressed as powers of 10, 3 becomes approximately $10^{0.477}$ and 300,000 becomes approximately $10^{5.477}$. The exponents differ by 5, so these two numbers span 5 orders of magnitude. Range scaling is a typical reason to use a function such as softmax.
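The arithmetic in the revenue example can be reproduced in a few lines. The second half of the sketch squashes values into the interval from zero to one by applying the logistic function to each value's z-score; that particular formula is an assumption made for illustration (one plausible reading of a softmax-style scaling), not a formula given in the text, and the name `logistic_scale` and the sample revenues are hypothetical.

```python
import math

low, high = 3.0, 300000.0
low_exponent = math.log10(low)       # about 0.477
high_exponent = math.log10(high)     # about 5.477
orders_of_magnitude = high_exponent - low_exponent

def logistic_scale(values):
    """Assumed scaling: logistic function of each value's z-score.

    Requires at least two distinct values so the standard deviation is nonzero.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [1.0 / (1.0 + math.exp(-(v - mean) / std)) for v in values]

revenues = [3.0, 40.0, 750.0, 12000.0, 300000.0]   # hypothetical sample
scaled = logistic_scale(revenues)
```

Every scaled value lies strictly between zero and one, and the ordering of the original values is preserved.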
- Cadieu C, Kouh M, Pasupathy A, Conner CE, Riesenhuber M, and Poggio T. A Model of V4 Shape Selectivity and Invariance. J Neurophysiol 98: 1733-1750, 2007.
- Serre T, Kouh M, Cadieu C, Knoblich U, Kreiman G, and Poggio T. A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex. CBCL Paper 259/AI Memo 2005-036. Cambridge, MA: MIT, 2005.
- ai-faq: What is a softmax activation function?
- Sutton, R. S. and Barto A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. Softmax Action Selection.
- Pyle (1999). Data Preparation for Data Mining. pp. 271–274, 355–359.