# LogSumExp

The LogSumExp (LSE) (also called RealSoftMax[1] or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.[2] It is defined as the logarithm of the sum of the exponentials of the arguments:

${\displaystyle \mathrm {LSE} (x_{1},\dots ,x_{n})=\log \left(\exp(x_{1})+\cdots +\exp(x_{n})\right).}$

## Properties

The LogSumExp function domain is ${\displaystyle \mathbb {R} ^{n}}$, the real coordinate space, and its codomain is ${\displaystyle \mathbb {R} }$, the real line. It is an approximation to the maximum ${\displaystyle \max _{i}x_{i}}$ with the following bounds

${\displaystyle \max {\{x_{1},\dots ,x_{n}\}}\leq \mathrm {LSE} (x_{1},\dots ,x_{n})\leq \max {\{x_{1},\dots ,x_{n}\}}+\log(n).}$

The first inequality is strict unless ${\displaystyle n=1}$. The second inequality is strict unless all arguments are equal. (Proof: Let ${\displaystyle m=\max _{i}x_{i}}$. Then ${\displaystyle \exp(m)\leq \sum _{i=1}^{n}\exp(x_{i})\leq n\exp(m)}$. Applying the logarithm to the inequality gives the result.)
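The bounds above are easy to check numerically. The following is a minimal sketch (the helper name `logsumexp` is our own; it is the naive definition, not an overflow-safe one):

```python
import math

def logsumexp(xs):
    """Naive LSE: the log of the sum of exponentials of the arguments."""
    return math.log(sum(math.exp(x) for x in xs))

xs = [1.0, 2.0, 3.0]
m = max(xs)
lse = logsumexp(xs)

# max <= LSE <= max + log(n); the left inequality is strict for n > 1
assert m < lse <= m + math.log(len(xs))
```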

In addition, we can scale the function to make the bounds tighter. Consider the function ${\displaystyle {\frac {1}{t}}\mathrm {LSE} (tx)}$. Then

${\displaystyle \max {\{x_{1},\dots ,x_{n}\}}<{\frac {1}{t}}\mathrm {LSE} (tx)\leq \max {\{x_{1},\dots ,x_{n}\}}+{\frac {\log(n)}{t}}.}$

(Proof: Replace each ${\displaystyle x_{i}}$ with ${\displaystyle tx_{i}}$ for some ${\displaystyle t>0}$ in the inequalities above, to give

${\displaystyle \max {\{tx_{1},\dots ,tx_{n}\}}<\mathrm {LSE} (tx_{1},\dots ,tx_{n})\leq \max {\{tx_{1},\dots ,tx_{n}\}}+\log(n).}$

and, since ${\displaystyle t>0}$

${\displaystyle t\max {\{x_{1},\dots ,x_{n}\}}<\mathrm {LSE} (tx_{1},\dots ,tx_{n})\leq t\max {\{x_{1},\dots ,x_{n}\}}+\log(n).}$

Finally, dividing by ${\displaystyle t}$ gives the result.)
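The tightening effect of the temperature ${\displaystyle t}$ can be observed directly. A small sketch (helper names are illustrative, and the naive exponentials limit how large ${\displaystyle t}$ can be before overflow):

```python
import math

def logsumexp(xs):
    return math.log(sum(math.exp(x) for x in xs))

def tempered_lse(xs, t):
    """(1/t) * LSE(t*x): the gap to max(xs) shrinks like log(n)/t."""
    return logsumexp([t * x for x in xs]) / t

xs = [0.5, 1.5, -0.2]
for t in (1.0, 4.0, 16.0):
    gap = tempered_lse(xs, t) - max(xs)
    # 0 <= gap <= log(n)/t, so the approximation tightens as t grows
    assert 0.0 <= gap <= math.log(len(xs)) / t
```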

If we instead multiply by a negative number, we obtain a comparison to the ${\displaystyle \min }$ function:

${\displaystyle \min {\{x_{1},\dots ,x_{n}\}}-{\frac {\log(n)}{t}}\leq {\frac {1}{-t}}\mathrm {LSE} (-tx)<\min {\{x_{1},\dots ,x_{n}\}}.}$

The LogSumExp function is convex, and is strictly increasing everywhere in its domain[3] (but not strictly convex everywhere[4]).
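The failure of strict convexity is easy to see: LSE is affine along the all-ones direction, since adding a constant ${\displaystyle c}$ to every argument factors ${\displaystyle e^{c}}$ out of the sum. A quick numerical check of this identity (the helper name is our own):

```python
import math

def logsumexp(xs):
    return math.log(sum(math.exp(x) for x in xs))

# LSE(x + c*1) = LSE(x) + c: LSE is affine along the all-ones direction,
# so it cannot be strictly convex.
xs = [0.4, -1.2, 2.0]
c = 3.0
assert abs(logsumexp([x + c for x in xs]) - (logsumexp(xs) + c)) < 1e-12
```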

Writing ${\displaystyle \mathbf {x} =(x_{1},\dots ,x_{n}),}$ the partial derivatives are:

${\displaystyle {\frac {\partial }{\partial x_{i}}}{\mathrm {LSE} (\mathbf {x} )}={\frac {\exp x_{i}}{\sum _{j}\exp {x_{j}}}},}$

which means the gradient of LogSumExp is the softmax function.
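This gradient identity can be verified with finite differences. A minimal sketch under the naive definitions (helper names are illustrative):

```python
import math

def logsumexp(xs):
    return math.log(sum(math.exp(x) for x in xs))

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

xs = [0.2, -1.0, 0.7]
h = 1e-6
for i in range(len(xs)):
    bumped = list(xs)
    bumped[i] += h
    # Forward-difference estimate of d LSE / d x_i
    fd = (logsumexp(bumped) - logsumexp(xs)) / h
    assert abs(fd - softmax(xs)[i]) < 1e-4
```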

The convex conjugate of LogSumExp is the negative entropy.

## Log-sum-exp trick for log-domain calculations

The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.[5]

Just as multiplication in the linear scale becomes simple addition in the log scale, addition in the linear scale becomes the LSE in the log scale:

${\displaystyle \mathrm {LSE} (\log(x_{1}),...,\log(x_{n}))=\log(x_{1}+\dots +x_{n})}$
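This identity is what makes LSE the "addition" of log-domain arithmetic. A minimal sketch (the helper name and example values are our own):

```python
import math

def logsumexp(xs):
    return math.log(sum(math.exp(x) for x in xs))

probs = [0.1, 0.25, 0.05]
log_probs = [math.log(p) for p in probs]

# Summing probabilities in the log domain: LSE(log x_i) = log(sum x_i)
assert abs(logsumexp(log_probs) - math.log(sum(probs))) < 1e-12
```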

A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating point numbers.[6]

Unfortunately, using LSE directly in this way can itself cause overflow or underflow. The following mathematically equivalent form must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient); many math libraries, such as IT++, provide a default LSE routine that uses this formula internally.

${\displaystyle \mathrm {LSE} (x_{1},\dots ,x_{n})=x^{*}+\log \left(\exp(x_{1}-x^{*})+\cdots +\exp(x_{n}-x^{*})\right)}$

where ${\displaystyle x^{*}=\max {\{x_{1},\dots ,x_{n}\}}}$.
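A minimal sketch of the shifted formula (the helper name is our own). Subtracting the maximum makes every exponent non-positive, so `exp` cannot overflow; by contrast, a naive `math.exp(1000.0)` raises `OverflowError` in Python:

```python
import math

def logsumexp_stable(xs):
    """LSE via the max-shift trick: every exponent is <= 0 after the shift."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Inputs far too large for the naive formula are handled exactly:
# LSE(1000, 1000) = 1000 + log(2)
xs = [1000.0, 1000.0]
assert abs(logsumexp_stable(xs) - (1000.0 + math.log(2.0))) < 1e-9
```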

## A strictly convex log-sum-exp type function

LSE is convex but not strictly convex. We can define a strictly convex log-sum-exp type function[7] by adding an extra argument set to zero:

${\displaystyle \mathrm {LSE} _{0}^{+}(x_{1},...,x_{n})=\mathrm {LSE} (0,x_{1},...,x_{n})}$

This function is a proper Bregman generator (strictly convex and differentiable). It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
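For a single argument, ${\displaystyle \mathrm {LSE} _{0}^{+}(x)=\log(1+e^{x})}$ is the softplus function, the cumulant (log-partition) function of the Bernoulli family. A minimal sketch (the helper name is our own, and it uses the naive formula without the max-shift trick):

```python
import math

def lse0(xs):
    """LSE with an extra argument fixed at zero (strictly convex variant)."""
    return math.log(1.0 + sum(math.exp(x) for x in xs))

# With one argument this reduces to the softplus log(1 + e^x)
x = 0.3
assert abs(lse0([x]) - math.log(1.0 + math.exp(x))) < 1e-12
```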

In tropical analysis, this is the sum in the log semiring.

## References

6. ^ "Practical issues: Numeric stability". CS231n Convolutional Neural Networks for Visual Recognition.
7. ^ Nielsen, Frank; Hadjeres, Gaetan (2018). "Monte Carlo Information Geometry: The dually flat case". arXiv:1803.07225.