# Huber loss

In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. A variant for classification is also sometimes used.

## Definition

Figure: Huber loss (green, $\delta =1$) and squared error loss (blue) as a function of $y-f(x)$.

The Huber loss function describes the penalty incurred by an estimation procedure $f$. Huber (1964) defines the loss function piecewise by

$L_{\delta }(a)={\begin{cases}{\frac {1}{2}}{a^{2}}&{\text{for }}|a|\leq \delta ,\\\delta \cdot \left(|a|-{\frac {1}{2}}\delta \right),&{\text{otherwise.}}\end{cases}}$

This function is quadratic for small values of $a$ and linear for large values, with equal values and slopes of the two pieces at the two points where $|a|=\delta$. The variable $a$ often refers to the residuals, that is, to the difference between the observed and predicted values, $a=y-f(x)$, so the definition can be expanded to

$L_{\delta }(y,f(x))={\begin{cases}{\frac {1}{2}}(y-f(x))^{2}&{\text{for }}|y-f(x)|\leq \delta ,\\\delta \cdot \left(|y-f(x)|-{\frac {1}{2}}\delta \right),&{\text{otherwise.}}\end{cases}}$

The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it "smooths out" the corner of the absolute value function at the origin.
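The piecewise definition can be sketched in a few lines of Python. This is a minimal illustration for a single residual, not a reference implementation; the function names `huber_loss` and `huber_derivative` are chosen here for the example and do not come from any particular library.

```python
def huber_loss(a: float, delta: float = 1.0) -> float:
    """Huber loss L_delta(a) for a single residual a = y - f(x)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    abs_a = abs(a)
    if abs_a <= delta:
        # Quadratic region: behaves like half the squared error.
        return 0.5 * a * a
    # Linear region: slope delta, shifted so value and slope match at |a| = delta.
    return delta * (abs_a - 0.5 * delta)


def huber_derivative(a: float, delta: float = 1.0) -> float:
    """Derivative of the Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, a))
```

With $\delta = 1$, a residual of 0.5 falls in the quadratic region and a residual of 3 in the linear region. The derivative is the clipped residual, which is why a single large residual contributes only a bounded amount to the gradient.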

## Motivation

Two very commonly used loss functions are the squared loss, $L(a)=a^{2}$, and the absolute loss, $L(a)=|a|$. The squared loss function results in an arithmetic mean-unbiased estimator, and the absolute-value loss function results in a median-unbiased estimator (in the one-dimensional case, and a geometric median-unbiased estimator in the multi-dimensional case). The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of residuals $a_{i}$ (as in ${\textstyle \sum _{i=1}^{n}L(a_{i})}$), the sample mean is influenced too much by a few particularly large values of $a$ when the distribution is heavy-tailed. In terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions.

As defined above, the Huber loss function is strongly convex in a uniform neighborhood of its minimum $a=0$ ; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points $a=-\delta$ and $a=\delta$ . These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function).
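To make the trade-off concrete, the sketch below estimates a location parameter by minimizing the summed Huber loss and compares it with the sample mean. The solver is an assumption of this example: it uses iteratively reweighted averaging (each point gets weight $\min(1,\delta /|x_{i}-\mu |)$), which is one common way to compute such an estimate but is not prescribed by the text. The name `huber_location` and the toy data are hypothetical.

```python
from statistics import mean, median


def huber_location(xs, delta=1.0, tol=1e-9, max_iter=1000):
    """Location estimate minimizing sum_i L_delta(x_i - mu).

    Assumption: solved by iteratively reweighted averaging, started at the median.
    """
    mu = median(xs)
    for _ in range(max_iter):
        # Weight is 1 inside the quadratic region and delta/|residual| outside it.
        weights = [delta / max(abs(x - mu), delta) for x in xs]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical toy data: four typical values and one gross outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
print("mean:", mean(data))
print("median:", median(data))
print("Huber estimate:", huber_location(data, delta=1.0))
```

On this toy data the outlier pulls the mean up to 22, while the Huber estimate stays close to the median of 3. For data without outliers, all weights equal 1 and the estimate coincides with the mean.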

## Pseudo-Huber loss function

The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function. It combines the best properties of the L2 squared loss and the L1 absolute loss by being strongly convex close to the minimum and less steep for extreme values. The value of $\delta$ controls both the scale at which the Pseudo-Huber loss function transitions from L2-like behavior near the minimum to L1-like behavior for extreme values and the steepness at extreme values. Unlike the Huber loss, the Pseudo-Huber loss function has continuous derivatives of all orders. It is defined as

$L_{\delta }(a)=\delta ^{2}\left({\sqrt {1+(a/\delta )^{2}}}-1\right).$

As such, this function approximates $a^{2}/2$ for small values of $a$, and approximates a straight line with slope $\delta$ for large values of $a$.
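The two limiting behaviors can be checked numerically with a short sketch. The function name `pseudo_huber_loss` is chosen for this example only.

```python
import math


def pseudo_huber_loss(a: float, delta: float = 1.0) -> float:
    """Pseudo-Huber loss: delta^2 * (sqrt(1 + (a/delta)^2) - 1)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return delta**2 * (math.sqrt(1.0 + (a / delta) ** 2) - 1.0)


# Small residual: the loss is close to a^2 / 2.
small = 1e-3
print(pseudo_huber_loss(small), small**2 / 2)

# Large residual: the loss grows by about delta per unit increase in a.
delta = 2.0
print(pseudo_huber_loss(1001.0, delta) - pseudo_huber_loss(1000.0, delta))
```

The first pair of printed values should agree closely, and the last printed value should be close to `delta`.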

While the above is the most common form, other smooth approximations of the Huber loss function also exist.

## Variant for classification

For classification purposes, a variant of the Huber loss called modified Huber is sometimes used. Given a prediction $f(x)$ (a real-valued classifier score) and a true binary class label $y\in \{+1,-1\}$ , the modified Huber loss is defined as

$L(y,f(x))={\begin{cases}\max(0,1-y\,f(x))^{2}&{\textrm {for}}\,\,y\,f(x)\geq -1,\\-4y\,f(x)&{\textrm {otherwise.}}\end{cases}}$

The term $\max(0,1-y\,f(x))$ is the hinge loss used by support vector machines; the quadratically smoothed hinge loss is a generalization of $L$.
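As a minimal sketch, the modified Huber loss for one example can be written directly from the two cases. The function name `modified_huber_loss` and the argument name `score` (standing in for $f(x)$) are assumptions of this example.

```python
def modified_huber_loss(y: int, score: float) -> float:
    """Modified Huber loss for a label y in {+1, -1} and a real-valued score f(x)."""
    if y not in (1, -1):
        raise ValueError("y must be +1 or -1")
    margin = y * score
    if margin >= -1.0:
        # Squared hinge region: zero once the margin reaches 1.
        return max(0.0, 1.0 - margin) ** 2
    # Linear region for badly misclassified points.
    return -4.0 * margin
```

At a margin of exactly $-1$ both cases give the value 4, so the loss is continuous there. Confidently correct predictions (margin at least 1) incur zero loss, and badly misclassified points are penalized only linearly.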

## Applications

The Huber loss function is used in robust statistics, M-estimation and additive modelling.