Hinge loss

The vertical axis represents the value of the Hinge loss (in blue) and zero-one loss (in green) for fixed t = 1, while the horizontal axis represents the value of the prediction y. The plot shows that the Hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

${\displaystyle \ell (y)=\max(0,1-t\cdot y)}$

Note that ${\displaystyle y}$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, ${\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b}$, where ${\displaystyle (\mathbf {w} ,b)}$ are the parameters of the hyperplane and ${\displaystyle \mathbf {x} }$ is the input variable(s).

When t and y have the same sign (meaning y predicts the right class) and ${\displaystyle |y|\geq 1}$, the hinge loss ${\displaystyle \ell (y)=0}$. When they have opposite signs, ${\displaystyle \ell (y)}$ increases linearly with ${\displaystyle |y|}$. The loss is also positive when ${\displaystyle |y|<1}$, even if t and y have the same sign (a correct prediction, but not by enough margin).
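The three cases above can be sketched in a few lines of NumPy; the inputs here are made-up illustrations, not values from the text:

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss max(0, 1 - t*y) for a raw score y and label t in {-1, +1}."""
    return np.maximum(0.0, 1.0 - t * y)

# Correct prediction with margin: zero loss
print(hinge_loss(2.0, 1))    # 0.0
# Correct sign, but inside the margin: small positive loss
print(hinge_loss(0.5, 1))    # 0.5
# Wrong sign: loss grows linearly with |y|
print(hinge_loss(-1.0, 1))   # 2.0
```

Because `np.maximum` broadcasts, the same function also evaluates the loss over a whole array of scores at once.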

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

${\displaystyle \ell (y)=\max(0,1+\max _{y\neq t}\mathbf {w} _{y}\mathbf {x} -\mathbf {w} _{t}\mathbf {x} )}$

where ${\displaystyle t}$ is the target label and ${\displaystyle \mathbf {w} _{t}}$, ${\displaystyle \mathbf {w} _{y}}$ are the model parameters.

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

${\displaystyle \ell (y)=\sum _{y\neq t}\max(0,1+\mathbf {w} _{y}\mathbf {x} -\mathbf {w} _{t}\mathbf {x} )}$
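Both multiclass variants can be written directly from the formulas. A minimal sketch, assuming a weight matrix `W` with one row of parameters per class (the toy `W` and `x` below are invented for illustration):

```python
import numpy as np

def crammer_singer_hinge(W, x, t):
    """Crammer-Singer multiclass hinge: max(0, 1 + max_{y != t} w_y.x - w_t.x)."""
    scores = W @ x                    # one score per class
    others = np.delete(scores, t)     # scores of all classes y != t
    return max(0.0, 1.0 + others.max() - scores[t])

def weston_watkins_hinge(W, x, t):
    """Weston-Watkins variant: sum over y != t of max(0, 1 + w_y.x - w_t.x)."""
    scores = W @ x
    margins = np.maximum(0.0, 1.0 + scores - scores[t])
    margins[t] = 0.0                  # exclude the target class from the sum
    return margins.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 classes, 2 features
x = np.array([1.0, 1.0])
print(crammer_singer_hinge(W, x, 0))  # scores are [1, 1, 1] -> 1.0
print(weston_watkins_hinge(W, x, 0))  # two violating classes -> 2.0
```

The example makes the difference concrete: with all class scores tied, Crammer-Singer charges only the single worst competing class, while Weston-Watkins sums the violation over every competing class.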

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, φ the joint feature function, and Δ the Hamming loss:

${\displaystyle {\begin{aligned}\ell (\mathbf {y} )&=\max(0,\Delta (\mathbf {y} ,\mathbf {t} )+\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {y} )\rangle -\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {t} )\rangle )\\&=\max(0,\max _{y\in {\mathcal {Y}}}\left(\Delta (\mathbf {y} ,\mathbf {t} )+\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {y} )\rangle \right)-\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {t} )\rangle )\end{aligned}}}$
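For a small enough output space, the margin-rescaled structured hinge loss can be evaluated by brute-force enumeration. A toy sketch, where the label space, the joint feature function `joint_feature`, and all numeric values are invented for illustration (real structured SVMs replace the enumeration with an efficient loss-augmented inference step):

```python
import numpy as np
from itertools import product

def hamming(y, t):
    """Hamming loss Delta(y, t): number of mismatched positions."""
    return sum(a != b for a, b in zip(y, t))

def joint_feature(x, y):
    """Toy joint feature phi(x, y): input features gated by each label position."""
    return np.concatenate([x * yi for yi in y])

def structured_hinge(w, x, t, label_space):
    """Margin-rescaled structured hinge: max(0, max_y (Delta + <w, phi(x,y)>) - <w, phi(x,t)>)."""
    score_t = w @ joint_feature(x, t)
    worst = max(hamming(y, t) + w @ joint_feature(x, y) for y in label_space)
    return max(0.0, worst - score_t)

x = np.array([1.0, 2.0])
t = (1, -1)                                    # true structured label
label_space = list(product([-1, 1], repeat=2)) # all 4 candidate outputs
w = np.ones(4) * 0.1
print(round(structured_hinge(w, x, t, label_space), 6))  # 2.0
```

With this symmetric `w`, the loss-augmented maximum is attained by the label that flips both positions, so the loss equals its Hamming distance of 2.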

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function ${\displaystyle y=\mathbf {w} \cdot \mathbf {x} }$ that is given by

${\displaystyle {\frac {\partial \ell }{\partial w_{i}}}={\begin{cases}-t\cdot x_{i}&{\text{if }}t\cdot y<1\\0&{\text{otherwise}}\end{cases}}}$
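This subgradient leads directly to a simple training loop. A minimal subgradient-descent sketch for a bias-free linear SVM, on made-up linearly separable data:

```python
import numpy as np

def hinge_subgrad_step(w, x, t, lr=0.1):
    """One subgradient step on max(0, 1 - t * w.x); nonzero only when t*y < 1."""
    y = w @ x
    if t * y < 1:
        w = w + lr * t * x    # subgradient is -t*x, so step in the +t*x direction
    return w

# Toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
T = np.array([1, 1, -1, -1])

w = np.zeros(2)
for _ in range(100):
    for x, t in zip(X, T):
        w = hinge_subgrad_step(w, x, t)

# Every training point ends up on the correct side with margin >= 1
print(all(t * (w @ x) >= 1 for x, t in zip(X, T)))  # True
```

Note that a practical SVM also adds an L2 regularization term on w; the sketch minimizes the bare hinge loss only, so it stops updating as soon as every point clears the unit margin.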
Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red). The vertical axis is the loss ℓ(y) and the horizontal axis is z = ty.

However, since the derivative of the hinge loss at ${\displaystyle ty=1}$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

${\displaystyle \ell (y)={\begin{cases}{\frac {1}{2}}-ty&{\text{if}}~~ty\leq 0,\\{\frac {1}{2}}(1-ty)^{2}&{\text{if}}~~0<ty<1,\\0&{\text{if}}~~1\leq ty\end{cases}}}$

or the quadratically smoothed
${\displaystyle \ell _{\gamma }(y)={\begin{cases}{\frac {1}{2\gamma }}\max(0,1-ty)^{2}&{\text{if}}~~ty\geq 1-\gamma \\1-{\frac {\gamma }{2}}-ty&{\text{otherwise}}\end{cases}}}$

suggested by Zhang.[8] The modified Huber loss ${\displaystyle L}$ is a special case of this loss function with ${\displaystyle \gamma =2}$, specifically ${\displaystyle L(t,y)=4\ell _{2}(y)}$.
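Both smoothed variants, and the stated relationship to the modified Huber loss, are easy to check numerically. A short sketch (the test points are arbitrary):

```python
def smooth_hinge_rennie(ty):
    """Rennie & Srebro's piecewise-smooth hinge, as a function of z = t*y."""
    if ty <= 0:
        return 0.5 - ty
    if ty < 1:
        return 0.5 * (1 - ty) ** 2
    return 0.0

def smooth_hinge_zhang(ty, gamma):
    """Zhang's quadratically smoothed hinge, l_gamma."""
    if ty >= 1 - gamma:
        return max(0.0, 1 - ty) ** 2 / (2 * gamma)
    return 1 - gamma / 2 - ty

# Modified Huber loss: L(t, y) = max(0, 1 - ty)^2 for ty >= -1, else -4*ty.
# Verify it equals 4 * l_2(y), the gamma = 2 special case from the text.
for ty in [-2.0, -0.5, 0.5, 1.5]:
    L = 4 * smooth_hinge_zhang(ty, 2.0)
    expected = max(0.0, 1 - ty) ** 2 if ty >= -1 else -4 * ty
    print(abs(L - expected) < 1e-12)  # True at every test point
```

Both functions agree with the plain hinge loss for ty ≥ 1 (zero loss) but replace its kink at ty = 1 with a quadratic piece, which is what makes them differentiable everywhere.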