# Hinge loss

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1] For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

$\ell(y) = \max(0, 1-t \cdot y)$

Note that y should be the "raw" output of the SVM's decision function, not the predicted class label. E.g., in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$.

It can be seen that when $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $|y|$ (one-sided error); the loss is also positive when the prediction has the right sign but falls inside the margin, i.e. $0 \le t \cdot y < 1$.
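As a minimal sketch, the definition above can be written directly in plain Python (the function name is illustrative):

```python
def hinge_loss(y, t):
    """Hinge loss for a raw classifier score y and a true label t in {-1, +1}."""
    return max(0.0, 1.0 - t * y)

print(hinge_loss(2.5, +1))   # 0.0  (correct side, outside the margin)
print(hinge_loss(0.5, +1))   # 0.5  (correct side, but inside the margin)
print(hinge_loss(-2.0, +1))  # 3.0  (wrong side: 1 - (-2) = 3)
```

Note that the score passed in is the raw decision value, not a predicted class label, matching the convention above.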

## Extensions

While SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] there exists a "true" multiclass version of the hinge loss due to Crammer and Singer,[3] defined for a linear classifier as[4]

$\ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x})$
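Under the same notation, the Crammer–Singer loss can be sketched in plain Python, assuming the per-class weight vectors $\mathbf{w}_y$ are stored as rows of a list `W` and `t` is the index of the true class (names are illustrative):

```python
def multiclass_hinge(W, x, t):
    """Crammer-Singer multiclass hinge loss for a linear classifier.

    W : per-class weight vectors (list of lists)
    x : feature vector
    t : index of the correct class
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    best_wrong = max(s for y, s in enumerate(scores) if y != t)
    return max(0.0, 1.0 + best_wrong - scores[t])

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three classes, two features
print(multiclass_hinge(W, [2.0, 0.0], 0))  # 0.0: class 0 wins with margin >= 1
print(multiclass_hinge(W, [2.0, 0.0], 1))  # 3.0: true class scores 0, best wrong class scores 2
```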

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs use the following variant, where w denotes the SVM's parameters, φ the joint feature function, and Δ the Hamming loss:[5]

$\ell(\mathbf{y}) = \max_{\mathbf{y} \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle$
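For a small, enumerable output space, the loss-augmented maximization can be sketched by brute force in Python; the helper names `phi` and `delta` are illustrative placeholders for the joint feature function and the task loss:

```python
def structured_hinge_loss(w, phi, delta, Y, x, t):
    """Structured hinge loss via brute-force loss-augmented decoding.

    w     : parameter vector (list of floats)
    phi   : joint feature map, phi(x, y) -> list of floats
    delta : task loss between two outputs, e.g. Hamming distance
    Y     : the (small, enumerable) structured output space
    x, t  : input and its correct structured output
    """
    def score(y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))
    # max over all outputs of (task loss + score), minus the score of the truth
    return max(delta(y, t) + score(y) for y in Y) - score(t)

# Toy example: outputs are pairs of bits, delta is the Hamming distance.
Y = [(0, 0), (0, 1), (1, 0), (1, 1)]
phi = lambda x, y: [y[0] * x[0], y[1] * x[1]]
delta = lambda y, t: sum(a != b for a, b in zip(y, t))
loss = structured_hinge_loss([0.5, -1.0], phi, delta, Y, [1.0, 1.0], (1, 0))  # 0.5
```

In practice the output space is exponentially large, so structured SVMs replace this enumeration with a problem-specific loss-augmented decoding step (e.g. dynamic programming over sequences).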

## Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable (at the point $t \cdot y = 1$), but it has a subgradient with respect to the model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$, given by

$\frac{\partial\ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1 \\ 0 & \text{otherwise} \end{cases}$
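A single subgradient-descent step using this expression can be sketched in plain Python (no bias term or regularization, purely for illustration):

```python
def hinge_subgradient_step(w, x, t, lr=0.1):
    """One subgradient-descent step on the hinge loss for one example, t in {-1, +1}."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # raw score w . x
    if t * y < 1:                              # margin violated: subgradient is -t * x
        return [wi + lr * t * xi for wi, xi in zip(w, x)]
    return list(w)                             # flat zero region: subgradient is 0

# Repeated steps over a toy dataset drive both margins up to at least 1.
w = [0.0, 0.0]
for _ in range(20):
    w = hinge_subgradient_step(w, [1.0, 0.0], +1)
    w = hinge_subgradient_step(w, [-1.0, 0.5], -1)
```

Once an example satisfies $t \cdot y \ge 1$, it contributes a zero subgradient and stops moving the weights, which is what makes the resulting classifier depend only on the margin-violating (support) vectors.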

## References

1. ^ Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
2. ^ Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems. Lecture Notes in Computer Science 3541. p. 278. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
3. ^ Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines". Journal of Machine Learning Research 2: 265–292.
4. ^ Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models". Proc. Symposium on Machine Learning in Speech and Language Processing.
5. ^ Spengler, Alex; Bordes, Antoine; Gallinari, Patrick (2010). "A comparison of discriminative classifiers for web news content extraction". RIAO.