In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as
Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. E.g., in linear SVMs, .
It can be seen that when t and y have the same sign (meaning y predicts the right class) and , the hinge loss , but when they have opposite sign, increases linearly with y (one-sided error).
While SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion, there exists a "true" multiclass version of the hinge loss due to Crammer and Singer, defined for a linear classifier as
In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs use the following variant, where y denotes the SVM's parameters, φ the joint feature function, and Δ the Hamming loss:
The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function that is given by
However, since the derivative of the hinge loss at is non-deterministic, smoothed versions may be preferred for optimization, such as the quadratically smoothed
- Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
- Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems. LNCS 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
- Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines". J. Machine Learning Research 2: 265–292.
- Moore, Robert C.; DeNero, John (2011). "Proc. Symp. on Machine Learning in Speech and Language Processing".
- Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML.