In machine learning, the bias–variance dilemma or bias–variance tradeoff is the problem of simultaneously minimizing two sources of model error: the bias (how far the model's average prediction, taken across different training sets, lies from the true value) and the variance (how sensitive the model's predictions are to small changes in the training set). This tradeoff applies to all forms of supervised learning: classification, regression (function fitting), and structured output learning.
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both captures the regularities in its training data and generalizes well to unseen data. Models with high bias are intuitively simple models: they impose restrictions on the kinds of regularities that can be learned (linear classifiers are one example). The problem with such models is that they underfit, i.e., they fail to learn the relationship between the target variable and the features. Models with high variance are those that can represent many kinds of complex regularities, but this flexibility includes the ability to fit noise in the training data, i.e., to overfit. To achieve good performance on data outside the training set, a tradeoff must be made.
This trade-off is closely related to overfitting and underfitting. Models that deviate little from their training data are typically poor predictors on non-training inputs and change significantly when the training set changes (making them sensitive to outliers and noise). Conversely, models that deviate systematically from the training data can resist noise and generalise well, although too strong a deviation degrades performance.
Typically the goal is to find an optimal trade-off between bias and variance. A common model selection criterion is that the decrease of bias with increasing model complexity becomes equal to the increase of variance. There are however other possible considerations, such as error losses or complexity costs, that may lead to other trade-offs. The choice of model may also introduce biases that correspond to useful prior information, for example that the output must be within a given interval.
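The tradeoff between underfitting and overfitting can be illustrated with a small simulation. The sketch below (a hypothetical setup, not from any of the referenced works) fits polynomials of increasing degree to noisy samples of a sine function: a degree-1 model underfits (high training and test error), while a very high-degree model fits the training data closely but generalizes worse.

```python
# Illustrative sketch: polynomial fits of increasing degree to noisy data.
# The function, noise level, and degrees chosen here are all assumptions
# made for the example.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

# Training and held-out test data drawn from y = f(x) + noise.
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(0, 0.3, 200)

def mse_for_degree(d):
    """Least-squares polynomial fit of degree d; returns (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, d)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for d in (1, 3, 15):
    train_err, test_err = mse_for_degree(d)
    print(f"degree {d:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

Training error falls monotonically with degree, but test error is minimized at an intermediate complexity, which is the tradeoff described above.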
Suppose we have data derived from the true model $y = f(x) + \varepsilon$, where $\varepsilon$ is a random variable with $\mathrm{E}[\varepsilon] = 0$ and $\mathrm{Var}[\varepsilon] = \sigma^2$.

Given the data, we train a model $\hat{f}(x)$ to approximate $f(x)$. For brevity and clarity, $\hat{f}(x)$ will be written as $\hat{f}$ below. Similarly $f(x)$ as $f$.

The mean-squared error is $\mathrm{MSE} = (y - \hat{f})^2$.

The value of interest is the expectation of the MSE across different realisations of the data, $\mathrm{E}[(y - \hat{f})^2]$.

This formula can be split apart:

$$\mathrm{E}[(y - \hat{f})^2] = \mathrm{E}[(f + \varepsilon - \hat{f})^2] = \mathrm{E}[\varepsilon^2] + \mathrm{E}[(f - \hat{f})^2] + 2\,\mathrm{E}[\varepsilon (f - \hat{f})]$$

Now, since $f$ is deterministic and $\mathrm{E}[\varepsilon] = 0$, we have $\mathrm{E}[\varepsilon^2] = \mathrm{Var}[\varepsilon] = \sigma^2$, $\mathrm{E}[\varepsilon f] = f\,\mathrm{E}[\varepsilon] = 0$, and, because the noise on a new observation is independent of the trained model, $\mathrm{E}[\varepsilon \hat{f}] = \mathrm{E}[\varepsilon]\,\mathrm{E}[\hat{f}] = 0$. Hence the last term above sums to zero:

$$\mathrm{E}[(y - \hat{f})^2] = \sigma^2 + \mathrm{E}[(f - \hat{f})^2]$$

The second term can be decomposed in the same way, using the trick of adding and subtracting the mean $\mathrm{E}[\hat{f}]$:

$$\mathrm{E}[(f - \hat{f})^2] = (f - \mathrm{E}[\hat{f}])^2 + \mathrm{E}[(\hat{f} - \mathrm{E}[\hat{f}])^2]$$

so that

$$\mathrm{E}[(y - \hat{f})^2] = \sigma^2 + (f - \mathrm{E}[\hat{f}])^2 + \mathrm{Var}[\hat{f}]$$
The expected MSE is hence composed of an irreducible error due to the noise (the first term), a squared bias due to the model systematically deviating from $f$ (the second term), and a variance due to the model's predictions fluctuating around their own average across realisations of the data (the third term).
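The decomposition above can be checked numerically. The sketch below (an illustrative setup; the function $f$, noise level, and the deliberately crude mean estimator are all assumptions) repeatedly draws data from $y = f(x) + \varepsilon$, records the prediction and the squared error at a fixed test input, and verifies that the expected MSE matches $\sigma^2 + \mathrm{bias}^2 + \mathrm{variance}$.

```python
# Minimal Monte Carlo check of: E[(y - fhat)^2] = sigma^2 + bias^2 + variance.
# All choices (f, sigma, the estimator) are illustrative.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x0 = 0.7                      # fixed test input
f = lambda x: x ** 2          # true deterministic function

n_sets, n_points = 20000, 10
preds = np.empty(n_sets)      # fhat for each training realisation
mses = np.empty(n_sets)       # squared error on a fresh observation
for i in range(n_sets):
    # Estimator: the mean of n_points noisy samples of f(x0) (deliberately crude).
    y_train = f(x0) + rng.normal(0, sigma, n_points)
    fhat = y_train.mean()
    y_new = f(x0) + rng.normal(0, sigma)       # fresh test observation
    preds[i] = fhat
    mses[i] = (y_new - fhat) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print(f"expected MSE           : {mses.mean():.4f}")
print(f"sigma^2 + bias^2 + var : {sigma**2 + bias_sq + variance:.4f}")
```

Here the estimator is unbiased, so the bias term is near zero and the variance term is approximately $\sigma^2 / n$; both sides of the identity agree up to Monte Carlo error.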
Cross-validation is widely used to check model error by testing on data not part of the training set. Multiple rounds with randomly selected test sets are averaged together to reduce variability of the cross-validation; high variability of the model will produce high average errors on the test set.
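A k-fold version of this procedure can be sketched in a few lines. The code below is a plain-numpy illustration (the fold count, data, and straight-line model are assumptions, not a library API): the data are split into k folds, and each fold in turn is held out while the model is fit on the rest.

```python
# Illustrative k-fold cross-validation for a straight-line least-squares fit.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
y = 3.0 * x + 1.0 + rng.normal(0, 0.2, 60)

def kfold_mse(x, y, k=5):
    """Average held-out MSE over k folds of a degree-1 polynomial fit."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False                       # held-out fold
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        pred = slope * x[fold] + intercept
        errors.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errors))

print(f"5-fold CV estimate of test MSE: {kfold_mse(x, y):.4f}")
```

Averaging over folds (and, as the text notes, over multiple random splits) reduces the variability of the error estimate itself.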
Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance, e.g.:
- (Generalized) linear models can be regularized to increase their bias.
- In neural networks, deeper models (with more layers) tend to have lower bias but higher variance. As with GLMs, regularization is typically applied.
- In k-nearest neighbor models, a high value of k leads to high bias and low variance, while a low value of k leads to low bias and high variance.
- In instance-based learning, regularization can be achieved by varying the mixture of prototypes and exemplars.
- In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.
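The k-nearest-neighbor behaviour in the list above can be demonstrated empirically. The sketch below (an illustrative brute-force 1-D k-NN; the function, noise level, and values of k are assumptions) draws many training sets and measures, at a fixed query point, how the squared bias and the variance of the prediction change with k.

```python
# Illustrative check: in k-NN, larger k lowers variance and raises bias.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)   # true function (an assumption for the demo)
x0 = 0.5                      # fixed query point

def knn_predict(x_train, y_train, x_query, k):
    """Mean of the k nearest training targets (brute-force 1-D k-NN)."""
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[nearest].mean()

def pred_stats(k, n_sets=2000, n_points=50):
    """Squared bias and variance of the k-NN prediction at x0, over many
    independent training sets drawn from y = f(x) + noise."""
    preds = np.empty(n_sets)
    for i in range(n_sets):
        x_train = rng.uniform(0, 1, n_points)
        y_train = f(x_train) + rng.normal(0, 0.3, n_points)
        preds[i] = knn_predict(x_train, y_train, x0, k)
    bias = preds.mean() - f(x0)
    return bias ** 2, preds.var()

for k in (1, 5, 25):
    b2, var = pred_stats(k)
    print(f"k={k:2d}: bias^2={b2:.4f}, variance={var:.4f}")
```

Averaging over more neighbors smooths out the noise (lower variance) but pulls the prediction toward the local average of $f$ rather than its value at $x_0$ (higher bias).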
One way of resolving the trade-off is to use mixture models and ensemble learning. For example, boosting combines many "weak" (high-bias) models into an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
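The bagging half of this claim can be sketched directly: averaging a high-variance learner over bootstrap resamples produces predictions that fluctuate less across training sets. The code below is an illustrative setup (1-nearest-neighbour as the base learner, with made-up data; none of the names come from a library API).

```python
# Illustrative check: bagging a high-variance learner (1-NN) reduces the
# variance of its prediction at a fixed query point.
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(3 * x)
x0 = 0.5

def one_nn(x_train, y_train, x_query):
    """Prediction of the single nearest training point (high variance)."""
    return y_train[np.argmin(np.abs(x_train - x_query))]

def bagged_one_nn(x_train, y_train, x_query, n_boot=25):
    """Average of 1-NN fits on n_boot bootstrap resamples of the data."""
    n = len(x_train)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # bootstrap resample
        preds.append(one_nn(x_train[idx], y_train[idx], x_query))
    return float(np.mean(preds))

def variance_at_x0(predictor, n_sets=1000, n_points=40):
    """Variance of predictor's output at x0 across independent training sets."""
    preds = np.empty(n_sets)
    for i in range(n_sets):
        x_train = rng.uniform(0, 1, n_points)
        y_train = f(x_train) + rng.normal(0, 0.3, n_points)
        preds[i] = predictor(x_train, y_train, x0)
    return preds.var()

print(f"single 1-NN variance: {variance_at_x0(one_nn):.4f}")
print(f"bagged 1-NN variance: {variance_at_x0(bagged_one_nn):.4f}")
```

The bagged predictor averages the noise of several nearby points instead of committing to one, which is the variance-reduction mechanism bagging relies on.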
- S. Geman; E. Bienenstock; R. Doursat (1992). "Neural networks and the bias/variance dilemma". Neural Computation 4: 1–58.
- Bias–Variance decomposition. In Encyclopedia of Machine Learning. Eds. Claude Sammut, Geoffrey I. Webb. Springer 2011. pp. 100–101.
- Sethu Vijayakumar. The Bias–Variance Tradeoff. Lecture notes, University of Edinburgh, 2007. http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
- Gagliardi, F. (2011). "Instance-based classifiers applied to medical databases: diagnosis and knowledge extraction". Artificial Intelligence in Medicine. Volume 52, Issue 3, Pages 123–139. http://dx.doi.org/10.1016/j.artmed.2011.04.002
- Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer.
- Jo-Anne Ting, Sethu Vijayakumar, Stefan Schaal. Locally Weighted Regression for Control. In Encyclopedia of Machine Learning. Eds. Claude Sammut, Geoffrey I. Webb. Springer 2011. p. 615.
- Scott Fortmann-Roe. Understanding the Bias–Variance Tradeoff. 2012. http://scott.fortmann-roe.com/docs/BiasVariance.html