User:ALaughingHorse/sandbox

Diagram of k-fold cross-validation with k=4.

Cross-validation, sometimes called rotation estimation,[1][2][3] is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, the data are split into a training set and a test set (also called a validation set). The training set is used to build the model. Once the model is built, the explanatory variables (the design matrix) of the test set are fed into the model, and its outputs are treated as predictions that are then compared with the response variables (or labels) of the test set to see how well they agree. The goal of cross-validation is to provide a measure of a model's accuracy while limiting problems such as overfitting, and to give insight into how the model will generalize to an independent dataset (i.e., an unseen dataset, for instance from a real-world problem).
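As a minimal sketch of this split-fit-predict-compare workflow (assuming Python with scikit-learn, which the text above does not prescribe; the dataset and model here are placeholders), a single train/test split might look like the following.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: a design matrix X and a binary response y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split the data into a training set and a test (validation) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Build the model on the training set only.
model = LogisticRegression().fit(X_train, y_train)

# Feed the explanatory variables of the test set into the model;
# the outputs are the "predictions" described above.
predictions = model.predict(X_test)

# Compare the predictions with the held-out response values.
print("test-set accuracy:", accuracy_score(y_test, predictions))
```

In k-fold cross-validation (as in the diagram above), this split is repeated k times so that every observation is used for testing exactly once, and the resulting measures of fit are averaged.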

Measures of fit

The goal of cross-validation is to estimate the expected level of fit of a model to a dataset that is independent of the data used to train the model. It can be used to estimate any quantitative measure of fit that is appropriate for the data and model. For example, in binary classification problems, each case in the validation set is predicted either correctly or incorrectly. In this situation, the misclassification error rate can be used to summarize the fit, although other measures such as positive predictive value could also be used. However, the interpretation of the misclassification error rate or the positive predictive value depends strongly on the proportions of each class in the response variable. For example, a model validated to have a 5% misclassification rate might be considered good when predicting the outcomes of fair coin tosses (where 50% heads and 50% tails are expected). Yet if a model has a 5% misclassification rate when predicting whether a patient has a rare disease for which 95% of the response values are expected to take the same value, then a trivial model that always predicts the majority class achieves roughly the same error rate, so little can be concluded about the model's performance from this measure of fit alone. When the value being predicted is continuously distributed, the mean squared error, root mean squared error, or median absolute deviation can be used to summarize the errors.
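The following sketch (again assuming scikit-learn; the data, class proportions, and models are purely illustrative) computes a cross-validated misclassification rate and compares it with a majority-class baseline, echoing the rare-disease example above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Illustrative imbalanced data: about 95% of responses share one value.
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated accuracy of a fitted model ...
model_acc = cross_val_score(LogisticRegression(), X, y, cv=cv,
                            scoring="accuracy")

# ... versus a baseline that always predicts the majority class.
baseline_acc = cross_val_score(DummyClassifier(strategy="most_frequent"),
                               X, y, cv=cv, scoring="accuracy")

# Both misclassification rates are near 5%, so the error rate alone
# says little about the model's usefulness in this setting.
print("model misclassification rate:   ", 1 - model_acc.mean())
print("baseline misclassification rate:", 1 - baseline_acc.mean())
```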

Cross-validation for time-series models

In time-series analysis, observations are often correlated, so the order of the data matters. Standard cross-validation can be problematic in this setting because the data are usually split randomly into training and test sets under an assumption of independence among observations. One way to resolve this problem for time-series data is to define the training and test sets by time period. For example, if the data are recorded over a month, the training set could consist of observations from the first three weeks and the test set of observations from the last week. A more appropriate approach may be forward chaining, in which the model is repeatedly trained on all observations up to a given point in time and tested on the observations that immediately follow, with the training window growing at each step (see the sketch below).
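A minimal sketch of forward chaining, assuming scikit-learn's TimeSeriesSplit (an expanding-window splitter) and a made-up daily series for one month; the data and model are illustrative assumptions, not part of the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Illustrative daily series: earlier observations are used for training,
# later ones for testing, so temporal order is preserved.
t = np.arange(30)
y = 0.5 * t + rng.normal(scale=1.0, size=30)
X = t.reshape(-1, 1)

# Forward chaining: each split trains on all data up to a point in time
# and tests on the block that immediately follows it.
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: train up to day {train_idx[-1]}, "
          f"test days {test_idx[0]}-{test_idx[-1]}, MSE = {mse:.2f}")
```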

  1. ^ Geisser, Seymour (1993). Predictive Inference. New York, NY: Chapman and Hall. ISBN 0-412-03471-9.
  2. ^ Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection". Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. 2 (12). San Mateo, CA: Morgan Kaufmann: 1137–1143. CiteSeerX 10.1.1.48.529.
  3. ^ Devijver, Pierre A.; Kittler, Josef (1982). Pattern Recognition: A Statistical Approach. London, GB: Prentice-Hall.