In supervised machine learning, part of the training data often cannot be captured by the chosen model. This can happen for two reasons. The data may contain random fluctuations or measurement errors that the model does not account for; this is appropriately called stochastic noise. Alternatively, the phenomenon being modeled (or learned) may be too complex for the model, so the data contain structure that the model cannot represent; this added complexity has been called deterministic noise. Although the two types of noise arise from different causes, their adverse effect on learning is similar: overfitting occurs because the model attempts to fit the noise (stochastic or deterministic), that is, the part of the data it cannot model, at the expense of the part it can model. When either type of noise is present, it is usually advisable to regularize the learning algorithm to prevent overfitting and the resulting loss of performance. Regularization typically yields a model with lower variance at the expense of higher bias.
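The effect of deterministic noise and of regularization can be illustrated with a small sketch. The setup is an assumption made for illustration only: the target `sin(4πx)`, the sample size of 15, the 10th-degree Legendre polynomial model, and the penalty `lam=0.1` are not taken from the references. The training labels contain no stochastic noise, so any misfit comes from the target being too complex for the model.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def fit_poly_ridge(x, y, degree, lam):
    """Least-squares polynomial fit with an L2 (weight-decay) penalty lam."""
    X = legvander(x, degree)                    # Legendre features on [-1, 1]
    A = X.T @ X + lam * np.eye(degree + 1)      # regularized normal equations
    return np.linalg.solve(A, X.T @ y)

def mse(w, x, y):
    """Mean squared error of the polynomial with weights w on points (x, y)."""
    X = legvander(x, len(w) - 1)
    return float(np.mean((X @ w - y) ** 2))

# Hypothetical target: noiseless, but too complex for a degree-10 polynomial.
def target(x):
    return np.sin(4 * np.pi * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 15)
y_train = target(x_train)                       # no stochastic noise is added

w_plain = fit_poly_ridge(x_train, y_train, degree=10, lam=0.0)
w_ridge = fit_poly_ridge(x_train, y_train, degree=10, lam=0.1)

# Compare in-sample error with error on a dense grid of unseen points.
x_test = np.linspace(-1, 1, 2000)
for name, w in [("unregularized", w_plain), ("regularized", w_ridge)]:
    print(name, mse(w, x_train, y_train), mse(w, x_test, target(x_test)))
```

The unregularized fit always has the lower in-sample error, because it minimizes exactly that quantity. The regularized fit has smaller weights and often, though not always, a lower error on the unseen points; how much it helps depends on the sample and on the choice of `lam`.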
One may also try to alleviate the effects of noise by detecting and removing noisy training examples before training the supervised learning algorithm. Several algorithms identify noisy training examples, and removing the suspected examples before training usually improves performance.
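As a minimal sketch of this idea, the following filter flags a training example when its label disagrees with the majority label of its k nearest neighbours, leaving the example itself out of the vote. This is a simple stand-in chosen for illustration, not the specific algorithm of any of the works listed below; the function name `flag_suspect_labels`, the binary 0/1 labels, the value `k=5`, and the synthetic two-cluster data are all assumptions.

```python
import numpy as np

def flag_suspect_labels(X, y, k=5):
    """Return a boolean mask marking examples whose 0/1 label disagrees with
    the majority label of their k nearest neighbours (leave-one-out)."""
    # Pairwise squared Euclidean distances between all training points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]          # indices of the k nearest neighbours
    votes = y[nn].mean(axis=1)                  # fraction of neighbours labelled 1
    predicted = (votes > 0.5).astype(y.dtype)   # use odd k to avoid ties
    return predicted != y

# Synthetic data: two well-separated clusters with a few deliberately flipped labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y_clean = np.repeat([0, 1], 50)
y_noisy = y_clean.copy()
flipped = np.array([3, 17, 60, 88])
y_noisy[flipped] = 1 - y_noisy[flipped]

suspect = flag_suspect_labels(X, y_noisy, k=5)
X_filtered, y_filtered = X[~suspect], y_noisy[~suspect]   # train on these instead
print(np.flatnonzero(suspect))
```

On data this cleanly separated the filter is easy to get right. On realistic data it can also discard correctly labeled examples near the class boundary, so the amount removed should be checked before training on the filtered set.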
- Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin (March 2012). Learning From Data. AMLBook.
- C.E. Brodley and M.A. Friedl (1999). "Identifying and Eliminating Mislabeled Training Instances". Journal of Artificial Intelligence Research 11. pp. 131–167. (http://jair.org/media/606/live-606-1803-jair.pdf)
- M.R. Smith and T. Martinez (2011). "Improving Classification Accuracy by Identifying and Removing Instances that Should Be Misclassified". Proceedings of International Joint Conference on Neural Networks (IJCNN 2011). pp. 2690–2697.