Oversampling and undersampling in data analysis
Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).
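Oversampling and undersampling are opposite and roughly equivalent techniques. Oversampling draws extra samples from one class (typically the under-represented one), while undersampling keeps fewer samples from another (typically the over-represented one). Both deliberately bias the selection so that the class proportions of the result differ from those of the original data.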
The usual reason for oversampling is to correct for a bias in the original dataset. It is useful, for instance, when training a classifier on labelled data from a biased source: labelled training data is valuable, but it often comes from unrepresentative sources.
For example, suppose we have a sample of 1000 people of which about two-thirds (667) are male (perhaps the sample was collected at a football match). We know the general population is 50% female, and we may wish to adjust our dataset to reflect this. Simple oversampling includes each of the 333 female examples twice, and this copying produces a roughly balanced dataset of 1333 samples (666 female and 667 male). Simple undersampling drops 334 of the male samples at random to give a balanced dataset of 666 samples, again with 50% female.
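The arithmetic of this example can be checked with a short sketch. It assumes the sample is two-thirds male (667 men and 333 women); the function names `oversample_by_duplication` and `undersample_at_random` are hypothetical, chosen only for this illustration, and the fixed random seed is there only to make the run repeatable.

```python
import random
from collections import Counter

def oversample_by_duplication(rows, minority):
    """Return a copy of rows in which every minority-class row appears twice."""
    out = []
    for row in rows:
        out.append(row)
        if row[1] == minority:
            out.append(row)  # the duplicate copy
    return out

def undersample_at_random(rows, majority, rng):
    """Keep all other rows and an equally sized random subset of majority rows.

    Assumes the majority class has at least as many rows as the rest.
    """
    rest = [r for r in rows if r[1] != majority]
    major = [r for r in rows if r[1] == majority]
    return rest + rng.sample(major, len(rest))

# Assumed sample: 1000 people, 667 male and 333 female.
sample = [(i, "male") for i in range(667)] + \
         [(i, "female") for i in range(667, 1000)]

over = oversample_by_duplication(sample, "female")
under = undersample_at_random(sample, "male", random.Random(0))

print(len(over), Counter(sex for _, sex in over))
print(len(under), Counter(sex for _, sex in under))
```

Duplication keeps every original row and so loses no information, but it repeats the same female examples. Undersampling avoids repetition at the cost of discarding about half of the male examples.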
There are also more complex oversampling techniques, including ones that create artificial data points instead of copying existing ones.
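One way to create artificial data points is to interpolate between existing examples of the under-represented class. The sketch below is a simplified illustration of that idea, not a reference implementation of any published method: it assumes numeric features stored as tuples, always interpolates towards the single nearest neighbour, and uses the hypothetical names `synthetic_point` and `oversample_synthetic`.

```python
import math
import random

def synthetic_point(a, b, rng):
    """Return a random point on the line segment from a to b."""
    t = rng.random()  # interpolation weight in [0, 1)
    return tuple(x + t * (y - x) for x, y in zip(a, b))

def oversample_synthetic(minority, n_new, rng):
    """Create n_new artificial points from a list of minority-class points.

    Each new point lies between a randomly chosen minority point and its
    nearest other minority point (Euclidean distance). Needs >= 2 points.
    """
    new_points = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        a = minority[i]
        others = minority[:i] + minority[i + 1:]
        b = min(others, key=lambda p: math.dist(a, p))
        new_points.append(synthetic_point(a, b, rng))
    return new_points

# Illustrative minority-class points with two numeric features each.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 4.0)]
new = oversample_synthetic(minority, 5, random.Random(0))
print(new)
```

Because every artificial point lies between two real examples of the same class, the new data stays inside the region the class already occupies rather than repeating identical rows.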
See also
- Oversampling in signal processing, which is unrelated.
References
- Chawla, Nitesh V. (2010). "Data Mining for Imbalanced Datasets: An Overview". In: Maimon, Oded; Rokach, Lior (eds.), Data Mining and Knowledge Discovery Handbook. Springer. pp. 875-886. doi:10.1007/978-0-387-09823-4_45. ISBN 978-0-387-09823-4.