Oversampling and undersampling in data analysis

Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).

Oversampling and undersampling are opposite but roughly equivalent techniques: both deliberately bias the selection process so that more samples are drawn from one class than from another.

The usual reason for oversampling is to correct for a bias in the original dataset. One scenario where it is useful is when training a classifier on labelled training data from a biased source, since labelled training data is valuable but often comes from unrepresentative sources.

For example, suppose we have a sample of 1000 people, of which 66.7% are male (perhaps the sample was collected at a football match). We know the general population is 50% female, and we may wish to adjust our dataset to reflect this. Simple oversampling selects each female example twice, producing a balanced dataset of 1333 samples that is roughly 50% female. Simple undersampling instead drops some of the male samples at random, giving a balanced dataset of 667 samples, again roughly 50% female.

Whether such an adjustment is virtuous depends on the agenda of the sampler or pollster. An oversampling or undersampling bias could be used to compensate for conditions that disadvantage a particular gender, age group or race, in order to design and implement good public policy. But it can also be used to trick or mislead public opinion into approving a program that benefits special interests. It is therefore important to know who commissions or sponsors a given statistical sampling, so as to judge whether they are compensating the sampling fairly or pushing a specific agenda.
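To make the arithmetic concrete, the football-match example above can be reproduced with a short Python sketch. This is illustrative only and not from the article; the label lists, the seed, and the counts simply mirror the numbers above:

    import random

    random.seed(0)  # reproducible illustration

    # Labels for the hypothetical sample: 1000 people, 667 male, 333 female.
    samples = ["male"] * 667 + ["female"] * 333

    males = [s for s in samples if s == "male"]
    females = [s for s in samples if s == "female"]

    # Simple oversampling: select each female example twice, giving
    # 667 + 666 = 1333 samples at roughly 50% female.
    oversampled = males + females * 2

    # Simple undersampling: drop male samples at random, keeping 334,
    # giving 334 + 333 = 667 samples at roughly 50% female.
    undersampled = random.sample(males, 334) + females

    print(len(oversampled), len(undersampled))  # 1333 667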

There are also more complex oversampling techniques, including the creation of artificial data points.
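A well-known technique of this kind is SMOTE (Synthetic Minority Over-sampling Technique), which creates artificial minority-class points by interpolating between an existing minority point and one of its nearest neighbours. The sketch below is a simplified NumPy illustration of that idea, not the full algorithm; the function name synthesize and its parameters are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)

    def synthesize(minority, n_new, k=5):
        """Create n_new artificial points by interpolating between a random
        minority-class point and one of its k nearest neighbours (the core
        idea behind SMOTE, in simplified form)."""
        n = len(minority)
        new_points = []
        for _ in range(n_new):
            i = rng.integers(n)
            # Euclidean distances from point i to every minority point.
            d = np.linalg.norm(minority - minority[i], axis=1)
            # The k nearest neighbours, excluding point i itself (index 0).
            neighbours = np.argsort(d)[1:k + 1]
            j = rng.choice(neighbours)
            # Place the synthetic point uniformly on the segment between them.
            t = rng.random()
            new_points.append(minority[i] + t * (minority[j] - minority[i]))
        return np.array(new_points)

    # Example: 20 two-dimensional minority samples, 10 synthetic additions.
    minority = rng.normal(size=(20, 2))
    print(synthesize(minority, 10).shape)  # (10, 2)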

See also