Sampling bias

In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the population are less likely to be included than others. It results in a biased sample, a non-random sample^[1] of a population (or non-human factors) in which not all participants are equally balanced or objectively represented.^[2] If the bias makes estimation of population parameters impossible, the sample is a non-probability sample.

It is also called ascertainment bias.^[3]^[4] Ascertainment bias has basically the same definition,^[5]^[6] but is still sometimes classified as a separate type of bias.^[5]

Distinction from selection bias

Sampling bias is mostly classified as a subtype of selection bias,^[7] sometimes specifically termed sample selection bias,^[8]^[9] but some classify it as a separate type of bias.^[10] One distinction, albeit not universally accepted, is that sampling bias undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.

However, selection bias and sampling bias are often used synonymously.^[11]

Types of sampling bias

Selection from a specific area. For example, a survey of high school students to measure teenage use of illegal drugs will be a biased sample because it does not include home-schooled students or dropouts. A sample is also biased if certain members are underrepresented or overrepresented relative to others in the population. For example, a "man on the street" interview which selects people who walk by a certain location is going to have an over-representation of healthy individuals, who are more likely to be out of the home than individuals with a chronic illness. Biased sampling is at its most extreme when certain members of the population are totally excluded from the sample (that is, they have zero probability of being selected).
Self-selection bias, which is possible whenever the group of people being studied has any form of control over whether to participate. Participants' decision to participate may be correlated with traits that affect the study, making the participants a non-representative sample. For example, people who have strong opinions or substantial knowledge may be more willing to spend time answering a survey than those who do not. Another example is online and phone-in polls, which are biased samples because the respondents are self-selected. Those individuals who are highly motivated to respond, typically individuals who have strong opinions, are overrepresented, and individuals that are indifferent or apathetic are less likely to respond. This often leads to a polarization of responses with extreme perspectives being given a disproportionate weight in the summary. As a result, these types of polls are regarded as unscientific.
Pre-screening of trial participants, or advertising for volunteers within particular groups. For example, a study to "prove" that smoking does not affect fitness might recruit at the local fitness centre, but advertise for smokers during the advanced aerobics class, and for non-smokers during the weight loss sessions.
Exclusion bias results from exclusion of particular groups from the sample, e.g. exclusion of subjects who have recently migrated into the study area (this may occur when newcomers are not available in a register used to identify the source population). Excluding subjects who move out of the study area during follow-up is roughly equivalent to dropout or nonresponse, a selection bias in that it affects the internal validity of the study.
Healthy user bias, when the study population is likely healthier than the general population, e.g. workers (i.e. someone in ill health is unlikely to have a job as a manual laborer).
Overmatching, matching for an apparent confounder that actually is a result of the exposure. The control group becomes more similar to the cases in regard to exposure than the general population.
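
The polarizing effect of self-selection described in the list above can be illustrated with a small simulation. The following Python sketch is hypothetical throughout: the opinion scale, the response model (a response probability of 0.05 + 0.9 × opinion strength) and the function name simulate_self_selected_poll are assumptions chosen for illustration, not taken from any real poll.

```python
import random


def simulate_self_selected_poll(population_size=100_000, seed=0):
    """Return (mean opinion strength in the population, mean among respondents)."""
    rng = random.Random(seed)
    # Hypothetical opinions from -1 (strongly against) to +1 (strongly for).
    opinions = [rng.uniform(-1.0, 1.0) for _ in range(population_size)]
    # Assumed response model: the stronger the opinion, the more likely a reply.
    respondents = [x for x in opinions if rng.random() < 0.05 + 0.9 * abs(x)]
    population_strength = sum(abs(x) for x in opinions) / len(opinions)
    respondent_strength = sum(abs(x) for x in respondents) / len(respondents)
    return population_strength, respondent_strength


population_strength, respondent_strength = simulate_self_selected_poll()
print(f"mean opinion strength, whole population: {population_strength:.3f}")
print(f"mean opinion strength, respondents only: {respondent_strength:.3f}")
```

Under this assumed model the expected opinion strength is 0.5 in the population but about 0.65 among respondents, so extreme views carry disproportionate weight in the poll summary even though no respondent was excluded outright.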

Problems caused by a biased sample

A biased sample causes problems because any statistic computed from that sample has the potential to be consistently erroneous. The bias can lead to an over- or underestimation of the corresponding parameter in the population. Almost every sample in practice is biased because it is practically impossible to ensure a perfectly random sample. If the degree of under-representation is small, the sample can be treated as a reasonable approximation to a random sample. Also, if the group that is under-represented does not differ markedly from the other groups in the quantity being measured, then the sample can still be a reasonable approximation to a random sample.
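
The two conditions above can be made concrete for a population split into two groups. The following Python sketch is illustrative only: the function name expected_bias and all numbers are hypothetical, and it assumes that sampling is random within each group, so that the expected sample mean mixes the group means in the sample's proportions. Under that assumption the bias equals (sample share minus population share) times (difference between the group means), which vanishes when either factor is zero.

```python
def expected_bias(pop_share_a, sample_share_a, mean_a, mean_b):
    """Expected sample mean minus true population mean for two groups A and B."""
    population_mean = pop_share_a * mean_a + (1 - pop_share_a) * mean_b
    expected_sample_mean = sample_share_a * mean_a + (1 - sample_share_a) * mean_b
    return expected_sample_mean - population_mean


# Hypothetical scenarios: group A makes up half of the population in each case.
slight_underrepresentation = expected_bias(0.5, 0.48, mean_a=60.0, mean_b=40.0)
severe_underrepresentation = expected_bias(0.5, 0.20, mean_a=60.0, mean_b=40.0)
similar_groups = expected_bias(0.5, 0.20, mean_a=50.0, mean_b=50.0)

print(f"slight under-representation: {slight_underrepresentation:+.2f}")
print(f"severe under-representation: {severe_underrepresentation:+.2f}")
print(f"groups with equal means:     {similar_groups:+.2f}")
```

In these made-up scenarios the bias is small when group A is only slightly under-represented, larger when it is severely under-represented, and zero when the two groups do not differ in the quantity being measured.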

The word bias in common usage has a strong negative connotation, and implies a deliberate intent to mislead or other scientific fraud. In statistical usage, bias merely represents a mathematical property, whether it is deliberate, unconscious, or due to imperfections in the instruments used for observation. While some individuals might deliberately use a biased sample to produce misleading results, more often a biased sample is just a reflection of the difficulty in obtaining a truly representative sample.

Some samples use a biased statistical design which nevertheless allows the estimation of parameters. The U.S. National Center for Health Statistics, for example, deliberately oversamples from minority populations in many of its nationwide surveys in order to gain sufficient precision for estimates within these groups.^[12] These surveys require the use of sample weights (see below) to produce proper estimates across all racial and ethnic groups. Provided that certain conditions are met (chiefly that the sample is drawn randomly from the entire population), these samples permit accurate estimation of population parameters.

Historical examples

Example of a biased sample: statistics claiming that, as of June 2008, only 54% of web browsers in use (Internet Explorer) did not pass the Acid2 test. The statistics are from visitors to one website, comprising mostly web developers.^[13]

A classic example of a biased sample and the misleading results it produced occurred in 1936. In the early days of opinion polling, the American Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt, by a large margin. The result was the exact opposite. The Literary Digest survey represented a sample collected from readers of the magazine, supplemented by records of registered automobile owners and telephone users. This sample over-represented rich individuals, who, as a group, were more likely to vote for the Republican candidate. In contrast, a poll of only 50,000 citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.

Another classic example occurred in the 1948 presidential election. On election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning the grinning president-elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The Tribune was mistaken because its editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses. (In many cities, the Bell System telephone directory contained the same names as the Social Register.) In addition, the Gallup poll that the Tribune based its headline on was over two weeks old at the time of the printing.^[14]

Statistical corrections for a biased sample

If entire segments of the population are excluded from a sample, then there are no adjustments that can produce estimates that are representative of the entire population. But if some groups are underrepresented and the degree of underrepresentation can be quantified, then sample weights can correct the bias.

For example, a hypothetical population might include 10 million men and 10 million women. Suppose that a biased sample of 100 patients included 20 men and 80 women. A researcher could correct for this imbalance by attaching a weight of 2.5 for each male and 0.625 for each female. This would adjust any estimates to achieve the same expected value as a sample that included exactly 50 men and 50 women, unless men and women differed in their likelihood of taking part in the survey.
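
The arithmetic of this example can be sketched in Python. The weight for each group is its population share multiplied by the sample size and divided by the number of sample members from that group. The function names and the outcome values (180 for every man, 165 for every woman) are hypothetical and serve only to show the effect of the weights.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per group = population share * sample size / group count in sample."""
    sample_size = sum(sample_counts.values())
    return {
        group: population_shares[group] * sample_size / count
        for group, count in sample_counts.items()
    }


def weighted_mean(values, groups, weights):
    """Mean of values in which each observation counts with its group's weight."""
    total = sum(weights[g] * v for v, g in zip(values, groups))
    weight_sum = sum(weights[g] for g in groups)
    return total / weight_sum


# The sample from the text: 20 men and 80 women from a 50/50 population.
sample_counts = {"men": 20, "women": 80}
population_shares = {"men": 0.5, "women": 0.5}
weights = poststratification_weights(sample_counts, population_shares)

# Hypothetical measurements, made up for illustration only.
groups = ["men"] * 20 + ["women"] * 80
values = [180.0] * 20 + [165.0] * 80

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, groups, weights)
print(weights)     # {'men': 2.5, 'women': 0.625}
print(unweighted)  # 168.0
print(weighted)    # 172.5
```

With these made-up values the unweighted mean is pulled toward the over-represented women, while the weighted mean equals the simple average of the two group means, as would be expected from a sample of 50 men and 50 women.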

References

^ Medical Dictionary - 'Sampling Bias' Retrieved on September 23, 2009
^ TheFreeDictionary – biased sample Retrieved on 2009-09-23. Site in turn cites: Mosby's Medical Dictionary, 8th edition.
^ Weising, Kurt (2005). DNA fingerprinting in plants: principles, methods, and applications. London: Taylor & Francis Group. p. 180. ISBN 0-8493-1488-7.
^ Page 34 in: Selection and linkage desequilibrium tests under complex demographies and ascertainment bias Francesc Calafell i Majó, Anna Ramírez i Soriano. July 2008
^ Panacek: Error in research. Society for Academic Emergency Medicine. Retrieved on Nov 14, 2009
^ medilexicon Medical Dictionary - 'Ascertainment Bias' Retrieved on Nov 14, 2009
^ Dictionary of Cancer Terms – Selection Bias Retrieved on September 23, 2009
^ The effects of sample selection bias on racial differences in child abuse reporting Ards S, Chung C, Myers SL Jr. Child Abuse Negl. 1999 Dec;23(12):1209; author reply 1211-5. PMID: 9504213
^ Sample Selection Bias Correction Theory Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. New York University.
^ Page 262 in: Behavioral Science. Board Review Series. By Barbara Fadem. ISBN 0781782570, 9780781782579. 216 pages
^ Wallace/Maxcy-Rosenau-Last public health & preventive medicine (page 21) 15ed, illustrated. By Robert B. Wallace. ISBN 0071441980, 9780071441988
^ National Center for Health Statistics (2007). Minority Health.
^ "Browser Statistics". Refsnes Data. 2008. Retrieved 2008-07-05.
^ based on http://www.uh.edu/engines/epi1199.htm retrieved on September 29, 2007
