(talk) 11:37, 16 September 2016 (UTC)

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale.
This article has been rated as High-importance on the importance scale.
WikiProject Mathematics (Rated C-class, Mid-importance)
WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating:
C Class
Mid Importance
 Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.

Addition of History Section[edit]

  • Added a section detailing the historical evolution of confounding, up to and including the modern causal interpretations. There are likely other sources that could be added here as well.
  • Added a paragraph to the causal definition.
  • Added a paragraph to the control of confounding section.

AForns (talk) 04:17, 20 August 2014 (UTC)

Reorganization of Introduction[edit]

Some elements of the introduction belong in existing sub-sections. I have reorganized several elements to better the flow of the article, including:

  • The latter part of the introduction starting with: "In the case of risk assessments evaluating the magnitude and nature of risk to human health..." has been moved to the Types of Confounding sub-section.
  • The concrete example portion of the introduction has been moved to the Examples sub-section.
  • The segue for the causal definition was moved to the introduction (did not belong in the Examples).
  • Various simplifications were made to the causal definition.
  • Moved causal definition and control of confounding below introduction as they are more set-up that should precede the examples.

AForns (talk) 10:20, 11 August 2014 (UTC)

Inclusion of Causal Definition and Example[edit]

Perhaps the first step in bringing the present article up to standards is providing an update that reflects modern understanding of causal calculus. With the community's approval and edits, I propose the following amendments that formalize and clarify the causal notion of confounding (see below):

(added to the end of the present section 2 or the beginning of the proposed section 3, below):

The above correlation-based definition, however, is metaphorical at best: a growing number of analysts agree that confounding is a causal concept and, as such, cannot be described in terms of correlations or associations [1][2][3] (see Causal Definition).

3. Causal Definition

The concept of confounding can be formalized, and managed, when information is available about the data generating model (as in the Figure above). To be more specific, let X be some independent variable, Y some dependent variable, and M a causal model that asserts the cause-effect relationships between variables in the system. To estimate the effect of exposure X on outcome Y, the statistician must suppress the effects of extraneous variables that influence both X and Y. We say that X and Y are confounded by some other variable Z whenever Z is a cause of both X and Y.

In the causal framework, denote $P(y \mid do(x))$ as the probability of event Y = y under the hypothetical intervention X = x. X and Y are not confounded in causal model M if and only if the following holds:

$$P(y \mid do(x)) = P(y \mid x)$$

for all values X = x and Y = y, where $P(y \mid x)$ is the conditional probability upon seeing X = x. Intuitively, this equality states that X and Y are not confounded whenever the observationally witnessed association between them is the same as the association that would be measured in a controlled experiment, with x randomized.

4. Control of Confounding

Consider the scenario of a physician deciding to administer drug X to a patient with gender Z. The physician knows that gender differences influence a patient's choice of drug as well as their chances of recovery. In this scenario, gender Z confounds the effect of drug X on recovery outcome Y, since Z is a cause of both X and Y:

Causal diagram of Gender as common cause of Drug use and Recovery

Consequently, we will encounter the inequality:

$$P(y \mid do(x)) \neq P(y \mid x)$$

since the observational quantity contains information about the correlation between X and Z, and the interventional quantity does not (being an unbiased estimate of the effect of X on Y). Clearly the statistician desires the unbiased estimate, but in cases where only observational data is available, an unbiased estimate can only be obtained by "adjusting" for all confounding factors, namely, conditioning on their various values and averaging the result. In the case of a single confounder Z, this leads to the "adjustment formula":

$$P(y \mid do(x)) = \sum_{z} P(y \mid x, z) \, P(z)$$

which gives an unbiased estimate for the causal effect of X on Y. The same adjustment formula works when there are multiple confounders, except that in this case the set Z of variables that guarantees unbiased estimates must be chosen with caution. The criterion for a proper choice of variables is called the Back-Door criterion [4][5]; it requires that the chosen set Z "blocks" (or intercepts) every path between X and Y that contains an arrow into X. Such sets are called "Back-Door admissible" and may include variables which are not common causes of X and Y, but merely proxies thereof.

Returning to the physician example, since Z complies with the Back-Door requirement (i.e., it intercepts the one Back-Door path X ← Z → Y), the Back-Door adjustment formula is valid:

$$P(y \mid do(x)) = \sum_{z} P(y \mid x, z) \, P(z)$$

In this way the physician can predict the likely effect of administering the drug from observational studies in which the conditional probabilities appearing on the right-hand side of the equation can be estimated by regression.
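
As a rough illustration of the adjustment formula for the physician example, here is a minimal Python sketch. All probabilities are invented for illustration, and the names `p_z`, `p_x1_given_z`, `p_y1_given_xz`, `observational` and `adjusted` are hypothetical. It assumes gender Z is the only common cause of drug X and recovery Y, so the adjusted quantity can be read as P(Y = 1 | do(X = x)).

```python
# Hypothetical numbers for the drug (X), gender (Z), recovery (Y) example.
# Z is assumed to be the only common cause of X and Y.

p_z = {0: 0.5, 1: 0.5}                  # P(Z = z)
p_x1_given_z = {0: 0.8, 1: 0.2}         # P(X = 1 | Z = z): drug uptake differs by gender
p_y1_given_xz = {                       # P(Y = 1 | X = x, Z = z), keyed by (x, z)
    (0, 0): 0.3, (1, 0): 0.4,
    (0, 1): 0.7, (1, 1): 0.8,
}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary x."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y = 1 | X = x): what is seen in observational data."""
    num = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    den = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return num / den


def adjusted(x):
    """Adjustment formula: sum over z of P(Y = 1 | x, z) * P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


naive_diff = observational(1) - observational(0)   # association in the raw data
causal_diff = adjusted(1) - adjusted(0)            # effect after adjusting for Z
print(round(naive_diff, 2), round(causal_diff, 2))
```

With these made-up numbers the raw comparison suggests the drug lowers recovery (0.48 versus 0.62), while the adjusted comparison shows it raises recovery by 0.10 in the population, which is the kind of gap between the observational and interventional quantities that the inequality above describes.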


  1. ^ Pearl, J. (2009). Simpson's Paradox, Confounding, and Collapsibility. In Causality: Models, Reasoning and Inference (2nd ed.). New York, NY, USA: Cambridge University Press.
  2. ^ VanderWeele, T.J. & Shpitser, I. (2013). On the definition of a confounder. Annals of Statistics, 41:196-220.
  3. ^ Greenland, S., Robins, J. M., & Pearl, J. (1999). Confounding and Collapsibility in Causal Inference. Statistical Science, 14(1), 29–46.
  4. ^ Pearl, J. (1993). "Aspects of Graphical Models Connected With Causality," Statistical Science
  5. ^ Pearl, J. (2009). Causal Diagrams and the Identification of Causal Effects. In Causality: Models, Reasoning and Inference (2nd ed.). New York, NY, USA: Cambridge University Press.

Additionally, I'd be happy to recruit several experts to touch up the other elements of the page, though I believe the above represents the first, best, and gentlest improvement to the article. Your thoughts and comments are appreciated.

AForns (talk) 06:35, 20 July 2014 (UTC)

three body problem in causation. Some examples may require an arbitrary estimation. (The real world is a special case). I see no errors. It certainly has a generous component of English. (talk) 17:08, 25 July 2014 (UTC)

Sorry to bring negative energy into this talk page but this article is a complete joke and should be flagged for deletion unless it gets serious help[edit]

The terms "negative" and "positive" correlation should never be used as they are relative terms and are not descriptive at all. Much better terms to use are "directly" and "inversely" (talk) 22:21, 13 January 2014 (UTC)

A correlation near one means "almost directly". A correlation near negative one means "almost inversely". (talk) 16:18, 25 July 2014 (UTC)

Page still incorrect and more confusing than ever - read this instead[edit]

Do not use the Wikipedia page if you want to understand confounding. Some of what is written is correct, but the most critical definition is INCORRECT. I don't have time to fix the page, but here is a simple explanation of the concept of confounding.

A confounder (or confounding variable) is something that is correlated with the independent (causative) variable you are investigating, and causes or prevents the effect (dependent variable) you are investigating. Because it is associated with both of them, it will interfere with the ability of statistical tests to correctly indicate the impact of your causative variable; that is, the confounder will cause biased estimates of the impact of your causative variable.

Note that a true confounder is itself another causative/preventive variable. (Variables that are only correlated with the effect won't cause confounding.) For instance, drinking and smoking are correlated; people who do one tend to do the other. Today we know that tobacco worsens heart disease, but alcohol is protective against heart disease. Tobacco's effect is bigger than alcohol's, so together they cause net harm. Early studies of alcohol use and heart disease indicated that alcohol CAUSED heart disease because researchers had no data on smoking. Once both factors were included (along with other important variables), the truth was understood. In this example, tobacco use was the confounder for early studies investigating the influence of alcohol on heart disease.

I'm a health economist, and we call confounders that, or confounding variables. Never heard the term lurking variable except for Wikipedia. —Preceding unsigned comment added by Scientist99 (talkcontribs) 23:18, 18 January 2008 (UTC)
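
To make the alcohol and tobacco example above concrete, here is a small Python sketch on an invented toy population. The population counts, the risk score (smoking adds 3, drinking subtracts 1) and the function names are all assumptions made up for illustration; they are not taken from any study.

```python
# Invented toy population: each tuple is (drinks, smokes), with values 0 or 1.
# Drinking and smoking are deliberately correlated: most people do both or neither.
population = [(0, 0)] * 4 + [(0, 1)] * 1 + [(1, 0)] * 1 + [(1, 1)] * 4


def risk(drinks, smokes):
    """Assumed heart-disease risk score: smoking adds 3, drinking subtracts 1."""
    return 3 * smokes - 1 * drinks


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_difference():
    """Drinkers minus abstainers, ignoring smoking (as if it were unmeasured)."""
    drinkers = mean(risk(a, t) for a, t in population if a == 1)
    abstainers = mean(risk(a, t) for a, t in population if a == 0)
    return drinkers - abstainers


def stratified_difference(smokes):
    """Drinkers minus abstainers within one smoking stratum."""
    drinkers = mean(risk(a, t) for a, t in population if a == 1 and t == smokes)
    abstainers = mean(risk(a, t) for a, t in population if a == 0 and t == smokes)
    return drinkers - abstainers


print(naive_difference())          # positive: alcohol looks harmful
print(stratified_difference(0))    # negative among non-smokers
print(stratified_difference(1))    # negative among smokers
```

In this toy setup the naive comparison makes alcohol look harmful (about +0.8), while within each smoking stratum the difference is -1, the protective effect that was built in. The sign flip comes entirely from drinkers being more likely to smoke.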

Actually, while I agree that this page is atrociously written and wrong, your alternative definition is not any better. Confounding is very easy to define: there is confounding between variables x and y if p(y | x) is not equal to p(y | do(x)). Confounders are very hard to define, and in fact defining them is currently an open problem (I cowrote a paper on the subject currently in review). Many seemingly reasonable definitions of 'confounder' fail. IlyaShpitser (talk) 19:52, 26 October 2011 (UTC)
The wish to merge the articles on 'lurking variable', 'confounding variable' and 'confounding' came about because of the bias toward pure mathematical knowledge, where I believe the term 'lurking variable' is used more than 'confounding variable' to describe a similar concept. Supposed (talk) 13:23, 22 January 2008 (UTC)

Fair comment - I propose to re-write this page completely in the next two weeks. I will put my text here first, and if no one objects strongly, I will replace it. Astaines (talk) 21:54, 2 August 2008 (UTC)

Hi. I'd be interested in contributing to that re-write, so can we make sure any structure for a new article is discussed first? Thanks. Davwillev (talk) 16:37, 6 August 2008 (UTC)

Ice Cream Murder[edit]

Are ice-creams and murder rates really a suitable example for comparison? -- 16:19, 13 August 2007 (UTC)

If they correlate, then as far as I know they correlate only because of a confounding factor like the weather. So, yes, it's a useful example. (talk) 15:42, 25 July 2014 (UTC)

Confounding factor should redirect here.--BigaZon 17:02, 18 February 2006 (UTC)

This is silly. Why are you placing the term 'confounding' under a 'lurking variable' umbrella? It's a term on its own that needs expanding, there could be an article on socio-economic confounding which is a huge topic. I can't find anywhere in the discussion of socio-economic confounding where 'lurking variable' is mentioned. It's a facet of wikipedia. Confounding should have its own article. 23:53, 9 May 2006 (UTC)

No, Confounding factor deals with the statistical use of the term, interchangeable with "lurking variable" or "confounding variable", not "socio-economic" confounding... see introductory texts like "The Practice of Statistics" and many others. Glad this now redirects properly. If you decide to make an article on some other type of confounding, you might want to make a disambiguation page. --BigaZon 00:31, 17 July 2006 (UTC)

I don't know why lurking variable can't mean unknown confounding variable. There's bound to be some mathematical difference between variables that affect significance and not the confidence interval. (I think I got that right, could be the other way around). (talk) 15:42, 25 July 2014 (UTC)

Old requested move[edit]

  • I think "confounding variable" or "confound" is more commonly used than "lurking variable". Is anyone opposed to a name change? --Jcbutler 20:18, 11 February 2007 (UTC)
  • Me too - I was looking for information on this subject and I typed "confounding variable" Mrweetoes 21:28, 4 March 2007 (UTC)
  • Since "Confounding variable" already exists as a redirect, it requires an administrator to move. I've listed it as a request for move. Whosasking 20:09, 5 March 2007 (UTC)
  • I've moved the page, per the request at WP:RM and consensus here. Cheers. -GTBacchus(talk) 03:05, 11 March 2007 (UTC)
  • I noticed calls for improvement to this article from medicine and probability. From a medicine standpoint it seems like it needs to mention the immense effect of confounding variables on medical care and public health policy, with links to the Women's Health Initiative, Social Darwinism, and Evidence Based Medicine. Does that sound about right? Flkevin 02:27, 17 May 2007 (UTC)

Requested move[edit]

  • Confounding variable → Confounding: I recommend that this article be renamed as 'Confounding'. This should redirect from the 'Confounding variable' and 'Lurking variable' articles at least. This is standard terminology in Epidemiology (my own field) and in the quantitative sociology literature. Anyone know what economists call it? Astaines 23:18, 10 November 2007 (UTC)
  • I agree. 'Confounding' can occur in the absence of a known variable.Davwillev (talk) 20:47, 16 January 2008 (UTC)

I've moved the page, per the above discussion. Please let me know if I can be of further assistance. -GTBacchus(talk) 05:16, 24 January 2008 (UTC)

Major edit[edit]

Hi, I've rewritten the section on managing confounding. It was confusing and not very relevant. —Preceding unsigned comment added by Astaines (talkcontribs) 00:13, 28 November 2007 (UTC)

Yes, you've removed many, many interesting things. And the current revision is actually more confusing. For example:
In a typical situation there are far more controls than cases. It is very useful to have a guide for selecting controls.
Uh, what "guide"? I love Wikipedia, but those other people that rob and delete content under the guise of "improvements" are a huge pain. How about contacting the author before doing large changes? I will reinstate the old version of that section.--Keimzelle (talk) 18:45, 15 January 2008 (UTC)


Are the Clever Hans effect, Hawthorne effect and Placebo effect examples of confounding factors? —Preceding unsigned comment added by (talk) 21:55, 8 March 2011 (UTC)

No. Irrelevant. Jimjamjak (talk) 15:55, 12 December 2011 (UTC)


I have moved this Talk contribution to the bottom and replaced the "accuracy" template in the article with an "expert" template as this seems a better one to use for the points here and above. Melcombe (talk) 11:45, 22 July 2011 (UTC)

I tend to agree with the comments at Talk:Confounding#Page still incorrect and more confusing than ever - read this instead.

Certainly in AP Statistics a distinction is made between lurking variables and confounding variables -- they are not considered to be exactly the same.

See for example the following book (Topic 8: Planning and Conducting Experiments)

In terms of this statement:

The methodologies of scientific studies therefore need to control for these factors to avoid a type 1 error; an erroneous 'false positive' conclusion that the dependent variables are in a causal relationship with the independent variable.

When you make a type I error that means that you reject the null hypothesis when the null hypothesis in fact holds. However, the null hypothesis does not necessarily say anything about cause and effect.

So making a type I error does not necessarily mean that the researcher infers a false cause and effect relationship.

Jjjjjjjjjj (talk) 03:18, 22 October 2010 (UTC)

Correct, type I errors say nothing about cause and effect - that depends on whether causation is posited in the hypothesis. The misunderstanding here, I think, is the statement that the variables are in a causal relationship; the author could have said that the variables are correlated. That said, I'm not sure about the value of the observation. Paulwhaley (talk) 18:44, 30 April 2013 (UTC)

Risk assessment[edit]

I am very confused by the content pertaining to "risk assessment" in this article. It seems to me that the studies that are being described are not risk assessments, but epidemiological studies. The two references cited in the second paragraph are articles describing epidemiological studies. I am not denying that epidemiological data may be extensively used in some kinds of human health risk assessments, but I think that the article presents a very confused view of this fact. If I have misunderstood, please could someone clarify what kind of risk assessment this is actually referring to and cite something appropriate? If not, I would suggest that this content is essentially incorrect and I will remove it - there is a considerable amount of detail in other parts of the article dealing with epidemiological studies. Jimjamjak (talk) 15:39, 12 December 2011 (UTC)

Peer review[edit]

This is a very strange point: "Peer review is a process that can assist in reducing instances of confounding. It is a process of evaluating the provision, work process, or output of an individual or collective operating in the same field as the reviewer(s)." I would suggest that this abbreviated overview of peer review is not relevant to the page. Yes - peer review of an article based on an analysis where confounding variables were not identified may be one way in which that analysis can be improved, but then so would review by a colleague, and I don't think we need to describe that process here. Jimjamjak (talk) 15:51, 12 December 2011 (UTC)

You answered your own question in "Unidentified confounding variables". Peer review contributes to understanding causation. People have written books about causation. And if you do not get the mathematics, it's a three body problem. Some statisticians may not be aware of relevant stratifications that have been done. So the trick in doing a good job of it is not only getting good raw data, but getting good approximations of the answer to a three body problem in causation. (talk) 16:43, 25 July 2014 (UTC)

Statistical significance[edit]

I would argue that the concept of statistical significance has no relevance here: "These two variables have a positive, and potentially statistically significant, correlation with each other". Jimjamjak (talk) 15:54, 12 December 2011 (UTC)

Ice cream example[edit]

I quite like the use of the ice cream example in this page, but I feel it is rather long-winded at present. I think that the same message could be put across with much less text, using the same example. Jimjamjak (talk) 15:58, 12 December 2011 (UTC)


The sections seem to go in no particular order. Also, the section "Decreasing the potential for confounding to occur" overlaps somewhat with "Experimental controls". It'd be helpful if someone could reorganize these sections such that there is more logical flow between them. My suggestion:

  • Move ice cream example to introduction
  • Move risk assessment section into examples, supplement with other real-world examples
  • Move types of confounding section to top, right after intro
  • Combine the controls section with the decreasing confounding section

Anyone can feel free to do this, or some other reasonable reorganization. Unfortunately, I don't have the time to do it myself right now. (talk) 08:03, 10 December 2012 (UTC)

Two dubious statements[edit]

I have added "dubious" tags to these two statements, for the following reasons:

  • multivariate analyses reveal much less information about the strength of the confounding variable than do stratification methods.
As far as I've ever known, they find out exactly the same things if done correctly, though one or the other may be more convenient or easier to do in certain contexts.
  • The best available defense against this possibility [confounding] is often to ... conduct a randomized study of a sufficiently large sample taken as a whole, such that all confounding variables (known and unknown) will be distributed by chance across all study groups.
Unless I'm misunderstanding the intent of this sentence, it's just wrong. If the confounding variable is in fact a confounding variable -- if it is correlated with both the dependent variable and an independent variable of interest -- then this is exactly the approach that leads to spurious results, as the independent variable picks up effects that are really due to the omitted, correlated, confounding variable. Duoduoduo (talk) 18:33, 20 December 2012 (UTC)
Okay, I see the intent of the second quote above: it is intended to say that a potentially confounding variable is made to be not confounding by making it uncorrelated with the dummy variable for the control group. I'll try to find a way to clarify this in the article. My first dubious-tag discussed above still applies. Duoduoduo (talk) 16:51, 21 December 2012 (UTC)
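
The point about randomization in the second quote can be checked with a small simulation. The sketch below is only an illustration under invented assumptions: a binary hidden confounder that adds 2.0 to the outcome, a true treatment effect of 1.0, and a hypothetical function `simulate` that compares coin-flip assignment with self-selection that depends on the confounder.

```python
import random


def simulate(n, randomized, seed=0):
    """Return mean(outcome | treated) - mean(outcome | control) for n subjects."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.random() < 0.5                      # hidden confounder (never used in analysis)
        if randomized:
            x = rng.random() < 0.5                  # coin flip: assignment ignores z
        else:
            x = rng.random() < (0.8 if z else 0.2)  # self-selection: depends on z
        # Assumed outcome model: treatment adds 1.0, confounder adds 2.0, plus noise.
        y = 1.0 * x + 2.0 * z + rng.gauss(0, 1)
        (treated if x else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


print(simulate(20000, randomized=True))    # close to the true effect of 1.0
print(simulate(20000, randomized=False))   # inflated by the confounder
```

Under these assumptions the randomized comparison should land near 1.0 even though the confounder is never measured, because the coin flip makes it uncorrelated with treatment. The self-selected comparison should land near 2.2, since 80% of the treated but only 20% of the controls carry the confounder.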

I think a property of confounding should be removed[edit]

I agree that 2) a confounder C should be associated with variable V and 3) with outcome O independently of T, and that 4) it should not be in the causal pathway between V and O. However, I'm not convinced of 1) marginal association between C and O. Suppose that C is whether a person is blue or green, and that green people would have lower outcomes O in the absence of the treatment V but are treated at a higher rate than blue people, and that the treatment is effective (it leads to higher O), so that the two groups have the same marginal distribution of O. In this case, conditions 2 (treatment V is associated with color C), 3 (controlling for treatment, one color has lower outcomes, so C and O are dependent after controlling for V) and 4 (treatment does not affect color, so C is not in the causal pathway between V and O) are satisfied, but 1 is not (blue and green have the same marginal distribution of O, so C and O are not marginally associated). However, in this case, the estimate of the effect of the treatment would be biased downward if we didn't correct for color, because most of the treated would belong to the green group, i.e. the group with lower average potential (i.e. conditional on the treatment V) values.

Edit: To be more precise, the 4 conditions may be sufficient but not necessary. Since it may be the case that C is independent of O conditional on V, but we don't know how things would be for different values of V (so we should stress that conditioning is not only on factual but also on counterfactual values), I think the conditions should be rephrased as: (A) a confounder C should be associated with variable V and (B) with the potential outcomes O(V), for each possible value of the variable V, and (C) it should not be in the causal pathway between V and O.

— Preceding unsigned comment added by borisba (talk) 10:49, 16 September 2016 (UTC)
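
One way to check the blue/green argument above is with explicit numbers. The Python sketch below uses invented values chosen only to fit the described scenario: equal group sizes, a binary outcome, green people with a lower untreated success rate but a higher treatment rate, and a treatment that raises the success probability by 0.4 in both groups. All names and figures are hypothetical.

```python
# Hypothetical numbers, chosen only to reproduce the blue/green scenario.
share = {"blue": 0.5, "green": 0.5}             # P(C)
treated_frac = {"blue": 0.25, "green": 0.75}    # P(V = 1 | C)
base_success = {"blue": 0.5, "green": 0.3}      # P(O = 1 | V = 0, C)
EFFECT = 0.4                                    # assumed true effect, same in both groups


def p_success(color, treated):
    """P(O = 1 | V, C) under the assumed additive effect."""
    return base_success[color] + (EFFECT if treated else 0.0)


def marginal_success(color):
    """P(O = 1 | C), averaging over who is treated within the group."""
    f = treated_frac[color]
    return f * p_success(color, True) + (1 - f) * p_success(color, False)


def naive_effect():
    """P(O = 1 | V = 1) - P(O = 1 | V = 0), ignoring color."""
    def arm(treated):
        weights = {
            c: share[c] * (treated_frac[c] if treated else 1 - treated_frac[c])
            for c in share
        }
        return sum(weights[c] * p_success(c, treated) for c in share) / sum(weights.values())
    return arm(True) - arm(False)


print(marginal_success("blue"), marginal_success("green"))  # equal: no marginal association
print(naive_effect())                                       # below the true effect of 0.4
```

With these numbers both colors have a marginal success rate of 0.6, so C and O are not marginally associated, yet the unadjusted comparison gives about 0.30 instead of 0.40. That matches the claim that condition 1 can fail while ignoring color still biases the estimate downward.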