Spurious relationship

In statistics, a spurious relationship (see also spurious correlation and spurious regression) is a mathematical relationship in which two or more events or variables are not causally related to each other (i.e. they are independent), yet it may be wrongly inferred that they are, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "common response variable", "confounding factor", or "lurking variable"). A well known case of spurious relationship can be found in the time-series literature, where a spurious regression refers to a regression that provides statistical evidence of a linear relationship between independent non stationary variables. Nonstationarity may be due to the presence of a unit root in both variables.[1][2]

General example

An example of a spurious relationship can be illuminated by examining a city's ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Another popular example is a series of Dutch statistics showing a positive correlation between the number of storks nesting in a series of springs and the number of human babies born at that time. Of course there was no causal connection; they were correlated with each other only because they were correlated with the weather nine months before the observations.[3] However Höfer et al. (2004) showed the correlation to be stronger than just weather variations as he could show in post reunification Germany that, while the number of clinical deliveries was not linked with the rise in stork population, out of hospital deliveries correlated with the stork population.[4]

Detecting spurious relationships

The term "spurious relationship" is commonly used in statistics and in particular in experimental research techniques, both of which attempt to understand and predict direct causal relationships (X → Y). A non-causal correlation can be spuriously created by an antecedent which causes both (W → X and W → Y). Intervening variables (X → W → Y), if undetected, may make indirect causation look direct. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.

Experiments

In experiments, spurious relationships can often be identified by controlling for other factors, including those that have been theoretically identified as possible confounding factors. For example, consider a researcher trying to determine whether a new drug kills bacteria; when the researcher applies the drug to a bacterial culture, the bacteria die. But to help in ruling out the presence of a confounding variable, another culture is subjected to conditions that are as nearly identical as possible to those facing the first-mentioned culture, but the second culture is not subjected to the drug. If there is an unseen confounding factor in those conditions, this control culture will die as well, so that no conclusion of efficacy of the drug can be drawn from the results of the first culture. On the other hand, if the control culture does not die, then the researcher cannot reject the hypothesis that the drug is efficacious.

Non-experimental statistical analyses

Disciplines whose data are mostly non-experimental, such as economics, usually employ observational data to establish causal relationships. The body of statistical techniques used in economics is called econometrics. The main statistical method in econometrics is multivariable regression analysis. Typically a linear relationship such as

${\displaystyle y=a_{0}+a_{1}x_{1}+a_{2}x_{2}+\cdots +a_{k}x_{k}+e}$

is hypothesized, in which ${\displaystyle y}$ is the dependent variable (hypothesized to be the caused variable), ${\displaystyle x_{j}}$ for j = 1, ..., k is the jth independent variable (hypothesized to be a causative variable), and ${\displaystyle e}$ is the error term (containing the combined effects of all other causative variables, which must be uncorrelated with the included independent variables). If there is reason to believe that none of the ${\displaystyle x_{j}}$s is caused by y, then estimates of the coefficients ${\displaystyle a_{j}}$ are obtained. If the null hypothesis that ${\displaystyle a_{j}=0}$ is rejected, then the alternative hypothesis that ${\displaystyle a_{j}\neq 0}$ and equivalently that ${\displaystyle x_{j}}$ causes y cannot be rejected. On the other hand, if the null hypothesis that ${\displaystyle a_{j}=0}$ cannot be rejected, then equivalently the hypothesis of no causal effect of ${\displaystyle x_{j}}$ on y cannot be rejected. Here the notion of causality is one of contributory causality: If the true value ${\displaystyle a_{j}\neq 0}$, then a change in ${\displaystyle x_{j}}$ will result in a change in y unless some other causative variable(s), either included in the regression or implicit in the error term, change in such a way as to exactly offset its effect; thus a change in ${\displaystyle x_{j}}$ is not sufficient to change y. Likewise, a change in ${\displaystyle x_{j}}$ is not necessary to change y, because a change in y could be caused by something implicit in the error term (or by some other causative explanatory variable included in the model).

Regression analysis controls for other relevant variables by including them as regressors (explanatory variables). This helps to avoid mistaken inference of causality due to the presence of a third, underlying, variable that influences both the potentially causative variable and the potentially caused variable: its effect on the potentially caused variable is captured by directly including it in the regression, so that effect will not be picked up as a spurious effect of the potentially causative variable of interest. In addition, the use of multivariate regression helps to avoid wrongly inferring that an indirect effect of, say x1 (e.g., x1x2y) is a direct effect (x1y).

Just as an experimenter must be careful to employ an experimental design that controls for every confounding factor, so also must the user of multiple regression be careful to control for all confounding factors by including them among the regressors. If a confounding factor is omitted from the regression, its effect is captured in the error term by default, and if the resulting error term is correlated with one (or more) of the included regressors, then the estimated regression may be biased or inconsistent (see omitted variable bias).

In addition to regression analysis, the data can be examined to determine if Granger causality exists. The presence of Granger causality indicates both that x precedes y, and that x contains unique information about y.

Other relationships

There are several other relationships defined in statistical analysis as follows.