Omitted-variable bias

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias arises because the model attributes the effect of the missing variables to the included ones, over- or underestimating their effects.

More specifically, OVB is the bias that appears in the parameter estimates of a regression analysis when the assumed specification omits an independent variable that is both a determinant of the dependent variable and correlated with one or more of the included independent variables.

Omitted-variable bias in linear regression

Two conditions must hold true for omitted-variable bias to exist in linear regression:

  • the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient is not zero); and
  • the omitted variable must be correlated with one or more of the included independent variables (i.e., the covariance of the omitted variable and an included independent variable, cov(z, x), is not zero).
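
A small simulation can illustrate why both conditions are needed. The sketch below is only an illustration under assumed values: the data-generating process y = 1.0·x + δ·z + u, the function name `short_regression_slope`, the sample size, and the seed are all hypothetical choices, not part of the general result.

```python
import numpy as np

def short_regression_slope(delta, rho, n=200_000, seed=0):
    """Slope on x from regressing y on x alone, when y = 1.0*x + delta*z + u.

    rho is the population correlation between x and the omitted variable z.
    All numeric values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    # Build z with correlation rho to x (both have unit variance).
    z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    u = rng.standard_normal(n)
    y = 1.0 * x + delta * z + u
    # Simple-regression slope: sample cov(x, y) / sample var(x).
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# (delta, rho): first condition fails, second condition fails, both hold.
for delta, rho in [(0.0, 0.6), (2.0, 0.0), (2.0, 0.6)]:
    slope = short_regression_slope(delta, rho)
    print(f"delta={delta}, rho={rho}: slope on x = {slope:.3f} (true value 1.0)")
```

When either condition fails (δ = 0, or zero correlation between x and z), the estimated slope should stay close to the true value 1.0. When both hold, it should settle near 1 + δ·ρ, which is 2.2 for the values assumed here.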

As an example, consider a linear model of the form

y_i = x_i \beta + z_i \delta + u_i,\qquad i = 1,\dots,n

where

  • x_i is a 1 × p row vector of values of p independent variables observed at time i or for the i-th study participant;
  • β is a p × 1 column vector of unobservable parameters (the response coefficients of the dependent variable to each of the p independent variables in x_i) to be estimated;
  • z_i is a scalar and is the value of another independent variable that is observed at time i or for the i-th study participant;
  • δ is a scalar and is an unobservable parameter (the response coefficient of the dependent variable to z_i) to be estimated;
  • u_i is the unobservable error term occurring at time i or for the i-th study participant; it is an unobserved realization of a random variable having expected value 0 (conditionally on x_i and z_i);
  • y_i is the observation of the dependent variable at time i or for the i-th study participant.

We collect the observations of all variables subscripted i = 1, ..., n, and stack them one below another, to obtain the matrix X and the vectors Y, Z, and U:

 X = \left[ \begin{array}{c} x_1 \\  \vdots \\ x_n \end{array} \right] \in \mathbb{R}^{n\times p},

and

 Y = \left[ \begin{array}{c} y_1 \\  \vdots \\ y_n \end{array} \right],\quad  Z = \left[ \begin{array}{c} z_1 \\  \vdots \\ z_n \end{array} \right],\quad  U = \left[ \begin{array}{c} u_1 \\  \vdots \\ u_n \end{array} \right] \in \mathbb{R}^{n\times 1}.

If the independent variable z is omitted from the regression, then the estimated values of the response parameters of the other independent variables are given by the usual least squares calculation,

\hat{\beta} = (X'X)^{-1}X'Y\,

(where the "prime" notation means the transpose of a matrix and the -1 superscript is matrix inversion).
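
As a rough numerical sketch of this calculation, the estimator can be computed with NumPy. The dimensions, coefficient values, and seed below are assumptions made purely for illustration; the linear system (X'X)b = X'Y is solved directly instead of forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3                      # assumed sample size and number of regressors
X = rng.standard_normal((n, p))    # n x p matrix of included regressors
Z = 0.5 * X[:, 0] + rng.standard_normal(n)   # omitted variable, correlated with X
U = rng.standard_normal(n)
Y = X @ np.array([1.0, -2.0, 0.5]) + 2.0 * Z + U

# Regression that omits Z: solve the normal equations (X'X) b = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq agrees:", np.allclose(beta_hat, beta_lstsq))
```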

Substituting for Y based on the assumed linear model,


\begin{align}
\hat{\beta} & = (X'X)^{-1}X'(X\beta+Z\delta+U) \\
& =(X'X)^{-1}X'X\beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'U \\
& =\beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'U.
\end{align}

On taking expectations conditional on X and Z, the contribution of the final term is zero; this follows from the assumption that U has expected value zero conditional on the regressors. The remaining terms give:


\begin{align}
E[ \hat{\beta} \mid X, Z ] & = \beta + (X'X)^{-1}X'Z\delta \\
& = \beta + \text{bias}.
\end{align}

The second term after the equal sign is the omitted-variable bias in this case. It is non-zero if δ is non-zero and the omitted variable z is correlated with any of the included variables in the matrix X (that is, if X'Z is not a vector of zeroes). Note that (X'X)^{-1}X'Z is the vector of coefficients from a least-squares regression of Z on X, so the bias equals δ times the portion of z_i which is "explained" by x_i.
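
The final line of the derivation is an exact algebraic identity in any given sample, which can be checked numerically. The following is a minimal sketch with assumed values for β, δ, the sample size, and the seed; the helper `ols` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
beta = np.array([1.0, -0.5])   # assumed true coefficients on X
delta = 2.0                    # assumed true coefficient on the omitted z

X = rng.standard_normal((n, 2))
# z is correlated with the first column of X, so X'Z is non-zero.
Z = 0.7 * X[:, 0] + rng.standard_normal(n)
U = rng.standard_normal(n)
Y = X @ beta + Z * delta + U

def ols(A, b):
    """Least-squares coefficients (A'A)^{-1} A'b via the normal equations."""
    return np.linalg.solve(A.T @ A, A.T @ b)

beta_short = ols(X, Y)   # regression that omits z
gamma = ols(X, Z)        # (X'X)^{-1} X'Z: coefficients from regressing Z on X
noise = ols(X, U)        # (X'X)^{-1} X'U: zero in expectation

# Last line of the derivation, term by term.
reconstructed = beta + gamma * delta + noise

print("short-regression estimate:     ", beta_short)
print("beta + bias term + noise term: ", reconstructed)
print("bias term (X'X)^{-1} X'Z delta:", gamma * delta)
```

Under these assumptions the bias term should be close to 0.7 × 2.0 = 1.4 for the first coefficient and close to zero for the second, since only the first column of X is correlated with z.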

Effects on ordinary least squares

The Gauss–Markov theorem states that, in regression models which fulfill the classical linear regression model assumptions, the ordinary least squares (OLS) estimator is the best linear unbiased estimator, that is, it has the smallest variance among all linear unbiased estimators. For omitted-variable bias, the relevant assumption of the classical linear regression model is that the error term is uncorrelated with the regressors.

Omitting a relevant variable violates this particular assumption, because the omitted variable is absorbed into the error term, which then becomes correlated with the included regressors. The violation causes the OLS estimator to be biased and inconsistent. The direction of the bias depends on the sign of the omitted variable's coefficient as well as on the covariance between the regressors and the omitted variable. In the simple case of a single included regressor, an omitted variable with a positive coefficient and a positive covariance with the regressor will lead the OLS estimate of the regressor's coefficient to be, on average, greater than the true value of that coefficient. This effect can be seen by taking the expectation of the estimator, as shown in the previous section.
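
The dependence of the direction of the bias on these two signs can be sketched with a simulation of the single-regressor case. Everything specific here is an assumption for illustration: the model y = 1.0·x + δ·z + u, the function name `simulated_bias`, and the chosen values of δ and of the correlation between x and z.

```python
import numpy as np

def simulated_bias(delta, rho, n=100_000, seed=0):
    """Bias of the slope on x when z is omitted from y = 1.0*x + delta*z + u."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    # z has unit variance and correlation rho with x.
    z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    y = 1.0 * x + delta * z + rng.standard_normal(n)
    slope = (x @ y) / (x @ x)   # no-intercept OLS, i.e. (X'X)^{-1} X'Y with p = 1
    return slope - 1.0          # estimate minus the assumed true coefficient

for delta in (1.5, -1.5):
    for rho in (0.5, -0.5):
        b = simulated_bias(delta, rho)
        direction = "upward" if b > 0 else "downward"
        print(f"delta={delta:+.1f}, corr(x,z)={rho:+.1f}: bias={b:+.3f} ({direction})")
```

In this setup the bias should be upward when δ and the correlation share the same sign and downward when they differ, with a magnitude near |δ·ρ| = 0.75.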
