# Multicollinearity

In statistics, multicollinearity (also known as collinearity) is a phenomenon in which one or more predictor variables in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the data or in the procedure used to fit the model.

Contrary to popular belief, including collinear variables does not reduce the predictive power or reliability of the model as a whole, nor does it reduce how accurately coefficients are estimated. In fact, high collinearity indicates that it is exceptionally important to include all variables, as excluding any variable will cause strong confounding.[1]

Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the predictors. In such a case, the design matrix ${\displaystyle X}$ has less than full rank, and therefore the moment matrix ${\displaystyle X^{\mathsf {T}}X}$ cannot be inverted. Under these circumstances, for a general linear model ${\displaystyle y=X\beta +\epsilon }$, the ordinary least squares estimator ${\displaystyle {\hat {\beta }}_{OLS}=(X^{\mathsf {T}}X)^{-1}X^{\mathsf {T}}y}$ does not exist.

## Definition

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. That is, for all observations ${\displaystyle i}$,

${\displaystyle \lambda _{0}+\lambda _{1}X_{1i}+\lambda _{2}X_{2i}+\cdots +\lambda _{k}X_{ki}=0\qquad (1)}$

where the constants ${\displaystyle \lambda _{0},\lambda _{1},\ldots ,\lambda _{k}}$ are not all zero and ${\displaystyle X_{ki}}$ is the ${\displaystyle i}$th observation on the ${\displaystyle k}$th explanatory variable.

To explore one issue caused by multicollinearity, consider the process of attempting to obtain estimates for the parameters of the multiple regression equation

${\displaystyle Y_{i}=\beta _{0}+\beta _{1}X_{1i}+\cdots +\beta _{k}X_{ki}+\varepsilon _{i}}$.

The ordinary least squares estimates involve inverting the matrix ${\displaystyle X^{\mathsf {T}}X}$, where

${\displaystyle X={\begin{bmatrix}1&X_{11}&\cdots &X_{k1}\\\vdots &\vdots &&\vdots \\1&X_{1N}&\cdots &X_{kN}\end{bmatrix}}}$

is an ${\displaystyle N\times (k+1)}$ matrix, where ${\displaystyle N}$ is the number of observations, ${\displaystyle k}$ is the number of explanatory variables, and ${\displaystyle N\geq k+1}$. If there is an exact linear relationship (perfect multicollinearity) among the independent variables, then at least one of the columns of ${\displaystyle X}$ is a linear combination of the others, and so the rank of ${\displaystyle X}$ (and therefore of ${\displaystyle X^{\mathsf {T}}X}$) is less than ${\displaystyle k+1}$, and the matrix ${\displaystyle X^{\mathsf {T}}X}$ will not be invertible.
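This rank deficiency can be verified numerically. The following sketch (using NumPy, with made-up data) constructs a design matrix whose third regressor is an exact linear combination of the first two, so ${\displaystyle X}$ has rank 3 rather than ${\displaystyle k+1=4}$:

```python
import numpy as np

# Hypothetical data: the third regressor is an exact linear combination
# of the first two (x3 = x1 + x2), i.e., perfect multicollinearity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + x2                              # exact linear relationship
X = np.column_stack([np.ones(50), x1, x2, x3])

print(np.linalg.matrix_rank(X))           # 3: less than the 4 columns
print(np.linalg.cond(X.T @ X))            # effectively infinite: X'X is singular
```

Because ${\displaystyle X^{\mathsf {T}}X}$ is singular here, any attempt to invert it (and hence to form the OLS estimator) fails or produces meaningless numbers.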

Perfect collinearity is common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly collinear variables often remain due to correlations inherent in the system being studied. In such a case, Equation (1) may be modified to include an error term ${\displaystyle v_{i}}$:

${\displaystyle \lambda _{0}+\lambda _{1}X_{1i}+\lambda _{2}X_{2i}+\cdots +\lambda _{k}X_{ki}+v_{i}=0}$.

In this case, there is no exact linear relationship among the variables, but the variables ${\displaystyle X_{j}}$ are nearly collinear if the variance of ${\displaystyle v_{i}}$ is small. The matrix ${\displaystyle X^{\mathsf {T}}X}$ then has an inverse, but it is ill-conditioned. A computer algorithm may or may not be able to compute an approximate inverse; even if it can, the resulting inverse may have large rounding errors.
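A minimal sketch of this near-collinear case (made-up data): one regressor equals another plus a small-variance error term ${\displaystyle v_{i}}$, so ${\displaystyle X^{\mathsf {T}}X}$ is invertible but ill-conditioned:

```python
import numpy as np

# Hypothetical near-collinearity: x2 equals x1 plus a small-variance
# error term v_i, so X'X has an inverse but is ill-conditioned.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)  # v_i with small variance
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.cond(XtX))   # huge condition number: small changes in the
inv_XtX = np.linalg.inv(XtX) # data move the computed inverse a lot
```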

## Measures

The following are measures of multicollinearity:

1. Variance inflation factor (VIF):
${\displaystyle \mathrm {tolerance} =1-R_{j}^{2},\quad \mathrm {VIF} ={\frac {1}{\mathrm {tolerance} }},}$
where ${\displaystyle R_{j}^{2}}$ is the coefficient of determination from regressing the ${\displaystyle j}$th variable on all the other regressors. Contrary to a popular rule of thumb, VIF values greater than 5, 10, 20, or 40 do not by themselves indicate "severe" multicollinearity: a large VIF can be present regardless of how accurately a regression is estimated.[2]
2. Condition number: The standard measure of ill-conditioning in a matrix is its condition number, computed as the maximum singular value of the design matrix divided by its minimum singular value. It indicates whether inverting the matrix is numerically unstable with finite-precision numbers (standard computer floats and doubles), i.e., how sensitive the computed inverse is to small changes in the original matrix. A condition number above 30 suggests that the regression may have severe multicollinearity; multicollinearity exists if, in addition, two or more of the variables associated with the high condition number have high proportions of variance explained. One advantage of this method is that it also shows which variables are causing the problem.[3]
3. Correlation matrices: Calculating the correlation between every pair of explanatory variables can suggest which pairs of right-hand-side variables might be creating multicollinearity problems. Off-diagonal correlation values of 0.4 or more are sometimes interpreted as indicating a multicollinearity problem, but this is incorrect: multicollinearity can only be detected by examining all of the variables simultaneously, and it may be present even when every pairwise correlation is small.
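The three measures above can be computed directly. The following sketch (NumPy, made-up data in which ${\displaystyle x_{3}}$ is nearly equal to ${\displaystyle x_{1}+x_{2}}$) illustrates why pairwise correlations alone can understate the problem:

```python
import numpy as np

# Hypothetical data: x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2, x3])

# 1. VIF: regress the j-th variable on the others, VIF = 1 / (1 - R_j^2).
def vif(X, j):
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - (y - A @ beta).var() / y.var()
    return 1.0 / (1.0 - r2)

print([vif(X, j) for j in range(X.shape[1])])  # all very large

# 2. Condition number: max / min singular value of the design matrix.
print(np.linalg.cond(X))

# 3. Pairwise correlations: these look unremarkable even though the
#    joint collinearity is near-perfect.
print(np.corrcoef(X, rowvar=False))
```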

## Consequences

The primary consequence of approximate multicollinearity is that, even if the matrix ${\displaystyle X^{\mathsf {T}}X}$ is invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse. Even if it does obtain one, the inverse may be numerically inaccurate.

When there is a strong correlation among predictor variables in the population, it is difficult to identify which of several variables should be included. However, this is not an artifact of poor modeling. Estimated standard errors are large because there is partial confounding of different variables, which makes it difficult to identify which regressor is truly causing the outcome variable. This confounding remains even when the researcher attempts to ignore it (by excluding variables from the regression). As a result, excluding multicollinear variables from regressions will often invalidate causal inference by removing important confounders.

## Remedies to numerical problems

1. Make sure the data are not redundant. Datasets often include redundant variables. For example, a dataset may include variables for income, expenses, and savings. However, because income equals expenses plus savings by definition, it is incorrect to include all three variables simultaneously. Similarly, including a dummy variable for every category (e.g., summer, autumn, winter, and spring) as well as a constant term creates perfect multicollinearity.
2. Standardize predictor variables. Generating polynomial terms (i.e., for ${\displaystyle x_{1}}$, ${\displaystyle x_{1}^{2}}$, ${\displaystyle x_{1}^{3}}$, etc.) or interaction terms (i.e., ${\displaystyle x_{1}\times x_{2}}$, etc.) can cause multicollinearity if the variable in question has a limited range. Mean-centering will eliminate this special kind of multicollinearity.[4]
3. Use an orthogonal representation of the data.[5] Poorly written statistical software will sometimes fail to converge to a correct representation when variables are strongly correlated. However, it is still possible to rewrite the regression so that it uses only uncorrelated variables by performing a change of basis.
4. The model should be left as is. Multicollinearity does not affect the accuracy of the model or its predictions. It is a numerical problem, not a statistical one. Excluding collinear variables leads to artificially small estimates for standard errors, but does not reduce the true (not estimated) standard errors for regression coefficients.[6] Excluding variables with a high variance inflation factor invalidates all calculated standard errors and p-values, by turning the results of the regression into a post hoc analysis.[7]
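Remedy 2 can be illustrated with a short sketch (made-up data): for a variable with a limited range far from zero, ${\displaystyle x}$ and ${\displaystyle x^{2}}$ are nearly collinear, and mean-centering before squaring removes this nonessential collinearity:

```python
import numpy as np

# Hypothetical variable with a limited range far from zero, so that
# x and x^2 are nearly collinear before centering.
rng = np.random.default_rng(3)
x = rng.uniform(10.0, 12.0, size=5000)

raw = np.corrcoef(x, x**2)[0, 1]          # close to 1: near-collinear
xc = x - x.mean()                         # mean-center first
centered = np.corrcoef(xc, xc**2)[0, 1]   # near 0 after centering

print(raw, centered)
```

Centering works here because, for a roughly symmetric variable, the covariance between the centered variable and its square is approximately the third central moment, which is near zero.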