In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the data or the procedure used to fit the model.
Contrary to popular belief, including collinear variables does not reduce the predictive power or reliability of the model as a whole, nor does it reduce how accurately coefficients are estimated. In fact, high collinearity indicates that it is exceptionally important to include all variables, as excluding any variable will cause strong confounding.
Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the predictors. In such a case, the design matrix $X$ has less than full rank, and therefore the moment matrix $X^{\mathsf{T}}X$ cannot be inverted. Under these circumstances, for a general linear model $y = X\beta + \varepsilon$, the ordinary least squares estimator $\hat{\beta}_{\mathrm{OLS}} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$ does not exist.
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. That is, for all observations $i$,

$$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0 \qquad (1)$$

where the $\lambda_j$ are constants (not all zero) and $X_{ji}$ is the $i$-th observation on the $j$-th explanatory variable.
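To explore one issue caused by multicollinearity, consider the process of attempting to obtain estimates for the parameters of the multiple regression equation

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i.$$

The ordinary least squares estimates involve inverting the matrix $X^{\mathsf{T}}X$, where

$$X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}$$

is an $N \times (k+1)$ matrix, where $N$ is the number of observations, $k$ is the number of explanatory variables, and $N \geq k+1$. If there is an exact linear relationship (perfect multicollinearity) among the independent variables, then at least one of the columns of $X$ is a linear combination of the others, and so the rank of $X$ (and therefore of $X^{\mathsf{T}}X$) is less than $k+1$, and the matrix $X^{\mathsf{T}}X$ will not be invertible.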
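The rank argument can be checked numerically. The following is a minimal sketch using NumPy, with simulated predictors in which the third variable is built as an exact linear combination of the first two. The variable names, sample size, and coefficients are illustrative assumptions, not values from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # number of observations (N in the text)

# Hypothetical predictors: x3 is an exact linear combination of x1 and x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 + 3.0 * x2

# Design matrix with an intercept column: N x (k + 1), here k = 3.
X = np.column_stack([np.ones(n), x1, x2, x3])

# The numerical rank falls short of the number of columns.
rank = np.linalg.matrix_rank(X)

# The moment matrix X^T X is singular up to rounding, so its
# condition number is enormous (or infinite).
cond_gram = np.linalg.cond(X.T @ X)

print("columns of X:", X.shape[1])
print("rank of X:", rank)
print("condition number of X^T X:", cond_gram)
```

Attempting to invert the moment matrix in this setting may either raise an error or return a numerically meaningless result, depending on how rounding falls.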
Perfect collinearity is common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly collinear variables often remain due to correlations inherent in the system being studied. In such a case, Equation (1) may be modified to include an error term $v_i$:

$$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0.$$

There is then no exact linear relationship among the variables, but the variables are nearly collinear if the variance of $v_i$ is small. The matrix $X^{\mathsf{T}}X$ has an inverse, but it is ill-conditioned. A computer algorithm may or may not be able to compute an approximate inverse; even if it can, the resulting inverse may have large rounding errors.
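As a rough illustration of how ill-conditioning grows as the error term shrinks, the sketch below builds a second predictor as the first plus noise and reports the condition number of the moment matrix for several noise levels. The helper name `gram_condition_number`, the sample size, and the noise levels are assumptions chosen for the example.

```python
import numpy as np


def gram_condition_number(noise_sd, n=200, seed=0):
    """Condition number of X^T X when x2 = x1 + v and sd(v) = noise_sd."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    v = rng.normal(scale=noise_sd, size=n)  # plays the role of the error term v_i
    x2 = x1 + v
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.cond(X.T @ X)


noise_levels = (1.0, 0.1, 0.01, 0.001)
conds = [gram_condition_number(sd) for sd in noise_levels]

for sd, c in zip(noise_levels, conds):
    print(f"sd(v) = {sd:<6} cond(X^T X) = {c:.3e}")
```

In this setup the condition number rises steeply as the standard deviation of the error term falls, which is the sense in which the inverse exists but becomes increasingly sensitive to rounding.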
The following are measures of multicollinearity:
- Variance inflation factor (VIF):

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination from regressing the $j$-th variable on all other regressors, and so measures how well the $j$-th variable can be estimated from them. It is a popular misconception that factors greater than 5, 10, 20, or 40 indicate "severe" multicollinearity. A large VIF can be present regardless of how accurately a regression is estimated.
- Condition number: The standard measure of ill-conditioning in a matrix is the condition number. It indicates whether the inversion of the matrix is numerically unstable with finite-precision numbers (standard computer floats and doubles), and hence how sensitive the computed inverse may be to small changes in the original matrix. The condition number is computed as the maximum singular value of the design matrix divided by its minimum singular value. If the condition number is above 30, the regression may have severe multicollinearity; multicollinearity exists if, in addition, two or more of the variables related to the high condition number have high proportions of variance explained. One advantage of this method is that it also shows which variables are causing the problem.
- Correlation matrices: Calculating the correlation between every pair of explanatory variables gives an indication of how likely any given pair of right-hand-side variables is to be creating multicollinearity problems. Correlation values (off-diagonal elements) of at least 0.4 are sometimes interpreted as indicating a multicollinearity problem, but this is incorrect; multicollinearity can only be detected by looking at all variables simultaneously, and may be present even when all pairwise correlations are small.
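The three measures above can be computed side by side. The sketch below is one possible implementation with NumPy, not a reference one: the function `vif` and the simulated data are assumptions made for illustration. The data are constructed so that one variable is nearly the sum of 25 independent ones, which keeps every pairwise correlation modest while the variables are jointly almost linearly dependent.

```python
import numpy as np


def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of X.

    X holds the explanatory variables only; an intercept column is
    added to each auxiliary regression.
    """
    n, k = X.shape
    factors = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        total_ss = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - (residual @ residual) / total_ss
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors


rng = np.random.default_rng(1)
n = 2000
base = rng.normal(size=(n, 25))                           # independent variables
total = base.sum(axis=1) + rng.normal(scale=0.1, size=n)  # nearly their sum
X = np.column_stack([base, total])

# Largest absolute off-diagonal entry of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(X.shape[1])))

# Condition number: ratio of extreme singular values of the design matrix.
design = np.column_stack([np.ones(n), X])
singular_values = np.linalg.svd(design, compute_uv=False)
condition_number = singular_values.max() / singular_values.min()

factors = vif(X)

print("largest pairwise |correlation|:", round(max_pairwise, 3))
print("smallest and largest VIF:", factors.min(), factors.max())
print("condition number of design matrix:", condition_number)
```

In this construction the correlation matrix alone gives little hint of a problem, whereas the VIFs and the condition number, which use all variables at once, are both large.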
The primary consequence of approximate multicollinearity is that, even if the matrix $X^{\mathsf{T}}X$ is invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse. Even if it does obtain one, the inverse may be numerically inaccurate.
When there is a strong correlation among predictor variables in the population, it is difficult to identify which of several variables should be included. However, this is not an artifact of poor modeling. Estimated standard errors are large because there is partial confounding of different variables, which makes it difficult to identify which regressor is truly causing the outcome variable. This confounding remains even when the researcher attempts to ignore it (by excluding variables from the regression). As a result, excluding multicollinear variables from regressions will often invalidate causal inference by removing important confounders.
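The effect of dropping a correlated regressor can be illustrated with a small simulation. This is a sketch under stated assumptions: two predictors with population correlation 0.9, both with a true slope of 1, and a hypothetical helper `ols` that fits a regression with an intercept by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
rho = 0.9  # assumed population correlation between the two predictors

x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # both true slopes equal 1


def ols(columns, y):
    """Least-squares coefficients (intercept first) for the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


full = ols([x1, x2], y)   # both predictors included
short = ols([x1], y)      # x2 excluded

print("slope on x1, full model: ", full[1])
print("slope on x1, x2 excluded:", short[1])
```

With both predictors included, the slope on `x1` is estimated near its true value of 1. With `x2` excluded, the slope on `x1` absorbs the effect of the omitted variable and lands near 1 + rho instead.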
Remedies to numerical problems
- Make sure the data are not redundant. Datasets often include redundant variables. For example, a dataset may include variables for income, expenses, and savings. However, because income is equal to expenses plus savings (by definition), it is incorrect to include all 3 variables simultaneously. Similarly, including a dummy variable for every category (e.g., summer, autumn, winter, and spring) as well as a constant term creates perfect multicollinearity.
- Standardize predictor variables. Generating polynomial terms (e.g., $x_1$, $x_1^2$, $x_1^3$, etc.) or interaction terms (e.g., $x_1 \times x_2$, etc.) can cause multicollinearity if the variable in question has a limited range. Mean-centering the variable before forming these terms will eliminate or greatly reduce this special kind of multicollinearity.
- Use an orthogonal representation of the data. Poorly written statistical software will sometimes fail to converge to a correct representation when variables are strongly correlated. However, it is still possible to rewrite the regression to use only uncorrelated variables by performing a change of basis.
- The model should be left as is. Multicollinearity does not affect the accuracy of the model or its predictions. It is a numerical problem, not a statistical one. Excluding collinear variables leads to artificially small estimates for standard errors, but does not reduce the true (not estimated) standard errors for regression coefficients. Excluding variables with a high variance inflation factor invalidates all calculated standard errors and p-values, by turning the results of the regression into a post hoc analysis.
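Two of the remedies listed above, mean-centering and an orthogonal change of basis, can be sketched with NumPy. The data are simulated, and the QR decomposition is used here as one convenient way to obtain an orthogonal basis; it is an assumption of this example rather than the only possible choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(2.0, 4.0, size=n)  # limited range, far from zero
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(scale=0.1, size=n)

# 1. Mean-centering before forming the squared term.
raw_corr = np.corrcoef(x, x**2)[0, 1]
xc = x - x.mean()
centered_corr = np.corrcoef(xc, xc**2)[0, 1]

# 2. Orthogonal representation via a QR decomposition of the design matrix.
X = np.column_stack([np.ones(n), x, x**2])
Q, R = np.linalg.qr(X)               # X = Q R, columns of Q are orthonormal
theta = Q.T @ y                      # coefficients on the orthogonal columns
beta_qr = np.linalg.solve(R, theta)  # mapped back to the original variables

# Reference fit on the original, correlated columns.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("corr(x, x^2) before centering:", raw_corr)
print("corr(x, x^2) after centering: ", centered_corr)
print("coefficients via QR:   ", beta_qr)
print("coefficients via lstsq:", beta_lstsq)
```

The change of basis alters only how the regression is computed. The coefficients recovered for the original variables agree with a direct least-squares fit up to rounding, in line with the point that the model itself can be left as is.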