Principal component regression
In statistics, principal component regression (PCR) is a regression technique that uses principal component analysis when estimating the regression coefficients. It is a procedure used to overcome the problems that arise when the explanatory variables are close to being collinear.
In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. Typically only a subset of the principal components is used in the regression, making PCR a kind of regularized estimation.
Often the principal components with the largest variances are selected. However, the low-variance principal components may also be important, and in some cases even more important.
PCR can be divided into three steps:
- The first step is to run a principal components analysis on the table of the explanatory variables, and (usually) select a subset of the components.
- The second step is to run an ordinary least squares regression (linear regression) on the selected principal components.
- The resulting model is transformed by the inverse of the PCA loadings, so that it is now in terms of the explanatory variables.
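The three steps above can be sketched as follows. This is a minimal illustration using NumPy with made-up toy data; the variable names and the choice of how many components to keep are assumptions for the example, not part of the procedure's definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 observations, 3 explanatory variables, two of them
# nearly collinear (an assumed example to motivate PCR).
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Step 1: principal component analysis of the explanatory variables
# (via SVD of the centered data); keep the top k components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
W = Vt[:k].T          # loadings of the retained components
Z = Xc @ W            # scores: data projected onto the components

# Step 2: ordinary least squares regression on the component scores.
gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)

# Step 3: transform back so the model is in terms of the original
# explanatory variables.
beta_pcr = W @ gamma
```

Because the retained loadings `W` are orthonormal, mapping the component-space coefficients back to the original variables is just the multiplication in step 3.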
A linear regression follows the model

$y = X\beta + \varepsilon,$

where $\beta$ is the vector of coefficients that explains the connection between $X$ and $y$. There are many forms of regression, each of which finds $\hat{\beta}$, an estimate of $\beta$, using sample data. The most common form of regression, ordinary least squares, finds the value of $\beta$ that minimizes the residual sum of squares $\lVert y - X\beta \rVert^2$, the solution being

$\hat{\beta} = (X^\top X)^{-1} X^\top y.$
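As a quick check of the closed-form solution, the sketch below computes $\hat{\beta} = (X^\top X)^{-1} X^\top y$ directly on simulated data (the data and names are assumptions for illustration) and confirms it matches NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated well-conditioned design (assumed example).
X = rng.normal(size=(50, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.05 * rng.normal(size=50)

# Closed-form OLS estimate: beta_hat = (X^T X)^{-1} X^T y.
# Solving the normal equations is preferable to forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer as the library least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```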
To examine how well this estimator operates we look at its expected value and variance, given normally distributed random variables for $X$ and $y$. In this case, the expected value is

$\mathbb{E}[\hat{\beta}] = (X^\top X)^{-1} X^\top \, \mathbb{E}[y] = (X^\top X)^{-1} X^\top X \beta = \beta.$
Thus a simple linear regression is unbiased: given enough data it will approach the true value of $\beta$. The covariance matrix of $\hat{\beta}$ is

$\operatorname{Var}(\hat{\beta}) = (X^\top X)^{-1} X^\top \operatorname{Var}(y)\, X (X^\top X)^{-1},$
and because the observations are independent and identically distributed, $\operatorname{Var}(y) = \sigma^2 I$, and

$\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}.$
This means that the variance of the estimator is inversely proportional to the covariance of the explanatory variables; that is, the directions with less variance actually introduce more variability into the regression. As the principal components of $X$ are the eigenvectors of $X^\top X$, the concept can be shown through an eigenvalue decomposition of $(X^\top X)^{-1}$:

$\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1} = \sigma^2 \sum_k \frac{v_k v_k^\top}{\lambda_k},$
where $v_k$ are the loadings of component $k$ and $\lambda_k$ are the eigenvalues, i.e. the amount of the total variance contained in component $k$. It is easy to see that very small values of $\lambda_k$, which correspond to near-linear relationships between the variables, have the largest effect on the variance of this estimator. Similarly, removing the lowest-variance principal components lowers the magnitude of the estimator's variance the most, while discarding the least variance from the data. PCR is slightly more complex, as it actually regresses in the principal-component space rather than the original space, but this demonstrates why eliminating collinearities in the data can give a more accurate regression.
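The effect of the eigenvalues on the estimator's variance can be seen numerically. In this sketch (the design matrix is an assumed nearly collinear example), the $1/\lambda_k$ terms show that the smallest eigenvalue of $X^\top X$ dominates $\operatorname{Var}(\hat{\beta})$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Nearly collinear design: the second column is almost a copy of the first,
# so X^T X has one tiny eigenvalue.
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

# Eigenvalues of X^T X in ascending order, and the per-component
# contribution 1/lambda_k to Var(beta_hat) = sigma^2 * sum_k v_k v_k^T / lambda_k.
eigvals = np.linalg.eigvalsh(X.T @ X)
inv_contrib = 1.0 / eigvals
```

The contribution of the low-variance component is orders of magnitude larger than that of the high-variance one, which is exactly why dropping it stabilizes the estimate.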
See also
- Canonical correlation
- Deming regression
- Multilinear subspace learning
- Partial least squares regression
- Principal component analysis
- Total sum of squares