# Principal component regression

In statistics, principal component regression (PCR) is a regression analysis technique that uses principal component analysis when estimating regression coefficients. It is a procedure used to overcome problems that arise when the explanatory variables are close to being collinear.[1]

In PCR, instead of regressing the dependent variable on the independent variables directly, the principal components of the independent variables are used as regressors. Typically only a subset of the principal components is used in the regression, making PCR a kind of regularized estimation.

Often the principal components with the highest variances are selected. However, the low-variance principal components may also be important, in some cases even more important.[2]

## The principle

PCR (principal components regression) is a regression method that can be divided into three steps:[citation needed]

1. The first step is to run a principal components analysis on the matrix of explanatory variables, and (usually) select a subset of the components.
2. The second step is to run an ordinary least squares regression (linear regression) of the dependent variable on the selected principal components.
3. The resulting model is transformed back, using the PCA loadings, so that it is expressed in terms of the original explanatory variables (see the sketch below).
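
As an illustration, the following is a minimal sketch of these three steps in Python with NumPy. The function name `pcr` and its arguments are hypothetical choices for this example, and centering the variables first is one common convention, not the only one:

```python
import numpy as np

def pcr(X, y, n_components):
    """Principal component regression sketch: regress y on the first
    n_components principal components of X, then express the fitted
    coefficients in terms of the original explanatory variables."""
    # Step 1: PCA on the explanatory variables (center, then use the SVD;
    # the right singular vectors are the principal directions/loadings).
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    V = Vt.T[:, :n_components]      # keep a subset of the components
    scores = X_centered @ V         # principal component scores

    # Step 2: ordinary least squares on the selected components.
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)

    # Step 3: map the coefficients back to the original variables via the
    # loadings (V has orthonormal columns, so this inverts the projection).
    beta = V @ gamma
    intercept = y.mean() - X.mean(axis=0) @ beta
    return beta, intercept
```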

## Motivation

A linear regression follows the model,

$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon, \,$

where $\boldsymbol\beta$ is the vector of regression coefficients describing the connection between X and y, and $\boldsymbol\varepsilon$ is an error term. There are many forms of regression, each of which finds $\hat{\boldsymbol\beta}$, an estimate of $\boldsymbol\beta$, using sample data. The most common form, ordinary least squares (OLS), chooses the value of $\hat{\boldsymbol\beta}$ that minimizes the sum of squared residuals, the solution being

$\hat{\boldsymbol\beta} = (\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\mathbf{y}.$
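
As a concrete illustration, this closed form can be evaluated directly with NumPy (a sketch on made-up data; in practice `numpy.linalg.lstsq` or a QR decomposition is numerically preferable to forming $\mathbf{X}^{\rm T}\mathbf{X}$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 observations, 3 predictors
beta_true = np.array([1.0, -2.0, 0.5])   # arbitrary true coefficients
y = X @ beta_true + 0.1 * rng.normal(size=100)

# Normal-equations form of the OLS solution: (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true
```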

To examine how well this estimator operates, we look at its expected value and variance, given normally distributed random variables for X and y.[3] In this case, the expected value is

$E[\hat{\boldsymbol\beta}] = E[(\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\mathbf{y}] = E[(\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\mathbf{X}\boldsymbol\beta] + E[(\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\boldsymbol\varepsilon] = \boldsymbol\beta$, since the second term vanishes when the errors have zero mean.

Thus simple linear regression is unbiased: the sampling distribution of $\hat{\boldsymbol\beta}$ is centered on the true value of $\boldsymbol\beta$.
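
A quick Monte Carlo check illustrates this; the design matrix, coefficients, and seed below are arbitrary choices for the sketch:

```python
import numpy as np

# Fixed design matrix and true coefficients (arbitrary example values).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
beta = np.array([2.0, -1.0])

# Average the OLS estimate over many independent noise draws.
estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(size=50)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
print(np.mean(estimates, axis=0))  # close to the true [2.0, -1.0]
```

The covariance matrix of $\hat{\boldsymbol\beta}$ is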

$\boldsymbol\Sigma_{\hat{\beta}} = E[(\hat{\boldsymbol\beta} - \boldsymbol\beta)(\hat{\boldsymbol\beta} - \boldsymbol\beta)^{\rm T}] = E[((\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\boldsymbol\varepsilon)((\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\boldsymbol\varepsilon)^{\rm T}] = (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}E[\boldsymbol\varepsilon\boldsymbol\varepsilon^{\rm T}]\mathbf{X}(\mathbf{X}^{\rm T}\mathbf{X})^{-1}$

and because the errors are independent and identically distributed with variance $\sigma_y^2$, $E[\boldsymbol\varepsilon\boldsymbol\varepsilon^{\rm T}] = \sigma_y^2 I$ and

$\boldsymbol\Sigma_{\hat{\beta}} = \sigma_y^2 (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{X}(\mathbf{X}^{\rm T}\mathbf{X})^{-1} = \sigma_y^2 (\mathbf{X}^{\rm T}\mathbf{X})^{-1} = \sigma_y^2\boldsymbol\Sigma_x^{-1}$, where $\boldsymbol\Sigma_x = \mathbf{X}^{\rm T}\mathbf{X}$ is the (unnormalized) covariance matrix of the centered explanatory variables.
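
This formula can be checked the same way, comparing the empirical covariance of simulated estimates against $\sigma_y^2(\mathbf{X}^{\rm T}\mathbf{X})^{-1}$; as before, the parameters in the sketch are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
beta = np.array([2.0, -1.0])
sigma = 0.5                      # error standard deviation

# OLS estimates across repeated draws of the noise.
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(scale=sigma, size=50)))
    for _ in range(20000)
])
print(np.cov(estimates.T))                  # empirical covariance of beta-hat
print(sigma**2 * np.linalg.inv(X.T @ X))    # theoretical covariance
```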

The variance of the estimator is thus inversely related to the covariance of the independent variables: dimensions along which X has less variance introduce more variability into the regression estimate. Since the principal components of X are the eigenvectors of $\boldsymbol\Sigma_x$ (and hence also of $\boldsymbol\Sigma_x^{-1}$, with reciprocal eigenvalues), this can be made explicit through an eigenvalue decomposition:

$\boldsymbol\Sigma_{\hat{\beta}} = \sigma_y^2 \sum_{k}\frac{\boldsymbol\alpha_k\boldsymbol\alpha_k^{\rm T}}{\lambda_k}$

where $\boldsymbol\alpha_k$ are the loadings of component k and $\lambda_k$ are the eigenvalues, i.e. the amount of the total variance contained in component k. Very small values of $\lambda_k$, which correspond to near-linear dependencies (collinearities) among the variables, have the largest effect on the variance of this estimator. Conversely, removing the lowest-variance principal components reduces the variance of the estimator the most while discarding the least variance in the data. PCR is slightly more involved, since it actually regresses in the principal component space rather than the original space, but this demonstrates that eliminating collinearities in the data can give a more accurate regression.
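
To see this numerically, the following sketch builds two nearly collinear predictors (the noise scale 0.02 is an arbitrary choice) and verifies the eigenvalue form of the covariance, with the $1/\lambda_k$ weights showing how the low-variance component dominates:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two nearly collinear predictors: x2 is x1 plus a little noise.
x1 = rng.normal(size=200)
x2 = x1 + 0.02 * rng.normal(size=200)
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)

Sigma_x = X.T @ X
lam, alpha = np.linalg.eigh(Sigma_x)     # eigenvalues in ascending order

# Reassemble sum_k alpha_k alpha_k^T / lambda_k (error variance taken as 1).
Sigma_beta = sum(np.outer(alpha[:, k], alpha[:, k]) / lam[k]
                 for k in range(len(lam)))
print(np.allclose(Sigma_beta, np.linalg.inv(Sigma_x)))   # True

# The 1/lambda_k weights: the term for the smallest eigenvalue is huge.
print(1 / lam)
```

Dropping the component with the smallest $\lambda_k$ before regressing, as PCR does, removes exactly the term that dominates this sum.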