

Given a sample X1, ..., Xn from a random vector X ∈ Rp×1 (a p×1 column), the unbiased estimator of the covariance matrix

    \operatorname{Cov}(X) = \operatorname{E}\left[ (X - \operatorname{E}[X]) (X - \operatorname{E}[X])^T \right]

is the sample covariance matrix

    \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T ,

where

    \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i

is the sample mean. This is true regardless of whether the random variable X has a normal distribution or not. The reason for the factor n − 1 rather than n is essentially that the mean is not known and is replaced by the sample mean.

The maximum likelihood estimator of the covariance matrix, however, is slightly different. When the random variable X is normally distributed, the maximum likelihood estimate is

    \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T .

Clearly, the difference between the unbiased and the maximum likelihood estimator diminishes for large n.
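The difference can also be checked numerically. The following sketch (an illustrative NumPy example with an arbitrary simulated sample) computes both estimators and verifies that they differ exactly by the factor n/(n − 1):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))                    # n observations of a p-dimensional vector

    xbar = X.mean(axis=0)                          # sample mean
    centered = X - xbar
    S_unbiased = centered.T @ centered / (n - 1)   # unbiased estimator
    S_mle = centered.T @ centered / n              # maximum likelihood estimate

    # The two estimators differ only by the factor n/(n - 1)
    assert np.allclose(S_unbiased, S_mle * n / (n - 1))

    # NumPy equivalents: ddof=1 (the default) is unbiased, ddof=0 is the MLE
    assert np.allclose(S_unbiased, np.cov(X, rowvar=False, ddof=1))
    assert np.allclose(S_mle, np.cov(X, rowvar=False, ddof=0))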

The probability distribution of the maximum likelihood estimator of the covariance matrix of a multivariate normal distribution is the Wishart distribution. Although no one is surprised that the estimator of the population covariance matrix is closely related to the sample covariance matrix, the mathematical derivation is perhaps not widely known and is surprisingly subtle and elegant.

The multivariate normal distribution

A random vector X ∈ Rp×1 (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ Rp×p is a positive-definite matrix and the probability density function of X is

    f(x) = (2\pi)^{-p/2} \det(\Sigma)^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),

where μ ∈ Rp×1 is the expected value. The matrix Σ is the higher-dimensional analog of what in one dimension would be the variance.
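As a quick numerical check of this density (a sketch only; the particular μ, Σ and x below are arbitrary, and SciPy's multivariate_normal is used for comparison):

    import numpy as np
    from scipy.stats import multivariate_normal

    p = 2
    mu = np.array([1.0, -2.0])                     # expected value (arbitrary example)
    Sigma = np.array([[2.0, 0.3],
                      [0.3, 1.0]])                 # positive-definite covariance matrix
    x = np.array([0.5, -1.0])

    # Density evaluated directly from the formula above
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    pdf_formula = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

    # Same density from SciPy's implementation
    pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
    assert np.isclose(pdf_formula, pdf_scipy)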

Maximum-likelihood estimation

Suppose now that X1, ..., Xn are independent and identically distributed with the distribution above. Based on the observed values x1, ..., xn of this sample, we wish to estimate Σ (we adhere to the convention of writing random variables as capital letters and data as lower-case letters).

First steps

It is fairly readily shown that the maximum-likelihood estimate of the mean vector μ is the "sample mean" vector:

    \bar{x} = \frac{x_1 + \cdots + x_n}{n}.

See the section on estimation in the article on the normal distribution for details; the process here is similar.

Since the estimate of μ does not depend on Σ, we can just substitute it for μ in the likelihood function

    L(\mu, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left( -\tfrac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right),

and then seek the value of Σ that maximizes this (in practice it is easier to work with log L).

We have

    L(\bar{x}, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left( -\tfrac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^T \Sigma^{-1} (x_i - \bar{x}) \right).

The trace of a 1 × 1 matrix

Now we come to the first surprising step.

Regard the scalar (xi − x̄)T Σ−1 (xi − x̄) as the trace of a 1×1 matrix!

This makes it possible to use the identity tr(AB) = tr(BA) whenever A and B are matrices so shaped that both products exist. We get

    L(\bar{x}, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left( -\tfrac{1}{2} \sum_{i=1}^n \operatorname{tr}\left( (x_i - \bar{x})(x_i - \bar{x})^T \Sigma^{-1} \right) \right) = \det(\Sigma)^{-n/2} \exp\left( -\tfrac{1}{2} \operatorname{tr}\left( S \Sigma^{-1} \right) \right)

(so now we are taking the trace of a p×p matrix!)

where

    S = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \in \mathbf{R}^{p \times p}.

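The trace manipulation above can be verified numerically; a brief sketch (with arbitrary data and an arbitrary positive-definite Σ):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 20, 4
    x = rng.normal(size=(n, p))
    A = rng.normal(size=(p, p))
    Sigma = A @ A.T + p * np.eye(p)        # an arbitrary positive-definite matrix

    xbar = x.mean(axis=0)
    centered = x - xbar
    S = centered.T @ centered              # S = sum_i (x_i - xbar)(x_i - xbar)^T

    Sigma_inv = np.linalg.inv(Sigma)
    quad_sum = sum(c @ Sigma_inv @ c for c in centered)    # sum of the 1x1 "traces"
    assert np.isclose(quad_sum, np.trace(S @ Sigma_inv))   # equals tr(S Sigma^-1)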
Using the spectral theorem

It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix S has a unique positive-definite symmetric square root S1/2. We can again use the "cyclic property" of the trace to write

    \det(\Sigma)^{-n/2} \exp\left( -\tfrac{1}{2} \operatorname{tr}\left( S^{1/2} \Sigma^{-1} S^{1/2} \right) \right).

Let B = S1/2 Σ−1 S1/2. Then the expression above becomes

    \det(S)^{-n/2} \det(B)^{n/2} \exp\left( -\tfrac{1}{2} \operatorname{tr}(B) \right).

The positive-definite matrix B can be diagonalized, and then the problem of finding the value of B that maximizes

    \det(B)^{n/2} \exp\left( -\tfrac{1}{2} \operatorname{tr}(B) \right)

reduces to the problem of finding the values of the diagonal entries λ1, ..., λp that maximize

    \prod_{i=1}^p \lambda_i^{n/2} \exp\left( -\tfrac{1}{2} \lambda_i \right).

This is just a calculus problem: each factor is maximized where the derivative of (n/2) log λi − λi/2 vanishes, which gives λi = n, so that B = n Ip, i.e., n times the p×p identity matrix.

Concluding steps

Finally we get

    \Sigma = S^{1/2} B^{-1} S^{1/2} = S^{1/2} \left( \tfrac{1}{n} I_p \right) S^{1/2} = \frac{S}{n} ,

i.e., the p×p "sample covariance matrix"

    \frac{S}{n} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T

is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital X rather than a lower-case x because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix S can be shown to have a Wishart distribution with n − 1 degrees of freedom.[1] That is:

    S = \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T \sim W_p(\Sigma, n-1).
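One consequence of this distribution is that E[S] = (n − 1)Σ, which is also why the unbiased estimator divides by n − 1. This can be checked by simulation; a rough sketch (the sample size, Σ and number of trials below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    p, n, trials = 2, 10, 20000
    mu = np.zeros(p)
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])

    S_mean = np.zeros((p, p))
    for _ in range(trials):
        X = rng.multivariate_normal(mu, Sigma, size=n)
        centered = X - X.mean(axis=0)
        S_mean += centered.T @ centered / trials     # Monte Carlo average of S

    # E[S] = (n - 1) * Sigma, consistent with a Wishart distribution with n - 1 d.o.f.
    print(S_mean / (n - 1))                          # should be close to Sigma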

Alternative derivation

An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Re-write the likelihood in the log form using the trace trick:

    \ln L(\mu, \Sigma) = \operatorname{const} - \frac{n}{2} \ln \det(\Sigma) - \frac{1}{2} \operatorname{tr}\left[ \Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T \right].

The differential of this log-likelihood is

    d \ln L(\mu, \Sigma) = -\frac{n}{2} \operatorname{tr}\left[ \Sigma^{-1} \, d\Sigma \right] - \frac{1}{2} \operatorname{tr}\left[ - \Sigma^{-1} \, d\Sigma \, \Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T - 2 \, \Sigma^{-1} \sum_{i=1}^n (x_i - \mu) \, d\mu^T \right].

It naturally breaks down into the part related to the estimation of the mean, and to the part related to the estimation of the variance. The first order condition for a maximum, d ln L(μ, Σ) = 0, is satisfied when the terms multiplying dμ and dΣ are identically zero. Assuming (the maximum likelihood estimate of) Σ is non-singular, the first order condition for the estimate of the mean vector is

    \sum_{i=1}^n (x_i - \mu) = 0,

which leads to the maximum likelihood estimator

    \hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i .

This lets us simplify the sum ∑i (xi − μ)(xi − μ)T to ∑i (xi − x̄)(xi − x̄)T = S as defined above. Then the terms involving dΣ in d ln L can be combined as

    -\frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \, d\Sigma \left[ n I_p - \Sigma^{-1} S \right] \right).

The first order condition will hold when the term in the square bracket is (matrix-valued) zero. Pre-multiplying the latter by Σ and dividing by n gives

    \hat{\Sigma} = \frac{1}{n} S ,

which of course coincides with the canonical derivation given earlier.

Shrinkage estimation

If the sample size n is small and the number of considered variables p is large, the above empirical estimators of covariance and correlation are very unstable. Specifically, it is possible to furnish estimators that improve considerably upon the maximum likelihood estimate in terms of mean squared error. Moreover, for n < p, the empirical estimate of the covariance matrix becomes singular, i.e. it cannot be inverted to compute the precision matrix.
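The singularity for n < p is easy to observe directly; a small sketch (the dimensions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 5, 10                              # fewer observations than variables
    X = rng.normal(size=(n, p))

    S = np.cov(X, rowvar=False)               # p x p empirical covariance matrix
    print(np.linalg.matrix_rank(S))           # at most n - 1 = 4, so S is rank deficient
    print(np.isclose(np.linalg.det(S), 0.0))  # True: S is singular and cannot be inverted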

As an alternative, many methods have been suggested to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage. This is implicit in Bayesian methods and in penalized maximum likelihood methods, and explicit in the Stein-type shrinkage approach.

A simple version of a shrinkage estimator of the covariance matrix is constructed as follows. One considers a convex combination of the empirical estimator with some suitably chosen target, e.g., a diagonal matrix. Subsequently, the mixing parameter is selected to maximize the expected accuracy of the shrunken estimator. This can be done by cross-validation, or by using an analytic estimate of the shrinkage intensity. The resulting regularized estimator can be shown to outperform the maximum likelihood estimator for small samples. For large samples, the shrinkage intensity will reduce to zero, hence in this case the shrinkage estimator will be identical to the empirical estimator. Apart from increased efficiency, the shrinkage estimate has the additional advantage that it is always positive definite and well conditioned.
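A minimal sketch of such an estimator is given below (the diagonal target and the fixed mixing parameter are illustrative choices only; in practice the intensity would be chosen by cross-validation or by an analytic estimate, as described above):

    import numpy as np

    def shrink_covariance(X, lam):
        """Convex combination of the empirical covariance with a diagonal target.

        X   : (n, p) data matrix, rows are observations
        lam : mixing parameter in [0, 1]; 0 gives the empirical estimator,
              1 gives the diagonal target.
        """
        S = np.cov(X, rowvar=False)            # empirical (unbiased) covariance
        target = np.diag(np.diag(S))           # keep variances, shrink covariances toward 0
        return lam * target + (1.0 - lam) * S

    rng = np.random.default_rng(4)
    X = rng.normal(size=(5, 10))               # n = 5 < p = 10: empirical estimate is singular
    S_shrunk = shrink_covariance(X, lam=0.2)   # illustrative fixed shrinkage intensity
    print(np.all(np.linalg.eigvalsh(S_shrunk) > 0))   # True: positive definite, hence invertible

Even this simple construction yields a positive-definite, invertible estimate in the n < p setting where the empirical covariance matrix is singular.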

A review of this topic is given, e.g., by Schäfer and Strimmer (2005).[2]

A covariance shrinkage estimator is implemented in the R package "corpcor".

References

  1. ^ K.V. Mardia, J.T. Kent, and J.M. Bibby (1979) Multivariate Analysis, Academic Press.
  2. ^ J. Schäfer and K. Strimmer (2005) A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics, Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 32.