Covariance matrix

Sample points from a multivariate Gaussian distribution with a standard deviation of 3 in roughly the lower-left to upper-right direction and of 1 in the orthogonal direction. Because the x and y components co-vary, the variances of x and y do not fully describe the distribution. A 2×2 covariance matrix is needed; the directions of the arrows correspond to the eigenvectors of this covariance matrix and their lengths to the square roots of the eigenvalues.

In probability theory and statistics, a covariance matrix (also known as dispersion matrix or variance-covariance matrix) is a matrix whose element in the i, j position is the covariance between the ith and jth elements of a random vector (that is, of a vector of random variables). Each element of the vector is a scalar random variable, either with a finite number of observed empirical values or with a finite or infinite number of potential values specified by a theoretical joint probability distribution of all the random variables.

Intuitively, the covariance matrix generalizes the notion of variance to multiple dimensions. As an example, the variation in a collection of random points in two-dimensional space cannot be characterized fully by a single number, nor would the variances in the x and y directions contain all of the necessary information; a 2×2 matrix would be necessary to fully characterize the two-dimensional variation.
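The point can be illustrated numerically. The following is a minimal sketch, assuming a two-dimensional distribution with standard deviations 3 and 1 along axes tilted by exactly 45 degrees (the angle and the variable names are assumptions chosen for illustration). Both coordinates end up with the same variance, so only the off-diagonal entry reveals the tilt.

```python
import numpy as np

# Assumed setup: principal axes rotated by 45 degrees,
# standard deviation 3 along the first axis and 1 along the second.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Sigma = R @ np.diag([3.0**2, 1.0**2]) @ R.T   # equals [[5, 4], [4, 5]]

# The variances of x and y (both 5) would equally fit an uncorrelated cloud;
# the covariance entry 4 is what encodes the direction of elongation.
eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
axis_lengths = np.sqrt(eigvals)               # standard deviations along the axes
```

Here `axis_lengths` recovers the two standard deviations, and the columns of `eigvecs` give the corresponding directions.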

Just as a Hessian matrix is needed to fully describe the concavity of a multivariate function, a covariance matrix is needed to describe how the components of a multivariate distribution vary, both individually and jointly.

Definition

Throughout this article, boldfaced unsubscripted X and Y are used to refer to random vectors, and unboldfaced subscripted X_i and Y_i are used to refer to random scalars. If the entries in the column vector

\mathbf {X} ={\begin{bmatrix}X_{1}\\\vdots \\X_{n}\end{bmatrix}}

are random variables, each with finite variance, then the covariance matrix Σ is the matrix whose (i, j) entry is the covariance

\Sigma _{ij}=\mathrm {cov} (X_{i},X_{j})=\mathrm {E} {\begin{bmatrix}(X_{i}-\mu _{i})(X_{j}-\mu _{j})\end{bmatrix}}

where

\mu _{i}=\mathrm {E} (X_{i})\,

is the expected value of the ith entry in the vector X. In other words, we have

\Sigma ={\begin{bmatrix}\mathrm {E} [(X_{1}-\mu _{1})(X_{1}-\mu _{1})]&\mathrm {E} [(X_{1}-\mu _{1})(X_{2}-\mu _{2})]&\cdots &\mathrm {E} [(X_{1}-\mu _{1})(X_{n}-\mu _{n})]\\\\\mathrm {E} [(X_{2}-\mu _{2})(X_{1}-\mu _{1})]&\mathrm {E} [(X_{2}-\mu _{2})(X_{2}-\mu _{2})]&\cdots &\mathrm {E} [(X_{2}-\mu _{2})(X_{n}-\mu _{n})]\\\\\vdots &\vdots &\ddots &\vdots \\\\\mathrm {E} [(X_{n}-\mu _{n})(X_{1}-\mu _{1})]&\mathrm {E} [(X_{n}-\mu _{n})(X_{2}-\mu _{2})]&\cdots &\mathrm {E} [(X_{n}-\mu _{n})(X_{n}-\mu _{n})]\end{bmatrix}}.

The inverse of this matrix, $\Sigma ^{-1}$, if it exists, is the inverse covariance matrix, also known as the concentration matrix or precision matrix;^[1] see precision (statistics). The elements of the precision matrix have an interpretation in terms of partial correlations and partial variances.
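As an illustration of the entrywise definition, the sketch below replaces each expectation by an average over observed samples. The data, the mixing matrix, and the variable names are hypothetical; the 1/N normalisation is chosen to mirror the expectation, whereas many estimators use 1/(N−1) instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1000 observations (rows) of a 3-component random vector.
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.7]])
samples = rng.normal(size=(1000, 3)) @ mixing

mu = samples.mean(axis=0)
n_obs, n = samples.shape

# Entry (i, j): average of (X_i - mu_i)(X_j - mu_j) over the observations.
Sigma = np.empty((n, n))
for i in range(n):
    for j in range(n):
        Sigma[i, j] = np.mean((samples[:, i] - mu[i]) * (samples[:, j] - mu[j]))

# np.cov with bias=True applies the same 1/N normalisation.
Sigma_np = np.cov(samples, rowvar=False, bias=True)

# Precision (inverse covariance) matrix; requires Sigma to be nonsingular.
precision = np.linalg.inv(Sigma)
```

The explicit double loop is written for clarity only; in practice a single call such as `np.cov` would normally be used.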

Generalization of the variance

The definition above is equivalent to the matrix equality

\Sigma =\mathrm {E} \left[\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)^{\rm {T}}\right]

This form can be seen as a generalization of the scalar-valued variance to higher dimensions. Recall that for a scalar-valued random variable X

\sigma ^{2}=\mathrm {var} (X)=\mathrm {E} [(X-\mathrm {E} (X))^{2}]=\mathrm {E} [(X-\mathrm {E} (X))\cdot (X-\mathrm {E} (X))].\,
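The correspondence between the matrix form and the scalar variance can be checked on sample data. This is a small sketch with hypothetical data, in which expectations are again replaced by sample averages with 1/N normalisation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))        # hypothetical sample, one observation per row
centered = X - X.mean(axis=0)

# Average of the outer products (x - mean)(x - mean)^T over all observations.
Sigma = np.mean([np.outer(row, row) for row in centered], axis=0)

# The same quantity written as a single matrix product.
Sigma_matmul = centered.T @ centered / len(X)

# For a single component the construction collapses to the ordinary variance.
x = X[:, 0]
var_scalar = np.mean((x - x.mean()) * (x - x.mean()))
```

The top-left entry of `Sigma` coincides with `var_scalar`, which is the scalar formula applied to the first component alone.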

Conflicting nomenclatures and notations

Nomenclatures differ. Some statisticians, following the probabilist William Feller, call this matrix the variance of the random vector $\mathbf {X}$, because it is the natural generalization to higher dimensions of the 1-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector $\mathbf {X}$. Thus

\operatorname {var} ({\textbf {X}})=\operatorname {cov} ({\textbf {X}})=\mathrm {E} \left[({\textbf {X}}-\mathrm {E} [{\textbf {X}}])({\textbf {X}}-\mathrm {E} [{\textbf {X}}])^{\rm {T}}\right].

However, the notation for the cross-covariance between two vectors is standard:

\operatorname {cov} ({\textbf {X}},{\textbf {Y}})=\mathrm {E} \left[({\textbf {X}}-\mathrm {E} [{\textbf {X}}])({\textbf {Y}}-\mathrm {E} [{\textbf {Y}}])^{\rm {T}}\right].

The var notation is found in William Feller's two-volume book An Introduction to Probability Theory and Its Applications, but both forms are quite standard and there is no ambiguity between them.

The matrix $\Sigma$ is also often called the variance-covariance matrix since the diagonal terms are in fact variances.

Properties

For $\Sigma =\mathrm {E} \left[\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)^{\rm {T}}\right]$ and $\mu =\mathrm {E} ({\textbf {X}})$, where X is a random p-dimensional vector and Y a random q-dimensional vector, the following basic properties apply:

$\Sigma =\mathrm {E} (\mathbf {XX^{\rm {T}}} )-\mathbf {\mu } \mathbf {\mu ^{\rm {T}}}$
$\Sigma \,$ is positive-semidefinite and symmetric.
$\operatorname {cov} (\mathbf {AX} +\mathbf {a} )=\mathbf {A} \,\operatorname {cov} (\mathbf {X} )\,\mathbf {A^{\rm {T}}}$
$\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )=\operatorname {cov} (\mathbf {Y} ,\mathbf {X} )^{\rm {T}}$
$\operatorname {cov} (\mathbf {X} _{1}+\mathbf {X} _{2},\mathbf {Y} )=\operatorname {cov} (\mathbf {X} _{1},\mathbf {Y} )+\operatorname {cov} (\mathbf {X} _{2},\mathbf {Y} )$
If p = q, then $\operatorname {var} (\mathbf {X} +\mathbf {Y} )=\operatorname {var} (\mathbf {X} )+\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )+\operatorname {cov} (\mathbf {Y} ,\mathbf {X} )+\operatorname {var} (\mathbf {Y} )$
$\operatorname {cov} (\mathbf {AX} ,\mathbf {B} ^{\rm {T}}\mathbf {Y} )=\mathbf {A} \,\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )\,\mathbf {B}$
If $\mathbf {X}$ and $\mathbf {Y}$ are independent, then $\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )=0$

where $\mathbf {X} ,\mathbf {X} _{1}$ and $\mathbf {X} _{2}$ are random p×1 vectors, $\mathbf {Y}$ is a random q×1 vector, $\mathbf {a}$ is a q×1 vector, and $\mathbf {A}$ and $\mathbf {B}$ are q×p matrices.
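Several of these properties can be verified numerically, since they hold exactly for sample moments when the 1/N normalisation is used. The sketch below uses hypothetical random data; the helper `cov` and all variable names are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n_obs = 3, 2, 2000
X = rng.normal(size=(n_obs, p)) @ rng.normal(size=(p, p))   # rows are observations
A = rng.normal(size=(q, p))
a = rng.normal(size=q)

def cov(Z):
    """Sample covariance with 1/N normalisation; rows of Z are observations."""
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / len(Z)

Sigma = cov(X)
mu = X.mean(axis=0)

# Sigma = E[X X^T] - mu mu^T
second_moment = X.T @ X / n_obs
moment_identity_holds = np.allclose(Sigma, second_moment - np.outer(mu, mu))

# cov(AX + a) = A cov(X) A^T (each row x of X is mapped to A x + a)
Y = X @ A.T + a
affine_identity_holds = np.allclose(cov(Y), A @ Sigma @ A.T)

# Sigma is symmetric and positive-semidefinite
is_symmetric = np.allclose(Sigma, Sigma.T)
smallest_eigenvalue = np.linalg.eigvalsh(Sigma).min()
```

Up to floating-point rounding, `smallest_eigenvalue` is nonnegative, which is the numerical counterpart of positive-semidefiniteness.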

This covariance matrix is a useful tool in many different areas. From it a transformation matrix can be derived that allows one to completely decorrelate the data or, from a different point of view, to find an optimal basis for representing the data in a compact way (see Rayleigh quotient for a formal proof and additional properties of covariance matrices). This technique is called principal components analysis (PCA) or the Karhunen-Loève transform (KL-transform).
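The decorrelating transformation can be sketched as follows, assuming it is built from the eigenvectors of the sample covariance matrix, which is the usual construction in PCA. The data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical correlated data, one observation per row.
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.5],
                                           [0.0, 0.5]])
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(X)

# Columns of V are orthonormal eigenvectors of Sigma (the principal axes).
eigvals, V = np.linalg.eigh(Sigma)

# Projecting the centred data onto the eigenvectors decorrelates it:
# the covariance of the projected data is diagonal.
scores = Xc @ V
Sigma_scores = scores.T @ scores / len(X)
```

The diagonal of `Sigma_scores` holds the eigenvalues, that is, the variances along each principal axis; keeping only the axes with the largest variances gives a compact representation of the data.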

As a linear operator

Applied to one vector, the covariance matrix maps a linear combination, c, of the random variables, X, onto a vector of covariances with those variables: $\mathbf {c} ^{\rm {T}}\Sigma =\operatorname {cov} (\mathbf {c} ^{\rm {T}}\mathbf {X} ,\mathbf {X} )$. Treated as a bilinear form, it yields the covariance between the two linear combinations: $\mathbf {d} ^{\rm {T}}\Sigma \mathbf {c} =\operatorname {cov} (\mathbf {d} ^{\rm {T}}\mathbf {X} ,\mathbf {c} ^{\rm {T}}\mathbf {X} )$. The variance of a linear combination is then $\mathbf {c} ^{\rm {T}}\Sigma \mathbf {c}$, its covariance with itself.
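These three identities can be checked on sample data, where they hold exactly for sample moments with 1/N normalisation. The data and the coefficient vectors `c` and `d` below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1500, 3)) @ rng.normal(size=(3, 3))   # rows are observations
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(X)

c = np.array([1.0, -2.0, 0.5])   # arbitrary coefficient vectors
d = np.array([0.0, 1.0, 3.0])

u = Xc @ c    # centred values of the linear combination c^T X
v = Xc @ d    # centred values of the linear combination d^T X

cov_with_components = u @ Xc / len(X)   # cov(c^T X, X_i) for each component i
cov_dc = np.mean(v * u)                 # cov(d^T X, c^T X)
var_c = np.mean(u * u)                  # var(c^T X)
```

Comparing `c @ Sigma` with `cov_with_components`, `d @ Sigma @ c` with `cov_dc`, and `c @ Sigma @ c` with `var_c` reproduces the three statements in turn.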

Similarly, the (pseudo-)inverse covariance matrix provides an inner product, $\langle c-\mu |\Sigma ^{+}|c-\mu \rangle$, which induces the Mahalanobis distance, a measure of the "unlikelihood" of c.
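A minimal sketch of the squared Mahalanobis distance, using the Moore-Penrose pseudo-inverse so that singular covariance matrices are also accepted. The function name `mahalanobis_sq` and the example covariance matrix are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_sq(c, mu, Sigma):
    """Squared Mahalanobis distance (c - mu)^T Sigma^+ (c - mu)."""
    diff = np.asarray(c, dtype=float) - np.asarray(mu, dtype=float)
    return float(diff @ np.linalg.pinv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[5.0, 4.0],
                  [4.0, 5.0]])   # variance 9 along (1, 1), variance 1 along (1, -1)

along = mahalanobis_sq([3.0, 3.0], mu, Sigma)     # point on the high-variance axis
across = mahalanobis_sq([3.0, -3.0], mu, Sigma)   # point on the low-variance axis
```

Both points lie at the same Euclidean distance from the mean, but the point across the main axis of the distribution is much further away in the Mahalanobis sense (18 against 2 for the squared distances), reflecting that it is far less typical.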

Which matrices are covariance matrices?

From the identity just above (let $\mathbf {b}$ be a $(p\times 1)$ real-valued vector)

\operatorname {var} (\mathbf {b} ^{\rm {T}}\mathbf {X} )=\mathbf {b} ^{\rm {T}}\operatorname {var} (\mathbf {X} )\mathbf {b} ,\,

the fact that the variance of any real-valued random variable is nonnegative, and the symmetry of the covariance matrix's definition, it follows that only a symmetric positive-semidefinite matrix can be a covariance matrix. The answer to the converse question, whether every symmetric positive-semidefinite matrix is a covariance matrix, is "yes." To see this, suppose M is a p×p symmetric positive-semidefinite matrix. From the finite-dimensional case of the spectral theorem, it follows that M has a nonnegative symmetric square root, which can be denoted by $M^{1/2}$. Let $\mathbf {X}$ be any p×1 column vector-valued random variable whose covariance matrix is the p×p identity matrix. Then

\operatorname {var} (M^{1/2}\mathbf {X} )=M^{1/2}(\operatorname {var} (\mathbf {X} ))M^{1/2}=M.\,
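The construction can be reproduced numerically. The sketch below assumes a particular 3×3 matrix M and standard normal samples for X, both chosen for illustration; the sample covariance of the transformed data only approximates M because the sample is finite.

```python
import numpy as np

M = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # example symmetric positive-semidefinite matrix

# Spectral theorem: M = V diag(w) V^T with w >= 0,
# so the symmetric square root is M^{1/2} = V diag(sqrt(w)) V^T.
w, V = np.linalg.eigh(M)
M_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# X has identity covariance; applying M^{1/2} gives covariance M.
rng = np.random.default_rng(5)
X = rng.standard_normal(size=(200_000, 3))   # rows are observations
Y = X @ M_half                               # M_half is symmetric, so each row x becomes M^{1/2} x
Sigma_Y = np.cov(Y, rowvar=False)
```

Clipping the eigenvalues at zero only guards against tiny negative values produced by rounding; for an exactly positive-semidefinite M it changes nothing.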

How to find a valid covariance matrix

In some applications (e.g. building data models from only partially observed data) one wants to find the “nearest” covariance matrix to a given symmetric matrix (e.g. of observed covariances). In 2002, Higham^[2] formalized the notion of nearness using a weighted Frobenius norm and provided a method for computing the nearest correlation matrix, that is, the nearest valid covariance matrix with unit diagonal.
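A much simpler relative of this problem can be sketched in a few lines. The code below is not Higham's method: it uses the unweighted Frobenius norm and does not constrain the diagonal, and simply projects a symmetric matrix onto the set of positive-semidefinite matrices by setting negative eigenvalues to zero. The function name `nearest_psd` and the example matrix are assumptions made for illustration.

```python
import numpy as np

def nearest_psd(S):
    """Closest positive-semidefinite matrix to a symmetric matrix S in the
    unweighted Frobenius norm, obtained by clipping negative eigenvalues."""
    S = (S + S.T) / 2.0                    # enforce exact symmetry
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

# Hypothetical "observed" matrix: symmetric, but with a negative eigenvalue,
# so it cannot be the covariance matrix of any random vector.
S = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])
S_fixed = nearest_psd(S)
```

The diagonal of `S_fixed` generally differs from that of `S`, so if unit variances must be preserved a method that handles that constraint is required instead.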

Complex random vectors

The variance of a complex scalar-valued random variable with expected value μ is conventionally defined using complex conjugation:

\operatorname {var} (z)=\operatorname {E} \left[(z-\mu )(z-\mu )^{*}\right]

where the complex conjugate of a complex number $z$ is denoted $z^{*}$; thus the variance of a complex random variable is a real number.

If $Z$ is a column vector of complex-valued random variables, then its conjugate transpose is formed by both transposing and conjugating. In the following expression, the product of a vector with its conjugate transpose is a square matrix, and so is its expectation:

\operatorname {E} \left[(Z-\mu )(Z-\mu )^{\dagger }\right],

where $Z^{\dagger }$ denotes the conjugate transpose, which is applicable to the scalar case since the transpose of a scalar is still a scalar. The matrix so obtained will be Hermitian positive-semidefinite,^[3] with real numbers in the main diagonal and complex numbers off-diagonal.
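The Hermitian structure can be observed on sample data. The following sketch uses hypothetical complex-valued samples and replaces the expectation by a sample average with 1/N normalisation.

```python
import numpy as np

rng = np.random.default_rng(6)
n_obs = 1000
# Hypothetical complex data: rows are observations of a 2-component vector Z.
W = rng.normal(size=(n_obs, 2)) + 1j * rng.normal(size=(n_obs, 2))
Z = W @ np.array([[1.0, 0.5 + 0.5j],
                  [0.0, 1.0]])

Zc = Z - Z.mean(axis=0)

# Sample version of E[(Z - mu)(Z - mu)^dagger]. Each row of Zc is (Z - mu)^T,
# so entry (i, j) sums Zc[k, i] * conj(Zc[k, j]) over the observations k.
Sigma = Zc.T @ Zc.conj() / n_obs

diagonal_imag = Sigma.diagonal().imag        # zero: the variances are real
eigenvalues = np.linalg.eigvalsh(Sigma)      # real and nonnegative
```

Omitting the conjugation would give a complex symmetric matrix instead of a Hermitian one, and its diagonal would no longer consist of real variances.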

Estimation

See estimation of covariance matrices and Sample covariance matrix.

As a parameter of a distribution

If a vector of n possibly correlated random variables is jointly normally distributed, or more generally elliptically distributed, then its probability density function can be expressed in terms of the covariance matrix.
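For the jointly normal case, the density depends on the data only through the mean vector and the covariance matrix. The sketch below evaluates the logarithm of the standard multivariate normal density, assuming a nonsingular covariance matrix; the function name `gaussian_logpdf` and the example values are chosen for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of a multivariate normal distribution with mean mu and
    nonsingular covariance matrix Sigma, evaluated at the point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)             # log of det(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)       # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

mu = np.zeros(2)
Sigma = np.array([[5.0, 4.0],
                  [4.0, 5.0]])
log_density_at_mean = gaussian_logpdf(mu, mu, Sigma)
```

The quadratic form in the exponent is the squared Mahalanobis distance, and the determinant of the covariance matrix sets the normalisation; solving a linear system avoids forming the inverse explicitly.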

References

^ Wasserman, Larry (2004). All of Statistics: A Concise Course in Statistical Inference. ISBN 0-387-40272-1.
^ Higham, Nicholas J. (2002). "Computing the nearest correlation matrix—a problem from finance". IMA Journal of Numerical Analysis. 22 (3): 329–343. doi:10.1093/imanum/22.3.329.
^ Brookes, Mike. "The Matrix Reference Manual".