Sample mean and covariance

The sample mean or empirical mean and the sample covariance are statistics computed from a collection of data.

Sample mean and covariance

Given a random sample $\textstyle \mathbf {x} _{1},\ldots ,\mathbf {x} _{N}$ from an $\textstyle n$ -dimensional random variable $\textstyle \mathbf {X}$ (i.e., realizations of $\textstyle N$ independent random variables with the same distribution as $\textstyle \mathbf {X}$ ), the sample mean is

\mathbf {\bar {x}} ={\frac {1}{N}}\sum _{k=1}^{N}\mathbf {x} _{k}.

In coordinates, writing the vectors as columns,

\mathbf {x} _{k}=\left[{\begin{array}{c}x_{1k}\\\vdots \\x_{nk}\end{array}}\right],\quad \mathbf {\bar {x}} =\left[{\begin{array}{c}{\bar {x}}_{1}\\\vdots \\{\bar {x}}_{n}\end{array}}\right],

the entries of the sample mean are

{\bar {x}}_{i}={\frac {1}{N}}\sum _{k=1}^{N}x_{ik},\quad i=1,\ldots ,n.
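The entrywise formula can be sketched in a few lines of plain Python. This is only an illustration: the function name `sample_mean` and the data in `xs` are made up for the example, and each inner list stands for one observation $\textstyle \mathbf {x} _{k}$.

```python
def sample_mean(xs):
    """Entrywise mean of N observations, each a list of n numbers."""
    N = len(xs)        # number of observations
    n = len(xs[0])     # dimension of each observation
    # Entry i of the result is (1/N) * sum over k of x_{ik}.
    return [sum(x[i] for x in xs) / N for i in range(n)]


# Three made-up observations of a 2-dimensional variable.
xs = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
x_bar = sample_mean(xs)
print(x_bar)
```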

The sample covariance of $\textstyle \mathbf {x} _{1},\ldots ,\mathbf {x} _{N}$ is the $\textstyle n$ -by- $\textstyle n$ matrix $\textstyle \mathbf {Q} =\left[q_{ij}\right]$ with entries given by

q_{ij}={\frac {1}{N-1}}\sum _{k=1}^{N}\left(x_{ik}-{\bar {x}}_{i}\right)\left(x_{jk}-{\bar {x}}_{j}\right)
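As a minimal sketch of this definition, the following plain-Python code builds the matrix entry by entry. The helper names (`sample_mean`, `sample_covariance`) and the data are assumptions made for the example, not part of any library.

```python
def sample_mean(xs):
    """Entrywise mean of N observations, each a list of n numbers."""
    N = len(xs)
    return [sum(x[i] for x in xs) / N for i in range(len(xs[0]))]


def sample_covariance(xs):
    """n-by-n sample covariance matrix with N - 1 in the denominator."""
    N = len(xs)
    n = len(xs[0])
    x_bar = sample_mean(xs)
    # q_ij = 1/(N-1) * sum_k (x_ik - x_bar_i) * (x_jk - x_bar_j)
    return [
        [
            sum((x[i] - x_bar[i]) * (x[j] - x_bar[j]) for x in xs) / (N - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three made-up observations of a 2-dimensional variable.
xs = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
Q = sample_covariance(xs)
print(Q)
```

The resulting matrix is symmetric, since swapping $i$ and $j$ leaves each summand unchanged. At least two observations are needed, because the denominator is $N-1$.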

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random variable $\textstyle \mathbf {X}$ . The sample covariance matrix has $\textstyle N-1$ in the denominator rather than $\textstyle N$ essentially because the population mean $E(\mathbf {X} )$ is not known and is replaced by the sample mean $\textstyle \mathbf {\bar {x}}$ . If the population mean $E(\mathbf {X} )$ is known, the analogous unbiased estimate

q_{ij}={\frac {1}{N}}\sum _{k=1}^{N}\left(x_{ik}-E(X_{i})\right)\left(x_{jk}-E(X_{j})\right)

using the population mean does indeed have $\textstyle N$ in the denominator. This is one example of why, in probability and statistics, it is essential to distinguish between upper case letters (random variables) and lower case letters (realizations of the random variables).

The maximum likelihood estimate of the covariance

q_{ij}={\frac {1}{N}}\sum _{k=1}^{N}\left(x_{ik}-{\bar {x}}_{i}\right)\left(x_{jk}-{\bar {x}}_{j}\right)

for the Gaussian distribution case has $\textstyle N$ in the denominator as well. The two estimates differ only by the factor $\textstyle (N-1)/N$ , so the difference diminishes for large $\textstyle N$ .
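The three estimates discussed above differ only in the centering point and the denominator, which a short plain-Python sketch can make concrete. The helpers `scatter` and `scaled`, the data, and the "known" mean `mu` are all hypothetical choices made for illustration.

```python
def scatter(xs, center):
    """Matrix of summed products of deviations from the given center."""
    n = len(center)
    return [
        [sum((x[i] - center[i]) * (x[j] - center[j]) for x in xs) for j in range(n)]
        for i in range(n)
    ]


def scaled(S, denominator):
    """Divide every entry of the matrix S by the denominator."""
    return [[s / denominator for s in row] for row in S]


xs = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]  # made-up observations
N = len(xs)
x_bar = [sum(x[i] for x in xs) / N for i in range(2)]

S = scatter(xs, x_bar)
Q_unbiased = scaled(S, N - 1)  # unknown mean, replaced by the sample mean
Q_ml = scaled(S, N)            # Gaussian maximum likelihood estimate

mu = [2.0, 5.0]                # hypothetical known population mean
Q_known = scaled(scatter(xs, mu), N)  # known mean, denominator N

print(Q_unbiased, Q_ml, Q_known)
```

With only three observations the first two estimates differ by a factor of $2/3$; with a thousand observations the factor would be $999/1000$.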

Weighted samples

In a weighted sample, each vector $\textstyle \mathbf {x} _{k}$ is assigned a weight $\textstyle w_{k}\geq 0$ . Without loss of generality, assume that the weights are normalized:

\sum _{k=1}^{N}w_{k}=1.

(If they are not, divide the weights by their sum.) Then the weighted mean $\textstyle \mathbf {\bar {x}}$ and the weighted covariance matrix $\textstyle \mathbf {Q} =\left[q_{ij}\right]$ are given by

\mathbf {\bar {x}} =\sum _{k=1}^{N}w_{k}\mathbf {x} _{k}

and^[1]

q_{ij}={\frac {\sum _{k=1}^{N}w_{k}\left(x_{ik}-{\bar {x}}_{i}\right)\left(x_{jk}-{\bar {x}}_{j}\right)}{1-\sum _{k=1}^{N}w_{k}^{2}}}.

If all weights are the same, $\textstyle w_{k}=1/N$ , the denominator becomes $\textstyle 1-1/N=(N-1)/N$ , and the weighted mean and covariance reduce to the sample mean and covariance above.
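The weighted formulas, including the normalization step and the reduction to the unweighted case, can be sketched in plain Python as follows. The function name `weighted_mean_cov` and the data are assumptions made for this example; the code is a direct transcription of the formulas above, not the implementation of any particular library.

```python
def weighted_mean_cov(xs, ws):
    """Weighted mean and weighted covariance of observations xs with weights ws."""
    total = sum(ws)
    ws = [w / total for w in ws]  # normalize so the weights sum to 1
    n = len(xs[0])
    x_bar = [sum(w * x[i] for w, x in zip(ws, xs)) for i in range(n)]
    # Denominator 1 - sum(w_k^2); it is zero if one weight carries all the mass.
    denom = 1.0 - sum(w * w for w in ws)
    Q = [
        [
            sum(w * (x[i] - x_bar[i]) * (x[j] - x_bar[j]) for w, x in zip(ws, xs)) / denom
            for j in range(n)
        ]
        for i in range(n)
    ]
    return x_bar, Q


xs = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]  # made-up observations

# Equal weights: agrees (up to rounding) with the unweighted sample statistics.
x_bar_eq, Q_eq = weighted_mean_cov(xs, [1.0, 1.0, 1.0])

# Unequal weights pull the mean toward the heavily weighted observation.
x_bar_w, Q_w = weighted_mean_cov(xs, [2.0, 1.0, 1.0])

print(x_bar_eq, Q_eq)
print(x_bar_w, Q_w)
```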

Criticism

The sample mean and sample covariance are widely used in statistics and applications, and are likely the most common measures of location and dispersion, respectively: they are easily calculated and possess desirable characteristics.

However, they suffer from certain drawbacks; notably, they are not robust statistics, meaning that they are thrown off by outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location,^[2] and the interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorizing, as in the trimmed mean and the Winsorized mean.
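A small experiment illustrates the point, using Python's standard `statistics` module on made-up scalar data in which one value is replaced by an outlier. The helpers `iqr`, `trimmed_mean` and `winsorized_mean` are hypothetical, written for this sketch, and the IQR depends on the quantile convention (here the default of `statistics.quantiles`).

```python
import statistics


def iqr(data):
    """Interquartile range, using the default method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


def trimmed_mean(data, k):
    """Mean after discarding the k smallest and k largest values."""
    s = sorted(data)
    return statistics.mean(s[k:len(s) - k])


def winsorized_mean(data, k):
    """Mean after replacing the k extreme values on each side by their nearest kept neighbor."""
    s = sorted(data)
    s[:k] = [s[k]] * k
    s[len(s) - k:] = [s[len(s) - k - 1]] * k
    return statistics.mean(s)


clean = [float(v) for v in range(1, 10)]  # 1.0, 2.0, ..., 9.0
dirty = clean[:-1] + [900.0]              # same data with one gross outlier

for name, data in (("clean", clean), ("dirty", dirty)):
    print(
        name,
        "mean:", statistics.mean(data),
        "stdev:", statistics.stdev(data),
        "median:", statistics.median(data),
        "IQR:", iqr(data),
        "trimmed:", trimmed_mean(data, 1),
        "Winsorized:", winsorized_mean(data, 1),
    )
```

In this example the mean and standard deviation change drastically when the outlier is introduced, while the median, IQR, trimmed mean and Winsorized mean are unchanged.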

References

[1] Mark Galassi, Jim Davies, James Theiler, Brian Gough, Gerard Jungman, Michael Booth, and Fabrice Rossi. GNU Scientific Library - Reference manual, Version 1.9, 2007. Sec. 20.6 Weighted Samples.
[2] Bart Kosko. The Sample Mean. The World Question Center 2006.