Talk:Sample mean and sample covariance

From Wikipedia, the free encyclopedia
WikiProject Statistics (Rated Start-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as Start-Class on the quality scale.
This article has been rated as Mid-importance on the importance scale.

Explanation needed

Perhaps someone more knowledgeable than me can add an explanation of what a weighted sample and its covariance actually mean, especially from the probability theory point of view (sample space). I know of three contexts: 1. weighted linear regression; 2. biased samples, e.g., Efromovich (2004); 3. weighted ensembles, as in particle filters. Jmath666 06:24, 12 March 2007 (UTC)

Typo

Should the individual entries for the sample mean have a (1/N) in front of them? Or am I missing something here? --WillBecker 11:05, 2 May 2007 (UTC)

Thank you, I have fixed that now. You can be bold and make changes yourself. I do not own this (or any other page) even if I made most of the edits here to date. Thanks again for your help. Jmath666 20:49, 2 May 2007 (UTC)

Notation

Can we change the notation from \bar{x} to \mu_x? There just seems to be a ton of x's flying around, and it may be easier to follow with mu's in some places. daviddoria (talk) 18:44, 11 September 2008 (UTC)

Samples weighted with a matrix?

The "Weighted samples" section assumes each sample has a scalar weight. What if you have a weight matrix? —Ben FrantzDale (talk)

Nomenklatura

Is there any special name to the quantity s^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 and its square root s? Albmont (talk) 16:12, 14 November 2008 (UTC)

It could be called the sample variance, but that term is also used for the expression with n − 1 in the denominator. It is the maximum likelihood estimate of the population variance if the sample is taken to be from a normal distribution of unknown population variance. Michael Hardy (talk) 16:48, 14 November 2008 (UTC)
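
As a small illustration of the two conventions mentioned above, Python's standard `statistics` module provides both: `pvariance` divides by n, and `variance` divides by n - 1. The data values are made up for this example and do not come from the article.

```python
import statistics

# Illustrative data only (hypothetical, not from the article).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sum of squared deviations divided by n (the quantity asked about above).
var_n = statistics.pvariance(data)

# Sum of squared deviations divided by n - 1 (the other common "sample variance").
var_n_minus_1 = statistics.variance(data)

# Square root of the n-denominator version.
sd_n = statistics.pstdev(data)

print(var_n, var_n_minus_1, sd_n)
```

The two variances differ only by the factor n / (n - 1), so the distinction matters most for small samples.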

Weighted Covariance

Is the formula for the weighted covariance really correct? I'm wondering about the normalizing part... If the formula from the cited page is correct, doesn't the denominator here have the wrong sign? 141.76.62.164 (talk) 11:06, 26 November 2008 (UTC)

The denominator is positive, all fine there. Let me show what happens in the special case of equal weights, w_k = 1/N (so the weights sum to 1):
 q_{ij}=\frac{\sum_{k=1}^{N}w_{k}\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right)}{1-\sum_{k=1}^{N}w_{k}^{2}}
=\frac{\sum_{k=1}^{N}(1/N)\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right)}{1-\sum_{k=1}^{N}(1/N)^{2}}
=\frac{(1/N)\sum_{k=1}^{N}\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right)}{1-1/N}
=\frac{\sum_{k=1}^{N}\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right)}{N-1}.

Jmath666 (talk) 23:31, 26 November 2008 (UTC)
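
The equal-weights reduction above can also be checked numerically. The following is only a sketch: the function names and data are hypothetical, and it assumes positive weights that are normalized to sum to 1 and a weighted mean in place of \bar{x}.

```python
def weighted_cov(xs, ys, weights):
    """Weighted covariance with denominator 1 - sum(w_k^2).

    Assumption: weights are positive; they are normalized to sum to 1 here.
    """
    total = sum(weights)
    w = [wk / total for wk in weights]
    x_bar = sum(wk * x for wk, x in zip(w, xs))  # weighted mean of xs
    y_bar = sum(wk * y for wk, y in zip(w, ys))  # weighted mean of ys
    numerator = sum(wk * (x - x_bar) * (y - y_bar)
                    for wk, x, y in zip(w, xs, ys))
    return numerator / (1.0 - sum(wk * wk for wk in w))


def unbiased_cov(xs, ys):
    """Ordinary sample covariance with N - 1 in the denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Made-up data for illustration.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]

equal = weighted_cov(xs, ys, [1.0] * len(xs))  # equal weights, w_k = 1/N
plain = unbiased_cov(xs, ys)
print(equal, plain)
```

With equal weights the two results agree up to rounding, which matches the algebra: the denominator 1 - 1/N is positive and turns the factor 1/N into 1/(N - 1).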

Thanks! You're right. 141.76.62.164 (talk) 18:19, 28 November 2008 (UTC)

Denominator Explanation is lacking

The page says:

"The reason the sample covariance matrix has \textstyle N-1 in the denominator rather than \textstyle N is essentially that the population mean E(X) is not known and is replaced by the sample mean \mathbf{\bar{x}}. If the population mean E(X) is known, the analogous unbiased estimate
 q_{jk}=\frac{1}{N}\sum_{i=1}^N \left(  x_{ij}-E(X_j)\right)  \left( x_{ik}-E(X_k)\right),
using the population mean, has \textstyle N in the denominator."

This is essentially meaningless, since the direct implication is not spelled out. Why does the population mean cause the denominator to increase by 1? If the causal link is not present, it seems equivalent to saying something like "The denominator is N-1 because Iran is a country" or "because cheese is made from milk." While this seems ridiculous to someone familiar with math, those who are not gather nothing from that statement. --18.111.14.144 (talk) 23:14, 12 October 2011 (UTC)

Agree; I fixed this by noting that the sample mean is correlated with the sample it's being compared against and referring to Bessel's correction for more details. Eamon Nerbonne (talk) 12:17, 15 October 2011 (UTC)
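
To make the link between the sample mean and the N - 1 denominator concrete, here is a minimal sketch that enumerates every sample of size N = 2 drawn with replacement from a tiny, hypothetical population and averages the sum of squared deviations exactly. The population values and all names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

population = [0, 1, 2]  # hypothetical, equally likely values
N = 2                   # sample size

pop_mean = Fraction(sum(population), len(population))
pop_var = sum((Fraction(v) - pop_mean) ** 2 for v in population) / len(population)


def sum_sq_dev(sample, center):
    """Sum of squared deviations of the sample values from `center`."""
    return sum((Fraction(v) - center) ** 2 for v in sample)


def average(values):
    values = list(values)
    return sum(values) / len(values)


# All equally likely samples of size N drawn with replacement.
samples = list(product(population, repeat=N))

# Deviations measured from each sample's own mean.
around_sample_mean = average(sum_sq_dev(s, Fraction(sum(s), N)) for s in samples)

# Deviations measured from the known population mean.
around_pop_mean = average(sum_sq_dev(s, pop_mean) for s in samples)

print(pop_var, around_sample_mean, around_pop_mean)
```

In this example the expected sum of squares around the sample mean equals (N - 1) times the population variance, while around the population mean it equals N times the population variance. The sample mean sits closer to its own sample than the population mean does, so dividing by N - 1 in the first case and by N in the second gives an unbiased estimate in both.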

Text structure / row vs. column vectors.

Usually, random vectors are presented as column vectors; e.g. see the Random Vector page. This page presents them as row vectors, which is potentially confusing. I think we should swap that.

Also, the page is complex; there are lots of variables and lots of subscripts, some of which are used before they are introduced, and many of which are introduced without clear context. For example, the x_{ij} variable is introduced before x, i, and j are introduced individually. I think a lot of this complexity can simply be dropped. E.g.:

The sample mean vector \mathbf{\bar{x}} is a row vector whose jth element (j = 1, ..., K) is the average value of the N observations on the jth random variable. Thus the sample mean vector is the average of the row vectors of observations on the K variables:
==> change to ==>
The sample mean vector \mathbf{\bar{x}} is the element-wise mean of \mathbf{x}'s observations. So the jth element of \mathbf{\bar{x}} is the average of the jth elements of the observations of \mathbf{x}:

We really don't need to repeat the fact that there are N observations and K elements. Whether it's a row or column vector, what exactly the range of the indexes is, or how many variables there are is not central to the point; this text (and lots of other bits) is basically piloting prose to intuitively highlight the subsequent formula. For readers who don't know the topic in detail, it's just confusing; and for those who do but just want to look up a detail, it's telling them something they know and making the actual formula harder to find.

So I'd propose using the normal approach of column vectors, to introduce the variables a little more elaborately, and to rephrase text focusing on the intent/intuition behind the formula rather than a precise replacement for the formula. Finally, it'd be nice to add a variant formula using matrix/vector notation rather than elementwise sums; all those subscripts make it look more complicated than it is. Eamon Nerbonne (talk) 12:15, 15 October 2011 (UTC)
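
One possible form of the matrix/vector variant proposed above, written with column vectors. This is only a sketch: the symbols \mathbf{x}_i (the ith observation, a column vector with K elements), \mathbf{X} (the K-by-N matrix whose columns are the observations), and \mathbf{1} (the column vector of N ones) are notation assumed here, not taken from the article.

```latex
\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i
                 = \frac{1}{N}\,\mathbf{X}\mathbf{1},
\qquad
\mathbf{Q} = \frac{1}{N-1}\sum_{i=1}^{N}
             \left(\mathbf{x}_i-\bar{\mathbf{x}}\right)
             \left(\mathbf{x}_i-\bar{\mathbf{x}}\right)^{\mathrm{T}}
           = \frac{1}{N-1}
             \left(\mathbf{X}-\bar{\mathbf{x}}\,\mathbf{1}^{\mathrm{T}}\right)
             \left(\mathbf{X}-\bar{\mathbf{x}}\,\mathbf{1}^{\mathrm{T}}\right)^{\mathrm{T}}
```

The (j, k) entry of \mathbf{Q} is the usual elementwise sum, so this form says the same thing with fewer subscripts.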

Sample covariance matrix rapid calculation

Is there a way to calculate the sample covariance matrix incrementally? Something like the rapid calculation of the one-dimensional standard deviation, but for multidimensional data. Wat (talk) 20:34, 7 February 2012 (UTC)

Such topics (or very similar ones) are dealt with in Algorithms for calculating variance; there are formulae for variances and covariances there (but not in matrix form). Melcombe (talk) 22:56, 7 February 2012 (UTC)
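
For what it's worth, a one-pass (Welford-style) update extends to the full covariance matrix. The sketch below is an assumption-laden illustration, not the article's method: the class name, its methods, and the data are all hypothetical, and it uses plain Python lists rather than a matrix library.

```python
class OnlineCovariance:
    """Running mean vector and covariance matrix, updated one observation at a time."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0
        self.mean = [0.0] * dim
        # Running sum of products of deviations (co-moment matrix).
        self.comoment = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.n += 1
        # Deviation from the mean before this observation is included.
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + d / self.n for mi, d in zip(self.mean, delta_old)]
        # Deviation from the mean after this observation is included.
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for j in range(self.dim):
            for k in range(self.dim):
                self.comoment[j][k] += delta_old[j] * delta_new[k]

    def covariance(self):
        """Sample covariance matrix with N - 1 in the denominator."""
        if self.n < 2:
            raise ValueError("need at least two observations")
        return [[c / (self.n - 1) for c in row] for row in self.comoment]


# Made-up observations, each with K = 2 elements.
data = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [7.0, 6.0]]

acc = OnlineCovariance(dim=2)
for row in data:
    acc.update(row)

cov = acc.covariance()
print(acc.mean, cov)
```

Each update costs O(K^2) and needs only the current count, mean vector, and co-moment matrix, so the observations themselves do not have to be stored.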