Talk:Algorithms for calculating variance
Statistics Start‑class Mid‑importance | ||||||||||
|
Mathematics Start‑class Mid‑priority | ||||||||||
|
Online algorithm in testing yields horrible results when mean=~0
Most all the tests I've seen of these algorithms add some unrealistic constant (i.e. 10^6 or larger) to the dataset to demonstrate that the suggested algorithm on this page is indeed better. I naively used this algorithm in my own work, to horrible effect. My dataset consists of a large number of discrete values, perhaps with the values -1, 0, or 1, and with an average usually between -1 and 1. I wrote the following simple test program to demonstrate the difference in results between what I'll call METHOD1 (using a running sum and sum of squares of the dataset) and METHOD2 (using a running computation of the average and the variance, which the current wiki strongly recommends).
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
using namespace std;
int main( )
{
srand( time( NULL ) );
const double target = 0.95;
float sum = 0;
float average = 0;
float sumsq = 0;
float qvalue = 0;
double sumd = 0;
double averaged = 0;
double sumsqd = 0;
double qvalued = 0;
int numtrials = 0;
const int width = 15;
cout << setw( width ) << left << "numtrials"
<< setw( width ) << "float avg 2"
<< setw( width ) << "float avg 1"
<< setw( width ) << "double avg 2"
<< setw( width ) << "double avg 1"
<< setw( width ) << "float std 2"
<< setw( width ) << "float std 1"
<< setw( width ) << "double std 2"
<< setw( width ) << "double std 1"
<< endl;
while( true )
{
for( int i = 0; i < 1000000; i++ )
{
const int sample = ( static_cast< double >( rand( ) ) / RAND_MAX < target ? 1 : 0 );
numtrials++;
sum += sample;
sumd += sample;
const float delta = sample - average;
average += delta / numtrials;
const double deltad = sample - averaged;
averaged += deltad / numtrials;
sumsq += sample * sample;
sumsqd += sample * sample;
qvalue += delta * ( sample - average );
qvalued += deltad * ( sample - averaged );
}
cout << fixed << setprecision( 6 );
cout << setw( width ) << left << numtrials
<< setw( width ) << average
<< setw( width ) << sum / numtrials
<< setw( width ) << averaged
<< setw( width ) << sumd / numtrials
<< setw( width ) << sqrt( qvalue / ( numtrials - 1 ) )
<< setw( width ) << sqrt( ( sumsq - ( sum / numtrials ) * sum ) / ( numtrials - 1 ) )
<< setw( width ) << sqrt( qvalued / ( numtrials - 1 ) )
<< setw( width ) << sqrt( ( sumsqd - ( sumd / numtrials ) * sumd ) / ( numtrials - 1 ) )
<< endl;
}
return 0;
}
And here sample output:
numtrials float avg 2 float avg 1 double avg 2 double avg 1 float std 2 float std 1 double std 2 double std 1
1000000 0.948275 0.950115 0.950115 0.950115 0.218147 0.217707 0.217707 0.217707
2000000 0.941763 0.949966 0.949966 0.949966 0.217107 0.218015 0.218015 0.218015
3000000 0.922894 0.949982 0.949982 0.949982 0.217433 0.217982 0.217982 0.217982
4000000 0.909789 0.950044 0.950044 0.950044 0.215531 0.217854 0.217854 0.217854
5000000 0.899830 0.950042 0.950042 0.950042 0.219784 0.217859 0.217859 0.217859
6000000 0.890922 0.950006 0.950006 0.950006 0.218891 0.217933 0.217933 0.217933
7000000 0.884997 0.950047 0.950047 0.950047 0.215908 0.217848 0.217848 0.217848
8000000 0.879075 0.950082 0.950082 0.950082 0.213635 0.217776 0.217776 0.217776
9000000 0.873134 0.950091 0.950091 0.950091 0.214217 0.217758 0.217758 0.217758
10000000 0.868035 0.950095 0.950095 0.950095 0.219110 0.217749 0.217749 0.217749
11000000 0.865048 0.950076 0.950076 0.950076 0.220991 0.217788 0.217788 0.217788
12000000 0.862079 0.950086 0.950086 0.950086 0.218815 0.217768 0.217768 0.217768
13000000 0.859129 0.950118 0.950118 0.950118 0.216916 0.217701 0.217701 0.217701
14000000 0.856129 0.950086 0.950086 0.950086 0.215379 0.217768 0.217768 0.217768
15000000 0.853163 0.950096 0.950096 0.950096 0.213971 0.217746 0.217746 0.217746
16000000 0.850167 0.950074 0.950074 0.950074 0.212786 0.217793 0.217793 0.217793
17000000 0.847186 0.950069 0.950069 0.950069 0.211621 0.217803 0.217803 0.217803
18000000 0.844209 0.932068 0.950068 0.950068 0.210246 0.251630 0.217805 0.217805
19000000 0.841231 0.883011 0.950066 0.950066 0.209009 0.321407 0.217808 0.217808
20000000 0.838260 0.838861 0.950071 0.950071 0.207879 0.367659 0.217798 0.217798
21000000 0.835285 0.798915 0.950072 0.950072 0.206857 0.400811 0.217797 0.217797
22000000 0.832305 0.762601 0.950069 0.950069 0.205931 0.425489 0.217803 0.217803
23000000 0.829313 0.729444 0.950057 0.950057 0.205096 0.444247 0.217828 0.217828
24000000 0.826340 0.699051 0.950059 0.950059 0.204305 0.458671 0.217823 0.217823
25000000 0.823366 0.671089 0.950061 0.950061 0.203576 0.469818 0.217819 0.217819
26000000 0.820379 0.645278 0.950055 0.950055 0.203316 0.478429 0.217832 0.217832
27000000 0.817389 0.621378 0.950047 0.950047 0.202405 0.485044 0.217849 0.217849
28000000 0.816234 0.599186 0.950046 0.950046 0.201543 0.490063 0.217849 0.217849
29000000 0.816234 0.578525 0.950056 0.950056 0.200723 0.493795 0.217829 0.217829
30000000 0.816234 0.559241 0.950035 0.950035 0.200000 0.496478 0.217872 0.217872
31000000 0.816234 0.541201 0.950036 0.950036 0.199291 0.498300 0.217871 0.217871
32000000 0.816234 0.524288 0.950039 0.950039 0.198619 0.499410 0.217864 0.217864
33000000 0.816234 0.508400 0.950038 0.950038 0.197993 0.499929 0.217867 0.217867
34000000 0.816234 0.493448 0.950034 0.950034 0.197406 0.499957 0.217876 0.217876
35000000 0.816234 0.479349 0.950037 0.950037 0.196839 0.499573 0.217868 0.217868
36000000 0.816234 0.466034 0.950038 0.950038 0.196306 0.498845 0.217866 0.217866
37000000 0.816234 0.453438 0.950037 0.950037 0.195804 0.497827 0.217868 0.217868
38000000 0.816234 0.441506 0.950036 0.950036 0.195328 0.496567 0.217871 0.217871
39000000 0.816234 0.430185 0.950032 0.950032 0.194878 0.495102 0.217878 0.217878
Note how the computed average using float's and method 2 fails to six digits accuracy before even 1 million trials, while method 1 using floats reproduces the double results all the way out to 17 million trials. — Preceding unsigned comment added by 173.219.85.45 (talk) 19:41, 19 October 2012 (UTC)
Untitled
Editorial comment: it is actually the first formula that has precision problems when dealing with limited-precision arithmetic. If the difference between measurements and the mean is very small, then the first formula will yield precision problems, as information will be lost in the (xi - µ) operation. There is no such loss of significance in the intermediate operations of the second formula. -- ScottMoonen
Editorial comment the second: in fact, the second formula is the one more commonly beset with problems. The first can have problems when the mean is very large relative to the variance, but this is relatively rare in practice and this problem also affects the second formula. Much more common is the situation where you have comparable mean and variance and a very large number of observations. In this case, the second formula will result in the subtraction of two very large numbers whose difference is relatively small (by a factor roughly equal to the number of observations). If you have one million observations, you lose roughly six significant figures with the second formula if you use ordinary floating point arithmetic. -- TedDunning
comment: The problem may occur
- when the deviations are very small relative to the mean or
- when they are small relative to the representational capacity of the arithmetic instrument (floating point computer, fixed point calculator, paper and pencil).
To be precise we have to specify the instrument and the nature of the data. -- DickBeldin
Ted, what do you mean by ordinary floating point arithmetic? Are talking about precision with repsect to register width, or is there an implicit reference to some "non-ordinary" scheme, that is, non-IEEE? Thanks.
I thought the reference was pretty explicit. I didn't mean, however, to imply the use of non-standard but still floating point arithmetic. I had in mind systems that were entirely different. A simple example is a system that uses rationals based on unlimited precision integers. Another system might be fixed point arithmetic. The first would not lose precision in the computation while the second would lose more or less precision than I quote depending on the values in question and the characteristics of the fixed point system. You could also imagine an optimizer that noted that you might be adding up large numbers of very small numbers and would re-order the addition to avoid most of the round-off errro by using log N temporary values to perform the addition. -- TedDunning
FWIW my favorite variance algorithm is the 'corrected two pass algorithm' of Chan, Golub, Leveque (American Statistician, 1983, page 242):
avg=0 for all i avg=avg+data[i] end for avg=avg/n
sum=0 sum2=0 for all i sum2=sum2+(data[i]-avg)^2 sum=sum+(data[i]-avg) end for var=(sum2-sum/n)/(n-1)
return var
AlexC
- I added this algorithm (as Algorithm II), but without the sum in the second loop -- the sum of the variations from the mean is of course 0 (). But perhaps this is a clever sort of compensation, where the errors in sum and sum2 are known to be (probably, perhaps) correlated, so that subtracting them is a benefit? Still, though, for all datasets where , (as in sum and sum2), so the gain in accuracy might be irrelevant. And, if the second loop is meant to be the normal variance operation on preconditioned data, why isn't sum squared at the end? Anyway, if it's a Good Thing that I've omitted, please tell me and/or add it to the article (with explanation!). --Tardis 03:26, 9 June 2006 (UTC)
I removed the following comments from the article as they do not belong there. I have posted them here so that somebody may investigat them further. It would be nice if users use the talk page for comments and not the article. --Richss 16:12, Nov 11, 2004 (UTC)
- Why is the second equation n(n-1)? I know two possibilities n (unbiased) or n-1 biased...See also the article Variance in wikipedia. This equations do not look good... --132.231.54.1
- The above algorithm does not seem to be correct. The resulting variance does not equal either one of the variances whose formulas are given at the top of the page. --137.226.12.196
--- I had a look at the source for AlexCs algorithm. I believe there is a mistake: var=(sum2-sum/n)/(n-1) should read var=(sum2-sum*sum/n)/(n-1)
The result of this algorithm agrees with my graphic calculator. I think the corrected algorithm should be added to the page
-- aj504@york.ac.uk who still hasn't got a wikipedia account :)
Incorrect Algorithm?
The second algorithm doesn't look right to me. At each step it's calculating the variance of data[i] with respect to the *average so far*, not the true average.
Can someone confirm that I'm not just reading it wrong?
I have derived and added the update formulas (only the results) for the unbiased/biased variance. The clearly shown that the (old) second algorithm which was in dispute was wrong, so I removed it. I have also removed the example which just shows some numbers (was it a weak numerical proof?) and does not provide any insight on how the formulas work. It was not clear whether the algorithms were for biased or unbiased estimators, so I added comments on it. BurkayDonderici
Algorithm III does not compute
Hi, In the mathematical equation for m(new), I don't agree with (n+1) in the denominator. I believe that the denominator should be simply (n).
I didn't want to edit the entry since I am not a mathematician. CAN SOMEONE WHO IS PLEASE MAKE THE APPROPRIATE COMMENTS .
If you plug some numbers into the equation shown, and compare your results to what a spreadsheet calculator gives, you will see that there is a significant error.
I tested the show equation in MS Excel. I used a population of 404 values generated by the RANDBETWEEN(0,1000) function.
I used 4 methods to calculate the average at each point through the population.
Method 1) (Cumulative total)/(Number of samples) [this calculation was done for each sample for tracking purposes
Method 2) Previous average + new contribution. M(new) = M(old) + ( (M(old)-X(new))/n )
Method 3) Previous average + new contribution. M(new) = M(old) + ( (M(old)-X(new))/(n+1) )
Method 4) I then used the Average function on the total population.
Methods 1 and 2 produced the same results all the way through the population and matched Method 4's result for the whole population. However, Method 3 always produced a lower result, and the error (which started significatly) reduced as the number of samples grew.
Regards, Napoleon BlownApart
Algorithm III
I'm pretty sure this is wrong. From Welford's Paper:
S[n]=S[n-1]+(n-1)/n*(x[n]-mean[n-1])^2
--cfp 21:58, 15 August 2006 (UTC)
- I'm talking rubbish it's fine as x[n]-mean[n] = (n-1)/n*(x[n]-mean[n-1]). Sorry! --cfp 22:20, 15 August 2006 (UTC)
--Gabriel (talk) 21:35, 17 July 2011 (UTC)
I think, the advantages of Welford's algorithm should be toned down. It is no more numerically stable than the direct two-pass method. See http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/ and http://www.jstor.org/stable/1267176 .
Online algorithm
I removed the line about Knuth/Welford being on online algorithm. All the algorithms in this article are online algorithms, it's not worth noting. —johndburger 12:24, 18 September 2007 (UTC)
- Well, algorithms I and III only scan the list of data once, whereas both versions of algorithm II scan it twice. Hence, since I/O is by several orders of magnitude slower than calculations, that means that, unless the whole data set is read into memory, algorithm II will be twice as slow, and it won't work if the data file is not seekable. --Army1987 16:00, 28 October 2007 (UTC)
The Welford algorithm *is* an on-line algorithm, because it needs to see only one data item at a time! It does *not* need to store the complete set of data, i.e., it does not need to look at any data item more than once. — Preceding unsigned comment added by Snoopy67 (talk • contribs) 21:30, 17 July 2011 (UTC)
Algorithm IV
Doesn't the final step in West's algorithm have a typo? Instead of:
Variance = S * sumweight * n / (n-1)
Shouldn't it actually be:
Variance = S / sumweight * n / (n-1)
Looking at West's paper, he has "mathematical description" and "computational description" side by side. The math for the final line looks like: And if you work through it, should be the variance times the (incremental) weight for the recurrence to work. West's "computational description" is as its presented here, but I think its a typo in West.
Shauncutts (talk) 06:15, 29 January 2008 (UTC)
- Actually, I'm pretty sure the last line should be: Variance = S * n / ((n-1) * sumweight)
- (I just implemented this algorithm based on the original paper, and have been manually checking results - I'm going to edit the main page to reflect this)
Incidentally, would it not make sense to alter algorithm 4 by adding the temporary variables used in the original paper. This will save at least three operations per loop, at the cost of making it less readable. Specifically the algorith would then look like this:
n = 0 foreach x in the data: if n=0 then n = 1 mean = x S = 0 sumweight = weight else n = n + 1 temp = weight + sumweight Q = x-mean R = Q*weight/temp S = S + sumweight*R*Q mean = mean + R sumweight = temp end if end for Variance = S * n / ((n-1) * sumweight) // if sample is the population, omit n/(n-1)
Gorm (talk) 12:54, 19 February 2008 (UTC)
- Agreed. I've now added the intermediary variables Q and R to the formula as per West's original paper The imp (talk) 10:35, 30 September 2009 (UTC)
Minor spelling and grammar fixes
OSFTactical (talk) 22:59, 6 December 2007 (UTC)
Suspecting a formulea error
On 17/11/2007 Hi,
I suspect that the two end formulaes given for the third algorithm :
variance_n = sqrt(M2/n) variance = sqrt(M2/(n - 1))
are not the "variance" but the "standard deviation" (Ecart-type in French)
Can someone check it ?
Pascal P. 90.37.121.130 (talk) 13:52, 17 November 2008 (UTC)
Division by zero problem?
The online algorithm should work with n=1 since it needs to progress through that stage to get to n=2, n=3, etc. Therefore, it would be incorrect to calculate the variance as
variance = M2 / (n-1)
A possible fix is:
if n > 1 then variance = M2 / (n-1) else variance = 0 end if
Does any one have any objections to me updating this? JBrusey (talk) 14:06, 9 April 2009 (UTC)
- The sample variance is not defined for samples of size 1. So while it is good to avoid dividing by 0, the value 0 is wrong too. I suggest writing
if n > 1 then variance = M2 / (n-1) else variance = undefined end if
- Btw, I don't understand your comment at the beginning since the problematic division is outside the loop. McKay (talk) 01:02, 10 April 2009 (UTC)
- It should not say "else variance = 0". That implies the sample variance is 0 when n = 1. That is incorrect. The sample variance is undefined in that case. Michael Hardy (talk) 01:18, 10 April 2009 (UTC)
Easiest online algorithm yet.
Hi all,
I was looking at Murray Spiegel's Probability and Statistics (1994), and it lists this formulation for variance:
Mathworld had a similar formulation for standard deviation. This all means that, as long as you keep running tabs of:
- sum of all samples
- sum of squares of all samples
- number of samples
then you have an easy way to compute the variance as you go. You also have an easy way to combine the variances of two populations. Just add the sums (and sums of squares) and the counts. No divide-by-zeros. Very few math operations. What more can you ask for? :)
- That is already in the article (the first algorithm), and the problem with it is stated there. McKay (talk) 02:14, 2 May 2009 (UTC)
- Ah, well the connection between the two wasn't spelled out to the satisfaction of my (possibly caffeine-deprived) mind. And, as far as I can tell, the online algorithm does not easily merge statistics from multiple populations, but rather updates the statistic one sample at a time. Correct me if I am wrong, or please add a suitable method to the page. 209.237.23.36 (talk) 22:56, 13 May 2009 (UTC)
- The connection to the second equation of the "naive algorithm" follows immediately from the definition of and . The "parallel algorithm" is an online algorithm that easily merges statistics from multiple population (as noted in the text), though a better heading might make this clear. The key point is that a parallel algorithm can be implemented as a repeated serial algorithm, and the merging part is identical in both. Markjamesabraham (talk) 22:45, 26 May 2010 (UTC)
Pseudocode
The numerically stable mean and variance computation works, but attempting something similar for skewness and kurtosis using the incremental moments computation code in the later section seems to fail. Is something missing in it ? Are there any reference implementations ? Shyamal (talk) 16:17, 14 July 2009 (UTC)
- I've implemented the Terriberry moment-based method for skewness and kurtosis in the incremental case (i.e. B={x}) and it worked fine. Markjamesabraham (talk) 22:36, 26 May 2010 (UTC)
- Some published code is at http://lib.stat.cmu.edu/apstat/52 , which also indicates its original source. Melcombe (talk) 12:13, 27 May 2010 (UTC)
- Actually I subsequently found that Chan's online-merging algorithm does fine for higher-order moments, but calculates the mean very badly. When A and B have large and comparable size, for is not as numerically stable as when one of A and B are very much larger than the other. The error contribution from the subtraction that forms is scaled only by about 1/2, whereas when the error contribution from the subtraction is divided by . Instead, prefer to use . Main page updated accordingly Markjamesabraham (talk) 14:54, 30 May 2010 (UTC)
- Some published code is at http://lib.stat.cmu.edu/apstat/52 , which also indicates its original source. Melcombe (talk) 12:13, 27 May 2010 (UTC)
Compensated variant
I cannot make sense of the "compensated" algorithm. It does not appear to be using ordinary Kahan summation. Bkalafut (talk) 01:19, 9 July 2010 (UTC)
Hey
Actually I just wanted to ask that on the page :Algorithms for calculating variance is the pseudocode that is given for calculating variance correct?
variance = E(X**2) - (E(X))**2 Computational formula for the variance
Should'nt it be mean*mean as opposed to Sum * mean?
Thanks...
```` —Preceding unsigned comment added by Nk28 (talk • contribs) 06:59, 12 October 2010 (UTC)
Pseudocode is valid Python, but gets tripped up by integer division
Would anyone object to a note pointing out that the pseudocode can be run verbatim in a Python intepreter, but that in Python < 3.0 you'll need a
from __future__ import division
statement. (Can you tell I just got caught out by this? :D) Alexsdutton (talk) 15:05, 23 December 2010 (UTC)
- I think it would be inappropriate. The whole idea of pseudocode is that it's not language specific. -- RoySmith (talk) 18:52, 25 December 2010 (UTC)
- I'm torn. Perhaps the text should say it is Python 3.0? I am loathe to add that line to an otherwise beautiful looks-like-pseudocode example, but as you say, the gotcha is easy to fall into... —Ben FrantzDale (talk)
Divide by n vs n-1?
Can somebody explain why there are two different flavors of the formula for variance, one where you divide by n, the other where you divide by n-1? Obviously, for large values of n, they approach each other, but why the difference at all? -- RoySmith (talk) 18:55, 25 December 2010 (UTC)
- Wikipedia's mathematics reference desk would probably be a better place for this sort of question.
- Notice that
- The rule that
- var(X + Y) = var(X) + var(Y)
- when X and Y are independent random variables applies when you divide by n and not when you divide by n − 1.
- Dividing by n − 1 is done when one is trying to estimate the variance of a population by using a random sample. It should never be done when one has the whole population and the variance can be computed exactly. It's effect is to make the estimate unbiased. That's not really all that good a reason to do it, but that's why it's done. See Bessel's correction for an account of some of the mathematical theory.
- This is standard textbook material. Maybe more of it should be included in the article.
- Michael Hardy (talk) 19:28, 25 December 2010 (UTC)
- Thanks for the pointer to Bessel's correction, that explains this better than I've ever understood it before. And, yes, my intent in asking the question was not so much for my own knowledge (although that's part of it), but as a suggestion that something along these lines be included in this article. -- RoySmith (talk) 03:28, 28 December 2010 (UTC)
Weighted variance pseudocode correctness
The weighted variance pseudocode seems correct to me; I just did a back-of-the-envelope check (but feel free to check it more thoroughly!). I've removed the misleading comment in the pseudocode (#[WARNING] This seems wrong. It should be moved before M2 assignment). This comment doesn't seem to make much sense; one can easily show that it yields the wrong result for the trivial case in which the weights are all equal to one. Theaucitron (talk) 11:07, 21 August 2012 (UTC)
Online algorithm for covariance
Is this algorithm right? I'm not an expert, but when I compare it to the online variance computation, it's different, and if I replace y with x, I should get cov(x,x) = var(x), so could anyone have a look please? 131.227.95.222 (talk) 18:03, 27 February 2013 (UTC)
- Alright, I found a paper talking about this and now I understand that the formula is talking about co-moments and not covariances, I will add the note to the article. 131.227.95.222 (talk) 10:31, 28 February 2013 (UTC)
Merge discussion
There was a merge tag added at Standard deviation#Rapid calculation methods to this article. I think it is quite reasonable to have that section be a summary of this article. I must admit I was rather surprised I couldn't find a version of the formula in the standard deviation article, I've put a bit in its talk page about that and that that article is slow referring to variance and never links to it.
For this article as old speedup calculation method is at assumed mean. Dmcq (talk) 09:19, 17 May 2013 (UTC)
Kurtosis algorithm biased for normal distribution?
The wikipedia page for kurtosis claims that unbiased estimation of kurtosis is possible for normally distributed inputs. However, when I run the code my results are clearly biased.
[hs4 ~] 30$ ~/tmp/simple_kurt.py
Average kurtosis: -0.545012387836
[hs4 ~] 30$ ~/tmp/simple_kurt.py
Average kurtosis: -0.536851508033
[hs4 ~] 30$ ~/tmp/simple_kurt.py
Average kurtosis: -0.55578863384
[hs4 ~] 30$ ~/tmp/simple_kurt.py
Average kurtosis: -0.546190322949
Here's the code I'm running:
#!/usr/bin/python2.7
from __future__ import division
from random import gauss
def kurtosis(data):
n = 0
mean = 0
M2 = 0
M3 = 0
M4 = 0
for x in data:
n1 = n
n = n + 1
delta = x - mean
delta_n = delta / n
delta_n2 = delta_n * delta_n
term1 = delta * delta_n * n1
mean = mean + delta_n
M4 = M4 + term1 * delta_n2 * (n*n - 3*n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3
M3 = M3 + term1 * delta_n * (n - 2) - 3 * delta_n * M2
M2 = M2 + term1
kurtosis = (n*M4) / (M2*M2) - 3
return kurtosis
max_datasets = 10000
sum_kurtosis = 0
for lv in range(max_datasets):
data = [gauss(0, 1) for i in range(10)]
sum_kurtosis += kurtosis(data)
print "Average kurtosis:", sum_kurtosis/max_datasets