Talk:Anscombe's quartet

WikiProject Mathematics (Rated B-class, Mid-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
 Field: Probability and statistics
WikiProject Statistics (Rated B-class, High-importance)
This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.


Notes

After putting this together, I saw that Anscombe's Quartet is listed under Wikipedia:Requested articles/Mathematics, under Logic / Set Theory, which surprised me. Is there another Anscombe's Quartet out there? Chris24359 20:56, 12 March 2007 (UTC)

There was a mention about it at Correlation#Correlation_and_linearity; it now links back to this page. Schutz 08:30, 13 August 2007 (UTC)

Variances...

The variances of the x and y variables had been miscalculated. Whoever did the first calculation seems to have summed all (x-mean_x)^2, and then divided by 10, instead of N=11. This error seems to have been repeated elsewhere on the internet, but there are webpages (like this one [1]) which give the correct standard deviation/variance.

Could people "refix"ing the values on the page leave a comment here explaining their calculation? Erkcan (talk) 07:10, 18 April 2008 (UTC)

  • The x-variances for the 4 datasets are 10,10,10,10. The y-variances are 3.75206280991736, 3.75239008264463, 3.74783636363636, 3.74840826446281. Erkcan (talk) 07:25, 18 April 2008 (UTC)
  • There is some confusion here between population and sample variance: in the former case the denominator is n (11), in the latter n-1 (10). Which one is correct depends on whether x and y are the population or a sample. But it doesn't really matter; what is more important is that the variance and mean are the same (however calculated) for each data set. It is incorrect to refer to the mean and variance of "each x or y"; "mean of x" would be better. Also, the lines in the graphs don't intercept the y-axis at 3; I presume the origin is not at zero, which is a bit confusing. I would also ask that the other statistics from the original paper be added; this seems to be in hand, judging from the page source. Jmgibbons (talk) 13:57, 2 September 2009 (UTC)
I have rephrased the table to avoid the "each x" usage. Melcombe (talk) 17:24, 12 November 2009 (UTC)
  • Whoops, I made an anonymous edit refixing those values before I read this page. I actually ran across this as I was writing a minor report on the quartet, and the page threw me off for a while; I thought I was somehow calculating the variance wrongly. So yes, I assure you that correct statistics matter, and I would kind of like to know why people calculated them with n instead of n-1. Also, the image was generated with the n-1 variances, so there's another reason to keep them as such. 81.57.247.167 (talk) 08:09, 12 November 2009 (UTC)
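
For anyone rechecking the figures in this thread, the two conventions can be compared side by side. The sketch below is only illustrative: it assumes the data values as they are commonly reproduced from Anscombe's 1973 paper (please verify them against the table in the article), and it uses Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n-1. The names `QUARTET` and `variance_table` are made up for this example.

```python
from statistics import pvariance, variance

# Data as commonly reproduced from Anscombe (1973). Assumed here for
# illustration; verify against the table in the article before relying on it.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def variance_table(quartet):
    """Map each data set name to (x_div_n, x_div_n1, y_div_n, y_div_n1)."""
    table = {}
    for name, (xs, ys) in quartet.items():
        table[name] = (
            pvariance(xs),  # sum of squared deviations / n
            variance(xs),   # sum of squared deviations / (n - 1)
            pvariance(ys),
            variance(ys),
        )
    return table


if __name__ == "__main__":
    for name, (px, sx, py, sy) in variance_table(QUARTET).items():
        print(f"{name:>3}: x {px:.2f} (n) / {sx:.2f} (n-1), "
              f"y {py:.4f} (n) / {sy:.4f} (n-1)")
```

With these data the divide-by-n column gives 10 for x and about 3.75 for y, matching the values listed above, while the divide-by-(n-1) column gives 11 for x and roughly 4.12 to 4.13 for y.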

Delete part of a sentence

Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

replaced with

Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient.

REASON: The relation between x and E[Y|x] in this "made-up population" may or may not be linear. There is no basis to test lack of linear fit with the given "design" of x values. (There are degrees of freedom for pure error only, but NONE for lack of fit when there are only 2 distinct x values.) I think that going into these matters is beyond the scope of the page, so my proposal is just a deletion.

129.1.23.19 (talk) 20:42, 30 September 2011 (UTC)
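
The degrees-of-freedom count behind that reason can be sketched as follows. This is a minimal illustration, assuming the fourth data set's x values as commonly reproduced from Anscombe's paper and the usual split for a straight-line fit (pure error: n minus the number of distinct x levels; lack of fit: distinct x levels minus the number of fitted coefficients). The function name `residual_df_split` is hypothetical.

```python
def residual_df_split(x_values, n_params=2):
    """Split residual degrees of freedom for a fit with n_params coefficients.

    Returns (pure_error_df, lack_of_fit_df), where pure error comes from
    replicated x levels and lack of fit from the distinct levels left over
    after fitting the model.
    """
    n = len(x_values)
    levels = len(set(x_values))  # number of distinct x values
    return n - levels, levels - n_params


# x values of the fourth data set (assumed; check against the article's table).
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

if __name__ == "__main__":
    pure_error_df, lack_of_fit_df = residual_df_split(X4)
    print(pure_error_df, lack_of_fit_df)  # prints "9 0"
```

With only two distinct x levels and two coefficients in the line, nothing is left over for a lack-of-fit test, which is the point made above.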

What the heck is "d.p." in the first table?

Can anyone substitute in the longer statistical terminology? — Preceding unsigned comment added by 18.111.93.217 (talk) 14:36, 15 October 2011 (UTC)

It means "decimal places". --Rumping (talk) 00:27, 17 November 2011 (UTC)

File:Anscombe's quartet 3.svg to appear as POTD soon

Hello! This is a note to let the editors of this article know that File:Anscombe's quartet 3.svg will be appearing as picture of the day on December 11, 2011. You can view and edit the POTD blurb at Template:POTD/2011-12-11. If this article needs any attention or maintenance, it would be preferable if that could be done before its appearance on the Main Page so Wikipedia doesn't look bad. :) Thanks! howcheng {chat} 18:37, 9 December 2011 (UTC)

Picture of the day
Anscombe's quartet

Anscombe's quartet is a group of four data sets that have identical simple statistical properties, yet appear very different when graphed. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analysing it and the effect of outliers on statistical properties.

Image: Schutz
Certainly an interesting picture, but shouldn't the discrepancies cited here on this discussion page be resolved first? --173.69.135.105 (talk) 03:16, 14 December 2011 (UTC)

Regression lines

The regression lines shown (and mentioned in the body of the article) are least squares regression lines. As other forms of regression calculations can be carried out giving different results, I'm going to insert "least squares" in the first mention of the regression lines. An L1 regression, for example, minimizes the sum of the absolute values of the residuals, and has the property that the outlier in the third dataset will be effectively ignored.

Addendum: after inserting the phrase "least squares" and reviewing the article before saving it, I came to the conclusion that the wording had become overly convoluted. I'm not going to save the edited version of the article, but I am still of the opinion that it needs, in some way, to make clear that the regression lines shown are least squares lines. Floozybackloves (talk) 04:08, 11 December 2011 (UTC)
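
To make the contrast concrete, here is a rough sketch comparing a least squares line with an L1 (least absolute deviations) line on the third data set. It assumes the data values as commonly reproduced from Anscombe's paper, and the L1 fit is a brute-force search that relies on the known property that some optimal L1 line passes through two of the data points; it is meant as an illustration, not as how the article's figures were produced. The function names are made up for this example.

```python
from itertools import combinations

# Third data set, as commonly reproduced from Anscombe (1973).
# Assumed for illustration; verify against the article's table.
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def l1_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of absolute residuals.

    Brute force: try the line through every pair of points with distinct x
    and keep the one with the smallest total absolute residual.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not a function of x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - (intercept + slope * x)) for x, y in points)
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


if __name__ == "__main__":
    print("least squares:", least_squares_line(X3, Y3))
    print("L1:           ", l1_line(X3, Y3))
```

On these data the least squares line has a slope close to 0.5 and an intercept close to 3, while the L1 line follows the ten collinear-looking points with a slope of roughly 0.345 and an intercept near 4, so the outlier has essentially no influence on it.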

I think that the variance values are wrong

I was doing homework about means and variances and tried one of the datasets from Anscombe's quartet. The variances I calculated were different from the ones given on Wikipedia. At first I thought I had made a mistake and looked for it. After checking, everything seemed correct, so I started searching on the internet. Previous versions of the Wikipedia page had these numbers:

variance of x = 10

variance of y = 3.75

These numbers are the same as the results I found. Can someone check it? — Preceding unsigned comment added by 193.140.194.64 (talk) 18:57, 9 October 2012 (UTC)

The issue is N = 11 versus N-1 = 10 in the denominator of the variance computation, where N is the number of observations. If the data are a sample from a population and the mean is estimated from the sample, then using N-1 in the denominator has the desirable property that E(sample variance) = population variance. The intuition is that estimating the mean from the sample "uses up" one of the observations, so that dividing by N (instead of N less one used-up observation) would understate the spread in the data. The Wikipedia page on Bessel's correction has a very clear discussion.

The sample variances computed with N-1 in the denominator are:

sample variance of X = 11

sample variance of Y ≈ 4.12


Michaelaoash (talk) 20:19, 8 August 2013 (UTC)
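
As a worked check of the two figures for x, assuming the x values 4 through 14 used in the first three data sets (mean 9), the squared deviations are 25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25, which sum to 110. The two denominators then give:

```latex
\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2 = \frac{110}{11} = 10,
\qquad
s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2 = \frac{110}{10} = 11.
```

The same factor of 11/10 takes the y variance from about 3.75 to about 4.12.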