
Talk:Pearson correlation coefficient: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
SineBot (talk | contribs)
m Signing comment by 66.233.184.215 - "→‎Pearson Distance: "
→‎"Direction": new section
Line 153: Line 153:


::I agree. <math>1- r(x, y)</math> is not a distance. The correlation coefficient is <math> r(x, y) = \cos(\alpha)</math> with <math>\alpha</math> the angle between <math>x,y</math>. For <math>1-r(x,y)</math> to be a distance would require <math> 2 \cos(\frac{\alpha}{2}) - \cos(\alpha) \leq 1 \forall \alpha </math>, which is not true, thus <math>1-r(x,y)</math> is a semi-metric. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/66.233.184.215|66.233.184.215]] ([[User talk:66.233.184.215|talk]]) 18:22, 11 June 2012 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->

== "Direction" ==

Ok so cool you've mentioned "strength" of linear association - but what about direction? This is what the - and + signs indicate [[Special:Contributions/129.180.166.53|129.180.166.53]] ([[User talk:129.180.166.53|talk]]) 08:10, 16 June 2012 (UTC)

Revision as of 08:10, 16 June 2012

WikiProject Statistics: Unassessed
This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on Wikipedia's content assessment scale.
This article has not yet received a rating on the importance scale.
WikiProject Mathematics: Start-class, Mid-priority
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Start-class on Wikipedia's content assessment scale.
This article has been rated as Mid-priority on the project's priority scale.

Odd comments

Correlations are rarely if ever 0? Is it really that hard to find things which are completely uncorrelated? —Preceding unsigned comment added by 169.237.24.238 (talk) 19:46, 12 June 2008 (UTC)[reply]

Just as hard as getting a random number that is exactly zero. It just never happens. Jmath666 (talk) 06:27, 5 August 2008 (UTC)[reply]

Is there a particular reason that the formula in the article is linked as a separate image, rather than using Wikipedia's TeX markup? I noticed lots of experimenting in the article history. -- DrBob 21:50 May 5, 2003 (UTC)

TeX insisted on putting r = in the numerator. That was probably my fault but having spent enough time on it I decided to put in an image till I could figure it out. The image is less than ideal, though, so I will be trying again to write the formula in TeX. Jfitzg

It's easy when you know how, eh? My TeX looked something like that, but obviously not enough. Thanks.Jfitzg

I remember. I had \sum in the wrong place. I was tired. Really.
It's also the curly braces {} that tell TeX how to do the grouping, rather than showing up in the text like normal braces (). They look rather too similar in some fonts, so it can be hard to spot the difference. See the TeX markup article for more examples. -- Anon.
Thanks, anon. I finally got round to converting the others.

Would it be possible to add a label to the diagram Correlation_examples2.svg‎? The center picture has an undefined correlation coefficient which I think should be labeled in the diagram, not just the description. I found the lack of a label confusing at first. — Preceding unsigned comment added by 149.76.197.101 (talk) 23:48, 6 June 2011 (UTC)[reply]


This article could benefit from a little example with some numbers and an actual calculation of r. AxelBoldt 22:26 22 May 2003 (UTC)

Feel free.
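
A small worked calculation along the lines requested above might look like the following. The five data points are hypothetical, chosen only so that the arithmetic is easy to follow by hand; the formula is the usual centered sum-of-products form of r.

```python
from math import sqrt

# Hypothetical data, chosen only to keep the arithmetic simple.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n  # 3.0
mean_y = sum(y) / n  # 4.0

# Deviations from the means.
dx = [xi - mean_x for xi in x]  # -2, -1, 0, 1, 2
dy = [yi - mean_y for yi in y]  # -2,  0, 1, 0, 1

sum_dxdy = sum(a * b for a, b in zip(dx, dy))  # 4 + 0 + 0 + 0 + 2 = 6
sum_dx2 = sum(a * a for a in dx)               # 4 + 1 + 0 + 1 + 4 = 10
sum_dy2 = sum(b * b for b in dy)               # 4 + 0 + 1 + 0 + 1 = 6

# r = 6 / sqrt(10 * 6) = 6 / sqrt(60), roughly 0.775
r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)
print(round(r, 4))
```

The positive value reflects that y tends to increase with x in this made-up sample.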

Merge with Correlation

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
The result was Keep due to no consensus. -- -Sykko-(talk to me) 02:02, 26 September 2008 (UTC)[reply]

This topic is covered in Correlation. I move that we merge the two and have this page redirect to Correlation.

I disagree. It's better to cover the details in a separate page. —Preceding unsigned comment added by Sekarnet (talk • contribs) 08:44, 20 August 2008 (UTC)[reply]
I disagree, I think it is fine to cover the details of the specific method in a separate page, allowing the 'central' subject article to be clear for non technical people. However, the stuff about linear regression here should probably be removed. Any reason this page isn't listed under rank correlation coefficient? Hmmm... I guess it isn't a 'rank' method. --Dan|(talk) 08:02, 8 August 2007 (UTC)[reply]
I agree with the proposal, merge it with Correlation, it's actually already covered much better in there --mcld (talk) 10:27, 30 May 2008 (UTC)[reply]
I disagree. I suggest moving much of the mathematical stuff pertaining to the correlation coefficient to be under "Pearson... ", and that it should be extended with results about the sampling distribution under the joint-normal case. This would leave the "Correlation" article to give a general description and to compare with other measures of dependence. Melcombe (talk) 11:54, 30 May 2008 (UTC)[reply]
Keep I came to Wikipedia to learn more about this subject and I am glad that I was able to find this detailed info on its own. I already knew what correlation was and would have been disappointed to be redirected to that article when looking up this one. Although if it is covered better on the correlation article then perhaps some of the info there should be used to improve this article. -Sykko-(talk to me) 23:45, 12 September 2008 (UTC)[reply]
Since there was no consensus I am going to remove the tag and close the discussion now -Sykko-(talk to me) 02:02, 26 September 2008 (UTC)[reply]
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

in computer software

Why have the "in computer software" section? It doesn't appear in any other statistics article that I'm aware of. Any opinions? Pdbailey (talk) 03:01, 24 January 2008 (UTC)[reply]

Agree. Not encyclopedic. The info is straightforward to find using the help system in any half-decent software. And where would it end? Qwfp (talk) 09:41, 28 February 2008 (UTC)[reply]

Pseudocode

I moved the pseudocode over from correlation and may do so with a few other sections that are specific to Pearson correlation. I'm leaving this note here to point to an extended debate on the Talk:correlation page regarding this section. Skbkekas (talk) 02:42, 5 June 2009 (UTC)[reply]

I came to correlation and this article looking for an algorithm, found the single-pass pseudocode, and tried it. The wording in its section seems to suggest that it provides good numerical stability and only requires a single pass. While it might have good numerical stability for a single-pass implementation, I found it was not sufficiently stable (compared to, say, Excel) for a simple analysis of small series of financial returns. Considering the existing debate on the wisdom of including pseudocode at all, I suggest that someone either add two-pass pseudocode for comparison, or emphasize the limited/specific utility of the single-pass pseudocode. Expyram (talk) 15:25, 30 July 2009 (UTC)[reply]

You can see code for a stability analysis on my talk page. Given that it is nontrivial to construct an example exhibiting significant inaccuracy (and hence this almost never happens for real-world financial time series), it seems plausible that Expyram wrote a bug. I would suggest benchmarking against the worked examples. Brianboonstra (talk) 21:46, 3 August 2009 (UTC)[reply]
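
For anyone wanting a baseline to benchmark against, a two-pass version of the kind suggested above could be sketched as follows. This is not the article's pseudocode: the function name `pearson_two_pass` and the sample data are made up for illustration. The first pass computes the means and the second accumulates centered sums.

```python
from math import sqrt

def pearson_two_pass(xs, ys):
    """Two-pass Pearson r: means first, then centered sums.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")

    # Pass 1: the means.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Pass 2: sums of squared deviations and of cross products.
    sxx = syy = sxy = 0.0
    for x, y in zip(xs, ys):
        dx = x - mean_x
        dy = y - mean_y
        sxx += dx * dx
        syy += dy * dy
        sxy += dx * dy

    return sxy / sqrt(sxx * syy)

# Hypothetical data; adding a large constant offset to x should not change r.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
print(pearson_two_pass(x, y))
print(pearson_two_pass([1e9 + v for v in x], y))
```

Comparing the output of a single-pass implementation against this on the same data is one way to tell a genuine stability problem from a bug.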

Section merger proposal

I have commented on the correlation article talk page about why I disagree with the section merger proposal for the "sensitivity to the data distribution" section. Skbkekas (talk) 03:58, 2 November 2009 (UTC)[reply]

So have I. JamesBWatson (talk) 12:13, 2 November 2009 (UTC)[reply]

Question from IP

The derivation that the coefficient of determination is the square of the correlation coefficient depends on the existence of a "constant" term in the regression, i.e. a model like y = mx + b. To see this break down, consider the data points (-1,0) and (0,1). If you run a linear model of the form y = mx, you will have a negative r^2 value [computing as 1 - RSS / TSS] —Preceding unsigned comment added by 66.227.30.35 (talk) 21:02, 3 February 2010 (UTC)[reply]

I don't understand. R-squared is only valid if there is a constant in the regression. 018 (talk) 22:44, 3 February 2010 (UTC)[reply]
There is no claim that the account given is valid for anything other than the model y = mx + b, so the fact that formulas do not apply to other models is not surprising. JamesBWatson (talk) 09:54, 5 February 2010 (UTC)[reply]
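
The two-point example from the question can be checked numerically. The sketch below fits both the no-intercept model y = mx and the model y = mx + b by least squares and computes 1 - RSS/TSS for each; the variable names are chosen here for illustration.

```python
# The two data points from the question: (-1, 0) and (0, 1).
xs = [-1.0, 0.0]
ys = [0.0, 1.0]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Total sum of squares about the mean of y.
tss = sum((y - mean_y) ** 2 for y in ys)  # 0.25 + 0.25 = 0.5

# Model without a constant term, y = m*x: least-squares slope through the origin.
m0 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)  # 0 / 1 = 0
rss0 = sum((y - m0 * x) ** 2 for x, y in zip(xs, ys))             # 0 + 1 = 1
r2_no_intercept = 1.0 - rss0 / tss                                 # 1 - 2 = -1

# Model with a constant term, y = m*x + b.
m1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))                        # 0.5 / 0.5 = 1
b1 = mean_y - m1 * mean_x                                          # 0.5 + 0.5 = 1
rss1 = sum((y - (m1 * x + b1)) ** 2 for x, y in zip(xs, ys))       # 0
r2_with_intercept = 1.0 - rss1 / tss                               # 1

print(r2_no_intercept, r2_with_intercept)
```

With the constant term the line passes through both points and 1 - RSS/TSS equals r squared; without it the quantity can fall below zero, as the question says.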

single pass code

The section titled, "Computing correlation accurately in a single pass" could probably use more text. Questions to be answered: what is sweep? what is mean_x ? what is delta_x? what is the formula that shows it works? I also wonder, why not show how to do it in three passes first and then show the single pass code? 018 (talk) 00:14, 19 April 2010 (UTC)[reply]
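
As a rough answer to the questions above, here is a sketch of a single-pass computation, assuming the names carry their usual running-update meaning: `mean_x` is the mean of the points seen so far, `delta_x` is the deviation of the new point from that running mean, and `sweep` = (i - 1)/i is the weight in the update identity S_i = S_(i-1) + delta^2 * (i - 1)/i for the sum of squared deviations. The function name is made up, and this may differ in detail from the article's pseudocode.

```python
from math import sqrt

def pearson_single_pass(xs, ys):
    """Pearson r with running means and running centered sums (one loop)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")

    mean_x, mean_y = xs[0], ys[0]          # running means after one point
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0

    for i in range(2, n + 1):              # i = number of points seen so far
        sweep = (i - 1.0) / i
        delta_x = xs[i - 1] - mean_x       # deviation from the previous mean
        delta_y = ys[i - 1] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / i              # update the running means
        mean_y += delta_y / i

    # The common factor 1/n in covariance and variances cancels in the ratio.
    return sum_coproduct / sqrt(sum_sq_x * sum_sq_y)

# Hypothetical data for a quick check against a hand calculation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
print(pearson_single_pass(x, y))
```

Because each update works with deviations from the current mean rather than raw sums of squares, the accumulated quantities stay on the scale of the spread of the data rather than its absolute level.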

Something doesn't seem right...

... it looks to me like you are using the formula for r as the population correlation coefficient, not the sample coefficient. According to my textbook, the formula is:

What am I missing here? - 114.76.235.170 (talk) 14:51, 12 August 2010 (UTC)[reply]

Note: have updated the sum of notations. - 114.76.235.170 (talk) 08:19, 15 August 2010 (UTC)[reply]
A good first step might be to write them in the same notation. You will also have to rearrange the sums and know that (sum of x) = (average of x) times n. I'm also not sure if you are commenting on the article or asking for help with the algebra. If you want help, the math reference desk is the best place on Wikipedia (a teacher's office hours might be the best if you are taking a class). WP:RD/MA 018 (talk) 15:59, 12 August 2010 (UTC)[reply]
Not taking a class, was commenting here because I was generally curious about the formula. My understanding is that the coefficient is the sum of the product of the z-scores for the dependent and independent variables, over the number of elements in the sample. However, z-scores use the standard deviation - but there are two formulas for the standard deviation, one using Bessel's correction for the unbiased estimator (sample deviation) and the other doesn't (population deviation). I was wondering where this comes into play, if at all. - 114.76.235.170 (talk) 03:05, 14 August 2010 (UTC)[reply]
I see, is there an n versus (n-1) issue? No. All correlations have range [-1,1] and so an adjustment like that would necessarily move the range. Another way you can think of it is that the correction term would cancel out because it would appear in the numerator and the denominator. Similarly, I think your textbook's version uses a trick that would have been useful in the 1970s and 1980s that multiplies the numerator and denominator by n to make the computation easier. The older statistical methods are often written down in difficult to understand notations like this to "help the reader" because they were easier to compute/used fewer resources on the machines of the time. Now they are just anachronistic. 018 (talk) 16:36, 14 August 2010 (UTC)[reply]
Ah, I see... I think :-) - 114.76.235.170 (talk) 08:19, 15 August 2010 (UTC)[reply]
OK, so the formula here is:
Now if you multiply the numerator and the denominator by n then this gives you:
Then you expand the numerator:
Which is the same as:
Which is:
This cancels a term...
Leading to the simplified numerator...
So expanding the denominator:
Now <math>\sum_i x_i</math> is <math>n\bar{x}</math> and <math>\sum_i y_i</math> is similarly <math>n\bar{y}</math>, therefore the denominator can be rewritten further:
114.76.235.170 (talk) 09:14, 15 August 2010 (UTC)[reply]
Nope, stuck. Can't work out how they get from here:
to here:
Any ideas? - 114.76.235.170 (talk) 12:54, 15 August 2010 (UTC)[reply]
Hi, this discussion is better suited for WP:RD/MA. However, I will tell you that you need to keep track of your subscripts: when you use the subscript i in the outer and inner sum, you can easily get the wrong answers. so if you have

This mistake you make and then recover from in the numerator. In the denominator, you use , this is how addition works. 018 (talk) 16:42, 15 August 2010 (UTC)[reply]
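
The formulas in this thread have not survived, so the following is only a sketch of the algebra under an assumption: that the textbook formula is the common "computational" form with raw sums, and that the article's formula is the centered form. The index i runs over the n observations throughout, and the key fact is the one mentioned above, that the sum of the x values equals n times their mean.

```latex
\begin{align*}
r &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}
          {\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}
   && \text{(centered form)} \\[4pt]
\sum_i (x_i-\bar{x})(y_i-\bar{y})
  &= \sum_i x_i y_i - \bar{y}\sum_i x_i - \bar{x}\sum_i y_i + n\bar{x}\bar{y}
   = \sum_i x_i y_i - n\bar{x}\bar{y}
   && \text{since } \textstyle\sum_i x_i = n\bar{x},\ \sum_i y_i = n\bar{y} \\[4pt]
\sum_i (x_i-\bar{x})^2
  &= \sum_i x_i^2 - 2\bar{x}\sum_i x_i + n\bar{x}^2
   = \sum_i x_i^2 - n\bar{x}^2 \\[4pt]
n\sum_i (x_i-\bar{x})(y_i-\bar{y})
  &= n\sum_i x_i y_i - \Bigl(\sum_i x_i\Bigr)\Bigl(\sum_i y_i\Bigr),
  \qquad
  n\sum_i (x_i-\bar{x})^2 = n\sum_i x_i^2 - \Bigl(\sum_i x_i\Bigr)^2 \\[4pt]
r &= \frac{n\sum_i x_i y_i - \bigl(\sum_i x_i\bigr)\bigl(\sum_i y_i\bigr)}
          {\sqrt{n\sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2}\,
           \sqrt{n\sum_i y_i^2 - \bigl(\sum_i y_i\bigr)^2}}
   && \text{(computational form)}
\end{align*}
```

The last line follows from multiplying the numerator and the denominator of the centered form by n, writing the factor n in the denominator as the product of two square roots of n, one for each radical.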

reflective correlation

It would be helpful if there were references for the "reflective correlation" section. I have this formula in some code and I'm trying to track down its origin, but the phrase "reflective correlation" does not seem to be common when searching on the *net, except in links back to this page. 75.194.255.109 (talk) 20:14, 9 September 2010 (UTC)[reply]

You could try searching for "uncentered correlation": this does bring up some uses of these formulae. And this may be a better name to use for the idea here. Melcombe (talk) 08:48, 10 September 2010 (UTC)[reply]
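
For readers trying to match code against the term, here is a minimal sketch of what "uncentered correlation" usually denotes: the same ratio as Pearson's r but without subtracting the means. Whether this is exactly the formula in the article's "reflective correlation" section is an assumption, and the function name and data are made up.

```python
from math import sqrt

def uncentered_correlation(xs, ys):
    """Sum of products over the root of the product of sums of squares.

    No means are subtracted, so the value depends on the location of the data.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: sums are 66, 55 and 86, so the value is 66 / sqrt(4730).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
print(uncentered_correlation(x, y))
```

On these numbers the result is noticeably higher than the centered Pearson r for the same data, which is one quick way to tell which of the two formulas a piece of code implements.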

Pearson's Correlation Coefficient is Biased

I believe the statement that Pearson's Correlation Coefficient is "asymptotically unbiased" is incorrect, or at least, non-ideal phrasing. The estimator is biased even for bivariate normal data. In the normal case, Fisher showed an approximate solution for the expectation of the sample estimate of correlation is E[r] = ρ − ρ(1 − ρ²)/(2n). Check out the article http://www.uv.es/revispsi/articulos1.03/9.ZUMBO.pdf. This bias goes to zero as n goes to infinity, so perhaps the term "consistent" was meant by "asymptotically unbiased". MrYdobon (talk) 13:47, 24 October 2010 (UTC)[reply]

"Asymptotically unbiased" means that the bias goes to zero as n goes to infinity. Consistency is something else entirely. Skbkekas (talk) 17:05, 24 October 2010 (UTC)[reply]
No it isn't, it's closely related: An estimator that is asymptotically unbiased is consistent if its variance decreases to zero as the sample size tends to infinity, which in practice it nearly always does. See Talk:Consistent estimator#related to other concepts Qwfp (talk) 20:11, 24 October 2010 (UTC)[reply]
Being "asymptotically unbiased" means that (E[r_n] − ρ)/s_n goes to zero, where s_n is the standard deviation of r_n -- I left out the s_n term before. By linearization, it's easy to see that the variance of r_n decays like 1/n, thus s_n is proportional to n^(−1/2), so "asymptotically unbiased" means that n^(1/2)(E[r_n] − ρ) goes to zero. Also by linearization, it's easy to see that n^b(E[r_n] − ρ) goes to zero, for b<1. Thus the correlation coefficient is asymptotically unbiased. The only condition needed here is that for the linearization arguments to go through, you need some finite moments, four of them I think. Skbkekas (talk) 22:24, 24 October 2010 (UTC)[reply]
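
The size of the small-sample bias discussed above can be illustrated with a quick simulation. This is only a sketch: the choices ρ = 0.5, n = 10, 20000 replications and the seed are arbitrary, the helper names are made up, and the comparison value is the approximation quoted above rather than an exact result.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain two-pass Pearson r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

def mean_sample_r(rho, n, reps, seed=0):
    """Average r over many bivariate normal samples with true correlation rho."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            xs.append(z1)
            # Mixing z1 and z2 this way gives corr(x, y) = rho.
            ys.append(rho * z1 + sqrt(1.0 - rho * rho) * z2)
        total += pearson(xs, ys)
    return total / reps

rho, n = 0.5, 10
estimate = mean_sample_r(rho, n, reps=20000)
approx = rho - rho * (1.0 - rho ** 2) / (2 * n)  # approximation quoted above
print(estimate, approx)
```

Rerunning with larger n should show both numbers moving toward ρ, which is the behaviour both sides of the discussion agree on.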

Correlation coefficient and linear transformations

In the section Removing correlation it states "It is always possible to remove the correlation between random variables with a linear transformation" however, in Mathematical properties it is stated that the PPMCC "is invariant to changes in location and scale. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants, without changing the correlation coefficient". These two statements seem contradictory to me, in that we're saying simultaneously that for any linear transformation the correlation coefficient remains unchanged, but there still exists a particular linear transformation which removes correlation. The only way I can make sense of this is if the linear transformation results in a dataset with no correlation, but with a non-zero Pearson product-moment correlation coefficient. - Clarification in the text of this apparent discrepancy would be appreciated. -- 140.142.20.229 (talk) 23:01, 15 December 2010 (UTC)[reply]

A clarification has been attempted. I would prefer not to have the Mathematical properties section expanded beyond this as it would be off-topic for that section, but something more might be added to Removing correlation if necessary. But remember this is not a text-book. Melcombe (talk) 09:56, 16 December 2010 (UTC)[reply]
I don't know what clarification was attempted, but I came here to comment on the exact same issue as 140.142.20.229: the two statements still appear contradictory (at least to me).
My understanding is admittedly naive, but if the geometric interpretation of the correlation coefficient is the cosine of an angle between the vectors corresponding to each set of centered data, how could a full-blown linear transformation ever hope to preserve the correlation coefficient in general? Yes, translations are fine, but surely a change in slope would change the coefficient? --Saforrest (talk) 08:04, 3 February 2011 (UTC)[reply]
My understanding is that the correlation is being removed by mixing X and Y. It's a linear transformation on the vector, not just on the individual variables.
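
The distinction drawn in the last reply can be seen numerically. In the sketch below (hypothetical data and names), rescaling X and Y separately with positive scale factors leaves r unchanged, while a transformation that mixes the two variables, here subtracting from Y its least-squares multiple of X, produces a new variable that is uncorrelated with X.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain two-pass Pearson r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r_xy = pearson(x, y)

# Separate changes of location and scale (positive scale factors here;
# a negative factor on one variable would flip the sign of r).
x_scaled = [3.0 + 2.0 * v for v in x]
y_scaled = [-1.0 + 0.5 * v for v in y]
r_scaled = pearson(x_scaled, y_scaled)

# A linear transformation of the pair (X, Y) that mixes the variables:
# Y' = Y - slope * X, with slope = cov(X, Y) / var(X).
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
y_mixed = [b - slope * a for a, b in zip(x, y)]
r_mixed = pearson(x, y_mixed)

print(r_xy, r_scaled, r_mixed)
```

So the two statements in the article concern different kinds of transformation: one acts on each variable on its own, the other acts on the vector (X, Y).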

Correct me if I'm wrong

I'm not extremely familiar with the Pearson Correlation Coefficient, but I needed to implement it as a function for a program I was making, so I implemented this equation from the article:

This seemed pretty straightforward and was easily implemented. I tested it on the dataset x = {1,2,3,4,5}, y = {1,2,3,4,5}, on which it returned r = 1.25. I was surprised, since I thought correlations went from -1 to 1, so I calculated r by hand using the above formula, thinking I had made a mistake in my code, and received r = 1.25. After this I adjusted the formula to be the average of the products of the standard scores:

Upon which I received the expected r = 1 both by hand and from the program. Can someone confirm whether this is the correct formula? --Guruthegreat0 (talk) 00:02, 15 April 2011 (UTC)[reply]

The correlation is a ratio involving a covariance and two variances. You can estimate these quantities using the unbiased estimate (normalizing by N-1), or using the maximum likelihood estimate (normalizing by N), see Sample mean and sample covariance. When calculating the correlation coefficient, it doesn't matter which normalization you use, but you should use the same normalization for all three quantities. You are getting r=1.25 because you are using the N-1 normalization for the covariance but the N normalizations for the variances. Skbkekas (talk) 14:27, 15 April 2011 (UTC)[reply]
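
The explanation above can be reproduced on the dataset from the question. The sketch below (variable names chosen for illustration) computes r three ways: with the N-1 covariance divided by N-normalized variances, and with consistent N and consistent N-1 normalizations.

```python
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # 10
sxx = sum((a - mean_x) ** 2 for a in x)                       # 10
syy = sum((b - mean_y) ** 2 for b in y)                       # 10

# Mixed normalization: covariance uses N-1, variances use N.
r_mixed = (sxy / (n - 1)) / sqrt((sxx / n) * (syy / n))        # 2.5 / 2 = 1.25

# Consistent normalizations: the factor cancels, so both give r = 1.
r_all_n = (sxy / n) / sqrt((sxx / n) * (syy / n))
r_all_n_minus_1 = (sxy / (n - 1)) / sqrt((sxx / (n - 1)) * (syy / (n - 1)))

print(r_mixed, r_all_n, r_all_n_minus_1)
```

The mixed version is off by exactly the factor N/(N-1) = 5/4, which is where the 1.25 comes from.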

Pearson Distance

Is the Pearson Distance really a distance? In the metric sense. Symmetry, positivity and the null property is obvious but does the triangle inequality for sure hold? What we have is that: 1-r(1,2) \leq 2-[r(1,3)+r(2,3)]. That is r(1,2)\geq r(1,3)+r(2,3)-1. (Let the random variables 1,2,3 be just some vectors in R^n, nothing fancy). I don't know how to see this inequality. And by the way, I also don't know how to imagine that two random variables are the same in the general sense (their distributions are the same? Up to some moments?). --78.104.124.124 (talk) 00:16, 11 March 2012 (UTC)[reply]

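I agree. <math>1 - r(x, y)</math> is not a distance. The correlation coefficient is <math>r(x, y) = \cos(\alpha)</math> with <math>\alpha</math> the angle between <math>x, y</math>. For <math>1 - r(x, y)</math> to be a distance would require <math>2 \cos(\frac{\alpha}{2}) - \cos(\alpha) \leq 1 \; \forall \alpha</math>, which is not true, thus <math>1 - r(x, y)</math> is a semi-metric. — Preceding unsigned comment added by 66.233.184.215 (talk) 18:22, 11 June 2012 (UTC)[reply]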
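
A concrete counterexample to the triangle inequality can be built from three vectors in R^3, following the angle argument above. The vectors below are chosen for illustration: after centering, x and z are at a right angle (r = 0) and y bisects that angle (r = cos 45° with each), so the direct distance from x to z exceeds the sum of the two shorter legs.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain two-pass Pearson r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

def pearson_distance(a, b):
    return 1.0 - pearson(a, b)

# Two unit vectors that already have mean zero and are orthogonal.
u = [v / sqrt(2.0) for v in (1.0, -1.0, 0.0)]
w = [v / sqrt(6.0) for v in (1.0, 1.0, -2.0)]

x = u
z = w
y = [a + b for a, b in zip(u, w)]  # bisects the right angle between x and z

d_xz = pearson_distance(x, z)  # 1 - cos(90 degrees) = 1
d_xy = pearson_distance(x, y)  # 1 - cos(45 degrees), about 0.293
d_yz = pearson_distance(y, z)  # 1 - cos(45 degrees), about 0.293

# The triangle inequality would need d_xz <= d_xy + d_yz.
print(d_xz, d_xy + d_yz)
```

This is the case α = 90° of the condition stated above: 2 cos(45°) − cos(90°) = √2, which is greater than 1.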

"Direction"

Ok so cool you've mentioned "strength" of linear association - but what about direction? This is what the - and + signs indicate 129.180.166.53 (talk) 08:10, 16 June 2012 (UTC)[reply]