Talk:Pearson correlation coefficient: Difference between revisions

From Wikipedia, the free encyclopedia
Illia Connell (talk | contribs)
stats rating
Jfgrcar (talk | contribs)
added topic

Revision as of 19:34, 9 October 2013

WikiProject Statistics: this article is within the scope of WikiProject Statistics and has been rated Start-class on the content assessment scale and Top-importance on the importance scale.
WikiProject Mathematics: this article is within the scope of WikiProject Mathematics and has been rated Start-class on the content assessment scale and Mid-priority on the priority scale.

Needs Basic Reference

If Pearson really invented this, then there should be a citation to a specific paper in which Pearson invented it. What is the citation? Jfgrcar (talk) 19:34, 9 October 2013 (UTC)

Odd comments

Correlations are rarely if ever 0? Is it really that hard to find things which are completely uncorrelated? —Preceding unsigned comment added by 169.237.24.238 (talk) 19:46, 12 June 2008 (UTC)[reply]

Just as hard as getting a random number that is exactly zero. It just never happens. Jmath666 (talk) 06:27, 5 August 2008 (UTC)[reply]

Is there a particular reason that the formula in the article is linked as a separate image, rather than using Wikipedia's TeX markup? I noticed lots of experimenting in the article history. -- DrBob 21:50 May 5, 2003 (UTC)

TeX insisted on putting r = in the numerator. That was probably my fault but having spent enough time on it I decided to put in an image till I could figure it out. The image is less than ideal, though, so I will be trying again to write the formula in TeX. Jfitzg

It's easy when you know how, eh? My TeX looked something like that, but obviously not enough. Thanks.Jfitzg

I remember. I had \sum in the wrong place. I was tired. Really.
It's also the curly braces {} that tell TeX how to do the grouping, rather than showing up in the text like normal braces (). They look rather too similar in some fonts, so it can be hard to spot the difference. See the TeX markup article for more examples. -- Anon.
Thanks, anon. I finally got round to converting the others.

Would it be possible to add a label to the diagram Correlation_examples2.svg‎? The center picture has an undefined correlation coefficient which I think should be labeled in the diagram, not just the description. I found the lack of a label confusing at first. — Preceding unsigned comment added by 149.76.197.101 (talk) 23:48, 6 June 2011 (UTC)[reply]


This article could benefit from a little example with some numbers and an actual calculation of r. AxelBoldt 22:26 22 May 2003 (UTC)

Feel free.

Merge with Correlation

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
The result was Keep due to no consensus. -- -Sykko-(talk to me) 02:02, 26 September 2008 (UTC)[reply]

This topic is covered in Correlation. I move that we merge the two and have this page redirect to Correlation.

I disagree. It's better to cover the details in a separate page. -- Preceding unsigned comment added by Sekarnet (talk • contribs) 08:44, 20 August 2008 (UTC)
I disagree, I think it is fine to cover the details of the specific method in a separate page, allowing the 'central' subject article to be clear for non-technical people. However, the stuff about linear regression here should probably be removed. Any reason this page isn't listed under rank correlation coefficient? Hmmm... I guess it isn't a 'rank' method. --Dan|(talk) 08:02, 8 August 2007 (UTC)
I agree with the proposal, merge it with Correlation, it's actually already covered much better in there --mcld (talk) 10:27, 30 May 2008 (UTC)
I disagree. I suggest moving much of the mathematical stuff pertaining to the correlation coefficient to be under "Pearson... ", and that it should be extended with results about the sampling distribution under the joint-normal case. This would leave the "Correlation" article to give a general description and to compare with other measures of dependence. Melcombe (talk) 11:54, 30 May 2008 (UTC)
Keep I came to Wikipedia to learn more about this subject and I am glad that I was able to find this detailed info on its own. I already knew what correlation was and would have been disappointed to be redirected to that article when looking up this one. Although if it is covered better on the correlation article then perhaps some of the info there should be used to improve this article. -Sykko-(talk to me) 23:45, 12 September 2008 (UTC)
Since there was no consensus I am going to remove the tag and close the discussion now -Sykko-(talk to me) 02:02, 26 September 2008 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

in computer software

Why have the "in computer software" section? It doesn't appear in any other statistics article that I'm aware of. Any opinions? Pdbailey (talk) 03:01, 24 January 2008 (UTC)

Agree. Not encyclopedic. The info is straightforward to find using the help system in any half-decent software. And where would it end? Qwfp (talk) 09:41, 28 February 2008 (UTC)[reply]

Pseudocode

I moved the pseudocode over from correlation and may do so with a few other sections that are specific to Pearson correlation. I'm leaving this note here to point to an extended debate on the Talk:correlation page regarding this section. Skbkekas (talk) 02:42, 5 June 2009 (UTC)[reply]

I came to correlation and this article looking for an algorithm, found the single-pass pseudocode, and tried it. The wording in its section seems to suggest that it provides good numerical stability and only requires a single pass. While it might have good numerical stability for a single-pass implementation, I found it was not sufficiently stable (compared to, say, Excel) for a simple analysis of small series of financial returns. Considering the existing debate on the wisdom of including pseudocode at all, I suggest that someone either add two-pass pseudocode for comparison, or emphasize the limited/specific utility of the single-pass pseudocode. Expyram (talk) 15:25, 30 July 2009 (UTC)

You can see code for a stability analysis on my talk page. Given that it is nontrivial to construct an example exhibiting significant inaccuracy (and hence this almost never happens for real-world financial time series), it seems plausible that Expyram wrote a bug. I would suggest benchmarking against the worked examples. Brianboonstra (talk) 21:46, 3 August 2009 (UTC)[reply]

Section merger proposal

I have commented on the correlation article talk page about why I disagree with the section merger proposal for the "sensitivity to the data distribution" section. Skbkekas (talk) 03:58, 2 November 2009 (UTC)[reply]

So have I. JamesBWatson (talk) 12:13, 2 November 2009 (UTC)[reply]

Question from IP

The derivation that the coefficient of determination is the square of the correlation coefficient depends on the existence of a "constant" term in the regression, i.e. a model like y = mx + b. To see this break down, consider the data points (-1,0) and (0,1). If you run a linear model of the form y = mx, you will have a negative r^2 value [computing as 1 - RSS / TSS] —Preceding unsigned comment added by 66.227.30.35 (talk) 21:02, 3 February 2010 (UTC)[reply]

I don't understand. R-squared is only valid if there is a constant in the regression. 018 (talk) 22:44, 3 February 2010 (UTC)[reply]
There is no claim that the account given is valid for anything other than the model y = mx + b, so the fact that formulas do not apply to other models is not surprising. JamesBWatson (talk) 09:54, 5 February 2010 (UTC)[reply]
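The IP's counterexample can be checked numerically. The following is a minimal sketch, assuming the no-intercept least-squares slope m = Σxy / Σx² and R² computed as 1 - RSS/TSS as in the comment; the function name is hypothetical.

```python
def r_squared_through_origin(xs, ys):
    """R^2 = 1 - RSS/TSS for the no-intercept model y = m*x."""
    # Least-squares slope when the line is forced through the origin.
    m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    mean_y = sum(ys) / len(ys)
    rss = sum((y - m * x) ** 2 for x, y in zip(xs, ys))  # residual sum of squares
    tss = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares about the mean
    return 1 - rss / tss


# The two data points from the comment: (-1, 0) and (0, 1).
r2 = r_squared_through_origin([-1, 0], [0, 1])
print(r2)
```

Here the fitted slope is 0, so RSS exceeds TSS and the quantity is negative, which is why it cannot be the square of a correlation coefficient for this model.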

single pass code

The section titled, "Computing correlation accurately in a single pass" could probably use more text. Questions to be answered: what is sweep? what is mean_x ? what is delta_x? what is the formula that shows it works? I also wonder, why not show how to do it in three passes first and then show the single pass code? 018 (talk) 00:14, 19 April 2010 (UTC)[reply]
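As a rough answer to these questions, here is a sketch of a Welford-style single-pass update next to a plain two-pass version. The names sweep, mean_x and delta_x follow the question; the function names and test data are assumptions, and this is not necessarily identical to the article's pseudocode. The identity behind it is M_i = M_(i-1) + (x_i - mean_(i-1))^2 * (i-1)/i, where M_i is the running sum of squared deviations after i points, and similarly for the co-product.

```python
import math


def pearson_single_pass(xs, ys):
    """Single-pass correlation using running means (Welford-style update)."""
    mean_x, mean_y = xs[0], ys[0]
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    for i in range(2, len(xs) + 1):
        sweep = (i - 1) / i            # weight (i-1)/i from the update identity
        delta_x = xs[i - 1] - mean_x   # deviation from the mean of the first i-1 points
        delta_y = ys[i - 1] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / i          # update running means to include point i
        mean_y += delta_y / i
    return sum_coproduct / math.sqrt(sum_sq_x * sum_sq_y)


def pearson_two_pass(xs, ys):
    """Two-pass correlation: compute the means first, then the centered sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
r_single = pearson_single_pass(xs, ys)
r_two = pearson_two_pass(xs, ys)
print(r_single, r_two)
```

Comparing the two functions on the same data is one simple way to benchmark an implementation before blaming the algorithm for an inaccuracy.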

Something doesn't seem right...

... it looks to me like you are using the formula for r as the population correlation coefficient, not the sample coefficient. According to my textbook, the formula is:

What am I missing here? - 114.76.235.170 (talk) 14:51, 12 August 2010 (UTC)[reply]

Note: have updated the sum of notations. - 114.76.235.170 (talk) 08:19, 15 August 2010 (UTC)[reply]
A good first step might be to write them in the same notation. You will also have to rearrange the sums and know that (sum of x) = (average of x) times n. I'm also not sure if you are commenting on the article or asking for help with the algebra. If you want help, the math reference desk is the best place on Wikipedia (a teacher's office hours might be the best if you are taking a class). WP:RD/MA 018 (talk) 15:59, 12 August 2010 (UTC)[reply]
Not taking a class, was commenting here because I was generally curious about the formula. My understanding is that the coefficient is the sum of the product of the z-scores for the dependent and independent variables, over the number of elements in the sample. However, z-scores use the standard deviation - but there are two formulas for the standard deviation, one using Bessel's correction for the unbiased estimator (sample deviation) and the other doesn't (population deviation). I was wondering where this comes into play, if at all. - 114.76.235.170 (talk) 03:05, 14 August 2010 (UTC)[reply]
I see, is there an n versus (n-1) issue? No. All correlations have range [-1,1] and so an adjustment like that would necessarily move the range. Another way you can think of it is that the correction term would cancel out because it would appear in the numerator and the denominator. Similarly, I think your textbook's version uses a trick that would have been useful in the 1970s and 1980s that multiplies the numerator and denominator by n to make the computation easier. The older statistical methods are often written down in difficult-to-understand notations like this to "help the reader" because they were easier to compute or used fewer resources on the machines of the time. Now they are just anachronistic. 018 (talk) 16:36, 14 August 2010 (UTC)
Ah, I see... I think :-) - 114.76.235.170 (talk) 08:19, 15 August 2010 (UTC)[reply]
OK, so the formula here is:
Now if you multiply the numerator and the denominator by n then this gives you:
Then you expand the numerator:
Which is the same as:
Which is:
This cancels a term...
Leading to the simplified numerator...
So expanding the denominator:
Now is and is similarly , therefore the denominator can be rewritten further:
114.76.235.170 (talk) 09:14, 15 August 2010 (UTC)[reply]
Nope, stuck. Can't work out how they get from here:
to here:
Any ideas? - 114.76.235.170 (talk) 12:54, 15 August 2010 (UTC)[reply]
Hi, this discussion is better suited for WP:RD/MA. However, I will tell you that you need to keep track of your subscripts: when you use the subscript i in the outer and inner sum, you can easily get the wrong answers. so if you have

This mistake you make and then recover from in the numerator. In the denominator, you use , this is how addition works. 018 (talk) 16:42, 15 August 2010 (UTC)[reply]
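The formulas in this thread were not preserved, so the following is only a sketch of the standard algebra, assuming the textbook formula is the usual "computational" form with n multiplied through. The inner index j is kept distinct from the outer index i to avoid the subscript clash mentioned above.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad \bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j \\
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
  &= \sum_i x_i y_i - \bar{y}\sum_i x_i - \bar{x}\sum_i y_i + n\bar{x}\bar{y}
   = \sum_i x_i y_i - \frac{1}{n}\Big(\sum_i x_i\Big)\Big(\sum_i y_i\Big) \\
\sum_{i=1}^{n}(x_i-\bar{x})^2
  &= \sum_i x_i^2 - \frac{1}{n}\Big(\sum_i x_i\Big)^2 \\
r &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}
          {\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}
   = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}
          {\sqrt{n\sum_i x_i^2 - \big(\sum_i x_i\big)^2}\,
           \sqrt{n\sum_i y_i^2 - \big(\sum_i y_i\big)^2}}
\end{align*}
```

The last equality multiplies numerator and denominator by n, with the n in the denominator split as one factor of the square root of n under each root.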

reflective correlation

It would be helpful if there were references for the "reflective correlation" section. I have this formula in some code and I'm trying to track down its origin, but the phrase "reflective correlation" does not seem to be common when searching on the *net, except in links back to this page. 75.194.255.109 (talk) 20:14, 9 September 2010 (UTC)[reply]

You could try searching for "uncentered correlation": this does bring up some uses of these formulae. And this may be a better name to use for the idea here. Melcombe (talk) 08:48, 10 September 2010 (UTC)[reply]

Pearson's Correlation Coefficient is Biased

I believe the statement that Pearson's Correlation Coefficient is "asymptotically unbiased" is incorrect, or at least, non-ideal phrasing. The estimator is biased even for bivariate normal data. In the normal case, Fisher showed that an approximate solution for the expectation of the sample estimate of correlation is $E[r] = \rho - \rho(1-\rho^2)/(2n)$. Check out the article http://www.uv.es/revispsi/articulos1.03/9.ZUMBO.pdf. This bias goes to zero as n goes to infinity, so perhaps the term "consistent" was meant by "asymptotically unbiased". MrYdobon (talk) 13:47, 24 October 2010 (UTC)

"Asymptotically unbiased" means that the bias goes to zero as n goes to infinity. Consistency is something else entirely. Skbkekas (talk) 17:05, 24 October 2010 (UTC)
No it isn't, it's closely related: An estimator that is asymptotically unbiased is consistent if its variance decreases to zero as the sample size tends to infinity, which in practice it nearly always does. See Talk:Consistent estimator#related to other concepts Qwfp (talk) 20:11, 24 October 2010 (UTC)
Being "asymptotically unbiased" means that $(E[r_n] - \rho)/s_n$ goes to zero, where $s_n$ is the standard deviation of $r_n$ -- I left out the $s_n$ term before. By linearization, it's easy to see that the variance of $r_n$ decays like $1/n$, thus $s_n$ is proportional to $n^{-1/2}$, so "asymptotically unbiased" means that $n^{1/2}(E[r_n] - \rho)$ goes to zero. Also by linearization, it's easy to see that $n^b(E[r_n]-\rho)$ goes to zero, for $b<1$. Thus the correlation coefficient is asymptotically unbiased. The only condition needed here is that for the linearization arguments to go through, you need some finite moments, four of them I think. Skbkekas (talk) 22:24, 24 October 2010 (UTC)
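To illustrate the scaling argument, here is a small sketch that takes Fisher's approximation quoted above at face value (bias of about -ρ(1-ρ²)/(2n), an approximation for bivariate normal data) and multiplies it by the square root of n. The function name and the choice ρ = 0.5 are assumptions for illustration.

```python
import math


def fisher_bias(rho, n):
    """Approximate bias E[r] - rho = -rho * (1 - rho^2) / (2n) quoted in the thread."""
    return -rho * (1 - rho ** 2) / (2 * n)


rho = 0.5
sample_sizes = (10, 100, 1000, 10000)
# Bias rescaled by sqrt(n), i.e. relative to the order of the standard deviation.
scaled = [math.sqrt(n) * fisher_bias(rho, n) for n in sample_sizes]
for n, value in zip(sample_sizes, scaled):
    print(n, value)
```

Under this approximation the rescaled bias shrinks like n^(-1/2), consistent with the claim that the bias is negligible relative to the standard deviation for large n.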

Correlation coefficient and linear transformations

In the section Removing correlation it states "It is always possible to remove the correlation between random variables with a linear transformation" however, in Mathematical properties it is stated that the PPMCC "is invariant to changes in location and scale. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants, without changing the correlation coefficient". These two statements seem contradictory to me, in that we're saying simultaneously that for any linear transformation the correlation coefficient remains unchanged, but there still exists a particular linear transformation which removes correlation. The only way I can make sense of this is if the linear transformation results in a dataset with no correlation, but with a non-zero Pearson product-moment correlation coefficient. - Clarification in the text of this apparent discrepancy would be appreciated. -- 140.142.20.229 (talk) 23:01, 15 December 2010 (UTC)[reply]

A clarification has been attempted. I would prefer not to have the Mathematical properties section expanded beyond this as it would be off-topic for that section, but something more might be added to Removing correlation if necessary. But remember this is not a textbook. Melcombe (talk) 09:56, 16 December 2010 (UTC)
I don't know what clarification was attempted, but I came here to comment on the exact same issue as 140.142.20.229: the two statements still appear contradictory (at least to me).
My understanding is admittedly naive but if the geometric interpretation of the correlation coefficient is the cosine of an angle between the vectors corresponding to each set of centered data, how could a full-blown linear transformation ever hope to preserve the correlation coefficient in general? Yes, translations are fine, but surely a change in slope would change the coefficient? --Saforrest (talk) 08:04, 3 February 2011 (UTC)
My understanding is that the correlation is being removed by mixing X and Y. It's a linear transformation on the vector, not just on the individual variables.
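The distinction can be sketched numerically. In the example below (the data and variable names are made up for illustration), rescaling X and Y separately with positive scale factors leaves r unchanged, while replacing Y by the mixed variable Y - βX, with β the least-squares slope, gives a variable uncorrelated with X.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation from centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
r_original = pearson(xs, ys)

# Separate changes of location and scale: X -> a + bX, Y -> c + dY with b, d > 0.
r_rescaled = pearson([3 + 2 * x for x in xs], [-1 + 0.5 * y for y in ys])

# A linear transformation of the vector (X, Y) that mixes the variables:
# Z = Y - beta * X, where beta is the least-squares slope of Y on X.
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs))
zs = [y - beta * x for x, y in zip(xs, ys)]
r_mixed = pearson(xs, zs)

print(r_original, r_rescaled, r_mixed)
```

So the two statements refer to different kinds of linear transformation: one acts on each variable on its own, the other on the pair jointly.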

Correct me if I'm wrong

I'm not extremely familiar with the Pearson Correlation Coefficient, but I needed to implement it as a function for a program I was making, so I implemented this equation from the article:

This seemed pretty straightforward and was easily implemented. I tested it on the dataset x{1,2,3,4,5} y{1,2,3,4,5}, on which it returned r = 1.25. I was surprised, since I thought correlations went from -1 to 1, so I calculated r by hand using the above formula, thinking I had made a mistake in my code, and again received r = 1.25. After this I adjusted the formula to be the average of the products of the standard scores:

Upon which I received the expected r = 1 both by hand and from the program. Can someone confirm if this is the correct formula? --Guruthegreat0 (talk) 00:02, 15 April 2011 (UTC)

The correlation is a ratio involving a covariance and two variances. You can estimate these quantities using the unbiased estimate (normalizing by N-1), or using the maximum likelihood estimate (normalizing by N), see Sample mean and sample covariance. When calculating the correlation coefficient, it doesn't matter which normalization you use, but you should use the same normalization for all three quantities. You are getting r=1.25 because you are using the N-1 normalization for the covariance but the N normalizations for the variances. Skbkekas (talk) 14:27, 15 April 2011 (UTC)[reply]
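The explanation above can be reproduced with a short sketch. The function below is hypothetical: it lets the covariance and the variances use different divisors (N minus a chosen offset) so that the mixed case can be compared with the two consistent ones on the dataset from the question.

```python
import math


def correlation(xs, ys, cov_ddof, var_ddof):
    """Covariance divided by N - cov_ddof, variances divided by N - var_ddof."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    cov = sxy / (n - cov_ddof)
    sd_x = math.sqrt(sxx / (n - var_ddof))
    sd_y = math.sqrt(syy / (n - var_ddof))
    return cov / (sd_x * sd_y)


data = [1, 2, 3, 4, 5]
r_mixed = correlation(data, data, cov_ddof=1, var_ddof=0)          # N-1 covariance, N variances
r_all_n = correlation(data, data, cov_ddof=0, var_ddof=0)          # N everywhere
r_all_n_minus_1 = correlation(data, data, cov_ddof=1, var_ddof=1)  # N-1 everywhere
print(r_mixed, r_all_n, r_all_n_minus_1)
```

With five points the mixed version is inflated by the factor N/(N-1) = 5/4, which matches the reported 1.25; either consistent normalization gives 1 up to rounding.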

Pearson Distance

Is the Pearson Distance really a distance? In the metric sense. Symmetry, positivity and the null property is obvious but does the triangle inequality for sure hold? What we have is that: 1-r(1,2) \leq 2-[r(1,3)+r(2,3)]. That is r(1,2)\geq r(1,3)+r(2,3)-1. (Let the random variables 1,2,3 be just some vectors in R^n, nothing fancy). I don't know how to see this inequality. And by the way, I also don't know how to imagine that two random variables are the same in the general sense (their distributions are the same? Up to some moments?). --78.104.124.124 (talk) 00:16, 11 March 2012 (UTC)[reply]

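I agree. $1 - r$ is not a distance. The correlation coefficient is $r = \cos\theta$, with $\theta$ the angle between the centered data vectors. For $1 - r$ to be a distance would require the triangle inequality $1 - \cos(\theta_1 + \theta_2) \le (1 - \cos\theta_1) + (1 - \cos\theta_2)$, which is not true, thus $1 - r$ is a semi-metric. -- Preceding unsigned comment added by 66.233.184.215 (talk) 18:22, 11 June 2012 (UTC)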
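A concrete counterexample to the triangle inequality asked about above can be sketched as follows, assuming the Pearson distance is d = 1 - r. The three vectors are chosen for illustration: all have mean zero, the first two are orthogonal, and the third lies at 45 degrees to each.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation from centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pearson_distance(xs, ys):
    """Pearson distance d = 1 - r (assumed definition)."""
    return 1 - pearson(xs, ys)


s3 = math.sqrt(3)
v1 = [1.0, -1.0, 0.0]
v2 = [1.0, 1.0, -2.0]             # orthogonal to v1, so r(1,2) = 0
v3 = [s3 + 1.0, 1.0 - s3, -2.0]   # sqrt(3)*v1 + v2, at 45 degrees to both

d12 = pearson_distance(v1, v2)
d13 = pearson_distance(v1, v3)
d23 = pearson_distance(v2, v3)
print(d12, d13 + d23)
```

Going from vector 1 to vector 2 directly is "longer" than going via vector 3, so the triangle inequality fails for these vectors.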

"Direction"

Ok so cool you've mentioned "strength" of linear association - but what about direction? This is what the - and + signs indicate 129.180.166.53 (talk) 08:10, 16 June 2012 (UTC)[reply]

Pearson's r being "misleading"

It should be included that Pearson's r can sometimes be misleading, and that data should ALWAYS be visualized. For example, a perfect parabolic shape shows r=0. Does this mean there is "no" relationship? No. It just means there is no linear relationship. 129.180.166.53 (talk) 08:53, 16 June 2012 (UTC)
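The parabola example is easy to check. The sketch below uses made-up points symmetric about zero with y = x², so y is completely determined by x, yet the linear correlation vanishes.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation from centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]   # exact parabola, symmetric about x = 0
r_parabola = pearson(xs, ys)
print(r_parabola)
```

The symmetry matters here: a parabola sampled only on one side of its vertex would show a clearly nonzero r.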

Scatterplot Graph

I am not convinced that the scatterplot in the second column of the third row of the figure indeed has a correlation of zero (this is the one that looks like a tilted rectangle, but not a diamond). I generated a similar scatter plot with the following R code and obtained a Pearson Correlation Coefficient of .32.

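x <- NULL
y <- NULL
# Build a sheared 100 x 100 grid of points (a tilted parallelogram).
for (xrep in 1:100) {
  for (yrep in 1:100) {
    x <- c(x, xrep + (100 - yrep))
    y <- c(y, xrep + yrep * .5)
  }
}
plot(x, y)
cor(x, y)  # Pearson correlation of the generated points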

(Sorry, I cannot figure out how to get the Wikipedia editor to treat the entire sequence of code as code. This is the best I could do.)

Kmarkus (talk) —Preceding undated comment added 18:46, 14 July 2012 (UTC)[reply]
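For readers without R, the same construction can be sketched in Python. The closed-form value used for comparison is an assumption worked out from the construction: with a and b independent grid coordinates of equal variance V, x = a + 100 - b and y = a + b/2 give cov(x, y) = V/2, var(x) = 2V and var(y) = 1.25V, hence r = 0.5/sqrt(2.5), roughly 0.316.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation from centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs, ys = [], []
for xrep in range(1, 101):
    for yrep in range(1, 101):
        xs.append(xrep + (100 - yrep))   # same shear as in the R loop above
        ys.append(xrep + yrep * 0.5)

r_grid = pearson(xs, ys)
r_expected = 0.5 / math.sqrt(2.5)        # closed form derived in the lead-in
print(r_grid, r_expected)
```

This agrees with the reported value of about .32 for this particular parallelogram; it does not by itself say anything about the shape drawn in the article's figure.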

Requested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section.

The result of the move request was: nomination withdrawn. Favonian (talk) 07:07, 11 August 2012 (UTC)[reply]


Pearson product-moment correlation coefficient → Pearson product–moment correlation coefficient

Per WP:MOSDASH and many authoritative external style guides (even though not always practised by statisticians). Also needs to match sibling article titles. Tony (talk) 00:38, 11 August 2012 (UTC)[reply]

  • Comment – I presume Tony means MOS:DASH. I might support if I understood the intended meaning of this term. Looks like a fair number of articles and books do use the en dash, so it's probably right. But can someone explain the intended relationship between the words product and moment here. Is it a correlation of a product moment (the moment of a product)? That would need a hyphen. Or a correlation between a product and a moment (or between products and moments)? That would need an en dash. Or something else? The article doesn't make this clear, and so far neither do the sources I've looked at. Once we know what it means, deciding the punctuation per MOS:DASH should be easy. Dicklyon (talk) 02:24, 11 August 2012 (UTC)[reply]
Dick, it's a correlation; a correlation coefficient at that. Is that alone not a dash context? Tony (talk) 02:30, 11 August 2012 (UTC)[reply]
It certainly sounds like it. An A–B correlation would normally get an en dash. But I'm not sure that's what it is here. As I look at it, it appears to be a correlation computed as the first moment of the product of the mean-compensated random variables, in which case the hyphen might be signalling that it's a correlation based on a "product moment", no? Until I know more, I have to reserve judgement. Maybe some statisticians will know better what the intended reading is. If hyphen is right, then you're certainly not among the first ones to think otherwise, as lots of sources do use the en dash. Dicklyon (talk) 02:36, 11 August 2012 (UTC)
Tony, hate to tell you, but not this one. See this article which includes essentially our formula, with explanation "For the covariance we shall take the product moment of the deviations of the x's and y's..." That is, it's a correlation coefficient based on a product moment, not a correlation between a product and a moment; more specifically, it's an "origin moment of the product" as this paper calls it, in which "the first origin moment is the mean" (so the correlation is the mean of the product, as the equation shows). Suggest you withdraw the RM. Dicklyon (talk) 03:30, 11 August 2012 (UTC)[reply]
Dick, thanks for that bit of research; I'd have been incapable of it. And it shows the importance of typographical distinctions. Withdrawing RM. (Since the bot is down, I'm flagging this to Favonian.) Tony (talk) 03:33, 11 August 2012 (UTC)[reply]
The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.

Confidence Interval

The last paragraph of the first section of the confidence interval article states "A confidence interval does not predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained. (An interval intended to have such a property, called a credible interval, can be estimated using Bayesian methods; but such methods bring with them their own distinct strengths and weaknesses)." which is inconsistent with "The other aim is to construct a confidence interval around r that has a given probability of containing ρ." from this article. The latter is a common mistake that people make when interpreting the confidence interval. I think this should be changed.

Image

An image has been added to show the x-y axes, values and correlation lines more clearly. The previous image, with its several sets of points, looks like an astrophysics picture: it has no x-y axes and no lines, and the lower sets of points can be confusing. The new image is adapted from an image in Applied Statistics and Probability by Montgomery; similar figures appear in most other statistics books, and it is a classic illustration. Some books do not show the line, for example Statistics, 3rd edition, by S. Ross. -- Preceding unsigned comment added by Kiatdd (talk • contribs) 20:04, 21 October 2012 (UTC)

interpretation

Recently, redundant material has been inserted into this article. There is a section [Interpretation of the size of a correlation] in the article, with references. It is not necessary, as in recent edits, to insert another section about this at the end of the article. Similar edits have been reverted at least twice. Please read the full article before editing. Mathstat (talk) 14:41, 31 December 2012 (UTC)[reply]

Inference

Could we get some examples of applied papers, preferably recent papers, using inference on a correlation? I'm a little curious in particular about the graph showing the minimum sample size to show a correlation significantly different from zero. I don't recall this ever being emphasized or even noted in papers I've read. It would be nice to know whether it is used much in practice and why or why not. — Preceding unsigned comment added by 193.205.23.67 (talk) 13:13, 1 February 2013 (UTC)[reply]

Assuming that the data are bivariate normal, the transformation $t = r\sqrt{(n-2)/(1-r^2)}$ can be applied and critical values of the $t$ distribution with $n-2$ degrees of freedom used for the test decision. The graph simply plots the inverse of that: $r = t_c/\sqrt{n-2+t_c^2}$ as a function of sample size $n$ (where $t_c$ is the critical value for $n-2$ df at level .05). To generate the plot in R:
r <- function(n, sig=.05) {
  q <- qt(1-sig/2, df=n-2)
  q / sqrt(n-2 + q^2)
}
curve(r(x), from=3, to=200)

Most undergraduate statistics textbooks refer to this correlation test for bivariate normal data. See e.g. Larsen and Marx 4e. In practice the plot would not be used for inference because it is easier and more accurate to refer to t critical values. The plot gives an idea about the power of the test - relating the sample size to the critical value in terms of r. Mathstat (talk) 22:30, 1 February 2013 (UTC)[reply]
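For reference, the expression q / sqrt(n-2 + q^2) in the R function above can be obtained by inverting the usual t statistic for a correlation. This is a sketch of that algebra, assuming the standard transformation for bivariate normal data and a nonnegative t:

```latex
\begin{align*}
t = r\sqrt{\frac{n-2}{1-r^2}}
\;\Longrightarrow\; t^2\,(1-r^2) = r^2\,(n-2)
\;\Longrightarrow\; r^2 = \frac{t^2}{n-2+t^2}
\;\Longrightarrow\; r = \frac{t}{\sqrt{n-2+t^2}} \qquad (t \ge 0)
\end{align*}
```

Substituting the two-sided critical value of t for n-2 degrees of freedom gives the smallest |r| that is significant at the chosen level for a sample of size n.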