Talk:Correlation and dependence
|Correlation and dependence has been listed as a level-4 vital article in Mathematics. If you can improve it, please do. This article has been rated as B-Class.|
|WikiProject Mathematics||(Rated B-class, Top-importance)|
|WikiProject Statistics||(Rated B-class, Top-importance)|
|The content of Association (statistics) was merged into Correlation and dependence on 12 June 2016. For the contribution history and old versions of the redirected page, please see ; for the discussion at that location, see its talk page.|
- 1 Unclear phrase
- 2 Direct and Inverse correlations not "positive" and "negative"
- 3 pseudocode
- 4 Correlation coefficient
- 5 numerical instability
- 6 Non-linear correlation
- 7 Section merger proposal
- 8 Correction of a misunderstanding
- 9 Which one is known as canonical correlation
- 10 Pearson correlation "mainly sensitive" to linear relationships??
- 11 Reference update?
- 12 Mistakes in Formulas
- 13 The alarm clock and the dawn
- 14 Correlation and linearity
- 15 Dependence does not demonstrate a causal relationship
"Correlation refers to any of a broad class of statistical relationships involving dependence." I see references to this sentence on the web and it is too vague.— Preceding unsigned comment added by Toncho11~enwiki (talk • contribs) 15:03, 4 November 2015 (UTC)
Direct and Inverse correlations not "positive" and "negative"
Positive and negative are almost never to be used in statistics. These words have arithmetic connotations and are poorly suited for use in statistics especially as descriptive nouns. Direct and Inverse correlation should be used instead of positive and negative in every manner. 18.104.22.168 (talk) 22:30, 13 January 2014 (UTC)
I suggest removing the sections with pseudo code. Wikipedia is not the place for computing tips and tricks (there's some rule about Wikipedia not being a 'how to' site). If there are important algorithmic considerations, they should be presented more formally using correct numerical analysis terminology. And surely, there's no need to show the same algorithm in two languages. —G716 <T·C> 11:10, 12 October 2008 (UTC)
- I suggest NOT removing it, it is quite convenient to skip all these "important" formulas and get to the point. However, there seems to be extra division by N there:
pop_sd_x = sqrt( sum_sq_x / N )
pop_sd_y = sqrt( sum_sq_y / N )
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
seems to be the same as
pop_sd_x = sqrt( sum_sq_x )
pop_sd_y = sqrt( sum_sq_y )
cov_x_y = sum_coproduct
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
- I agree with removing the pseudo code, given that this is not a text book. Melcombe (talk) 10:34, 18 December 2008 (UTC)
- Look, the argument is indeed rational, but deleting information without finding a new home for it is evil, whether or not it fits with the guidelines! Could you perhaps add it to the Ada programming wikibook and/or one of the Python wikibooks? --mcld (talk) 14:32, 19 December 2008 (UTC)
I'm the guy who put the pseudocode there in the first place, because so many people come here looking for how to compute correlation. I agree with Mcld ... don't just delete it! It is fine for the pseudocode to be moved elsewhere with just a link to it here, but it is important and hard-to-find material.
A language-specific site is not really an acceptably general home for it -- this is pseudocode, not an implementation how-to. In addition, if there's a nontrivial risk of the link going stale, then I think the code ought to stay here, where it can actually be useful to people.
For reference, researching stable one-pass algorithms and distilling my findings into that pseudocode took me several hours (and I have a PhD in mathematics). I hope and believe it has saved many people many hours of work.
Frankly, G716 has the right idea: someone should properly write up the numerical analysis, providing the appropriate context for the pseudocode snippet. I simply lack the time to do it myself. But removing the pseudocode because the entry lacks the contextual information seems hasty and wrong. Brianboonstra (talk) 18:37, 6 February 2009 (UTC)
- You need to reference the sources of your research though (don't worry about formatting). Else you're asking any users of the code to just take it on trust, or someone else to repeat your research. Qwfp (talk) 20:00, 6 February 2009 (UTC)
- A reference for this code is certainly required. Further description of the code is also needed. For example, why is a one-pass algorithm better than a two-pass? What is the trade-off in accuracy? How much is the speed increased? Will this speed increase really improve someone's application overall (i.e. is the calculation of r likely to be a bottleneck)? Darkroll (talk) 02:55, 10 February 2009 (UTC)
The pseudocode and related text which was in the article (as of 22 Feb 2009) is reproduced below.
Computing correlation accurately in a single pass
The following algorithm (in pseudocode) will calculate Pearson correlation with good numerical stability. Notice this is not an exact computation: in each iteration only the updated running mean is used (not the exact full-sample mean) before the delta is squared, and this error is not fixed by the sweep factor.

sum_sq_x = 0
sum_sq_y = 0
sum_coproduct = 0
mean_x = x[1]
mean_y = y[1]
for i in 2 to N:
    sweep = (i - 1.0) / i
    delta_x = x[i] - mean_x
    delta_y = y[i] - mean_y
    sum_sq_x += delta_x * delta_x * sweep
    sum_sq_y += delta_y * delta_y * sweep
    sum_coproduct += delta_x * delta_y * sweep
    mean_x += delta_x / i
    mean_y += delta_y / i
pop_sd_x = sqrt( sum_sq_x )
pop_sd_y = sqrt( sum_sq_y )
cov_x_y = sum_coproduct
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
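For anyone who wants to actually run the pseudocode, here is a direct Python translation (a sketch only: the function name is mine, and the 1-based pseudocode index is mapped onto Python's 0-based lists):

```python
from math import sqrt

def one_pass_correlation(x, y):
    """Pearson correlation of two equal-length sequences in a single pass."""
    n = len(x)
    # Initialise the running means with the first observation (x[1] in the pseudocode).
    mean_x, mean_y = x[0], y[0]
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    for i in range(2, n + 1):        # i follows the 1-based pseudocode index
        sweep = (i - 1.0) / i
        delta_x = x[i - 1] - mean_x
        delta_y = y[i - 1] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / i
        mean_y += delta_y / i
    # The 1/N factors cancel between numerator and denominator, so they are omitted.
    return sum_coproduct / sqrt(sum_sq_x * sum_sq_y)
```

In exact arithmetic the accumulators equal the centred sums Σ(x−x̄)², Σ(y−ȳ)² and Σ(x−x̄)(y−ȳ), which is why the final line matches the sample correlation coefficient.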
I have removed this to this discussion page so that it is still available to anyone who wants to see it, but is not included in the article where the general consensus is that it does not belong on this page.
The correct calculation does not require subtracting the mean from each observation while passing through the data: even though this is one way in which it is possible to calculate the correlation, it is not the computationally simplest way to do it, and it would require the means to be found before calculating the deviations from the means (thus requiring a second pass through the data to do it accurately using this approach). Rather, the sum of products Σx_i y_i and the sums of squares Σx_i² and Σy_i² are collected, along with the sums Σx_i and Σy_i; allowance is then made for the means by subtraction at the end of the calculation, i.e. once the means are known, using the sample correlation coefficient formula given in the article:

r = ( n Σx_i y_i − Σx_i Σy_i ) / sqrt( ( n Σx_i² − (Σx_i)² ) ( n Σy_i² − (Σy_i)² ) )
I have explained this in this discussion page hoping that it will satisfy those who thought the code was a useful part of the article. The code is still available, but you are strongly recommended not to use it. Instead do the calculation using the above formula. Alternatively you can use the formula

r = Σ(x_i − x̄)(y_i − ȳ) / sqrt( Σ(x_i − x̄)² Σ(y_i − ȳ)² )

but if you do, you need to calculate the means before you can start, so although this formula is easy to understand, it is slightly less easy to use in practical calculations.
I have a question for the pseudocode. The previous version:
pop_sd_x = sqrt( sum_sq_x / N )
pop_sd_y = sqrt( sum_sq_y / N )
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
The current version
pop_sd_x = sqrt( sum_sq_x )
pop_sd_y = sqrt( sum_sq_y )
cov_x_y = sum_coproduct
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
And someone has said that they are the same, but I am missing one N for them to be the same. Do you know which is the correct one? —Preceding unsigned comment added by 22.214.171.124 (talk) 14:57, 5 October 2009 (UTC)
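For what it's worth, the two versions do give the same result: the two factors of 1/sqrt(N) inside pop_sd_x and pop_sd_y together cancel the 1/N in cov_x_y. A quick check in Python (the accumulator values are made-up examples):

```python
from math import sqrt

# Made-up running sums for illustration; N is the sample size.
sum_sq_x, sum_sq_y, sum_coproduct, N = 26.0, 32.75, 27.0, 4

# Previous version: divide each accumulator by N.
r_with_N = (sum_coproduct / N) / (sqrt(sum_sq_x / N) * sqrt(sum_sq_y / N))

# Current version: no division by N at all.
r_without_N = sum_coproduct / (sqrt(sum_sq_x) * sqrt(sum_sq_y))
```

Algebraically, (S/N) / ((s_x/√N)(s_y/√N)) = S/(s_x s_y), so the "missing" N is not an error.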
Correlation coefficient currently directs here. Should it direct to Coefficient of determination (i.e. r-squared) instead? (note: I'm cross-listing this post at Talk:Coefficient of determination.) rʨanaɢ talk/contribs 03:30, 22 September 2009 (UTC)
- No, it shouldn't. This is the main article on correlation, and defines the correlation coefficient. The article on coefficient of determination mentions the correlation coefficient, but does not define it; in fact it rather presupposes a knowledge of the correlation coefficient. What is more, this is as it should be, both because correlation coefficient is a much more widely known concept than coefficient of determination, and because it makes more sense to redirect upwards to a more general topic than to redirect sideways to a different concept at the same level. JamesBWatson (talk) 13:05, 27 September 2009 (UTC)
You need to be a little bit careful running around claiming algorithms are numerically unstable. Instability depends on the range of numbers used. The one pass algorithm is stable if the full calculation can be done using integer arithmetic without intermediate results overflowing. Charles Esson (talk) 08:11, 8 October 2009 (UTC)
(JamesBWatson left this comment on my user page; since I believe it's of general interest, I'm taking the liberty of moving it here.)
I see that you reverted an edit to the article Correlation with the edit summary "Undid revision 320015831 by JamesBWatson (talk) in stats, "correlation" always refers to linear -- not any -- relationship". I have restored the edit, together with references to three textbooks which use the expression "nonlinear correlation". I could have given many more references; for example, here are just a few papers with the expression in their titles:
- A. Mitropolsky, “On the multiple non-linear correlation equations”, Izv. Akad. Nauk SSSR Ser. Mat., 3:4 (1939), 399–406
- Non-linear canonical correlation analysis with a simulated annealing solution, Sheng G. Shi, Winson Taam (Journal of Applied Statistics, Volume 19, Issue 1 1992 , pages 155 - 165)
- Non-Linear Correlation Discovery-Based Technique in Data Mining, Liu Bo (Intelligent Information Technology Application Workshops, 2008. IITAW '08)
- Ravi K. Sheth (UC Berkeley), Bhuvnesh Jain (MPA-garching), The non-linear correlation function and the shapes of virialized halos.
Google Scholar gives 2790 citations for "non-linear correlation" and 3650 for "nonlinear correlation". I assure you, "correlation" usually, but by no means always, refers to linear correlation. JamesBWatson (talk) 15:37, 31 October 2009 (UTC)
- Thanks for the references, JamesBWatson. I guess the generalization of correlation coefficient for both linear and non-linear associations would require rewriting, e.g., Correlation#Correlation and linearity. Furthermore, we need to define and show how to calculate it. I can see how it could be obtained as , where the variances and the covariance come from a non-simple linear regression (for simple vs. non-simple linear regression, see Regression analysis#Linear regression). Is that what you mean? 126.96.36.199 (talk) 06:34, 1 November 2009 (UTC)
- First, unless "non-linear correlation" is precisely defined, I see no point in just mentioning it in this article. Secondly, if the interpretation above (in terms of variance and covariances) is correct, wouldn't such a non-linearity extend to the PMCC as well? 188.8.131.52 (talk) 05:20, 4 November 2009 (UTC)
Correlation refers only to LINEAR relationships. "Non-linear correlation" (when used correctly) refers to techniques transforming one or more variables from a non-linear relation so that it can be made linear. E.g., suppose there is an exponential relationship between two variables -- clearly not linear. But by transforming one variable (by using logs), the relationship can be expressed linearly, and a linear correlation coefficient can be calculated. (We've all seen linear-log or log-log scale graphs.) Even the references above support this definition. E.g., if you read the Shi & Taam paper -- they define their approach as finding "... a and b which maximize the LINEAR relationship between v and h(u)". The Spearman correlation coefficient mentioned in the article is another example, transforming one or more variables to ranks so it can be analyzed linearly. In any case, it doesn't matter if some papers or books use this term incorrectly. JamesBWatson should not be doing original research by looking at titles, but by looking at AUTHORITATIVE definitions in mathematical texts (in this case, authoritative statistics texts). 184.108.40.206 (talk) 22:38, 20 August 2015 (UTC)
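The log-transform point above can be made concrete: an exact exponential relation y = a·e^(b·x) has a Pearson correlation with x that is well below 1, but after taking logs the relation log(y) = log(a) + b·x is exactly linear, so the correlation becomes exactly 1. A small numpy sketch (the constants and grid are made up for illustration):

```python
import numpy as np

x = np.linspace(1.0, 5.0, 50)
y = 3.0 * np.exp(2.0 * x)                 # exact exponential relationship, clearly nonlinear

r_raw = np.corrcoef(x, y)[0, 1]           # noticeably below 1: the raw relation is not linear
r_log = np.corrcoef(x, np.log(y))[0, 1]   # 1 up to rounding: log(y) is linear in x
```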
Section merger proposal
I disagree with the proposal to move material from this article to the Pearson correlation article. In fact, this issue has been discussed quite a bit in the past, and the consensus was to use the Pearson correlation article for issues related to linear correlation measures of the product-moment type, while the correlation article could cover topics related to pairwise association measures in general. The section on "sensitivity to the data distribution" applies specifically to the Pearson correlation measure. Some parts of it may be more general, but not most of it. I was the person who originally created this section, in both articles. I later came to feel that the section in the correlation article needed to be merged to Pearson correlation, not the other way around. I just hadn't had a chance to do it yet. The proposed merger takes us in the wrong direction. Skbkekas (talk) 03:54, 2 November 2009 (UTC)
- I don't see that it does any harm to keep both sections, but I certainly agree with Skbkekas that if there is to be a merge it should be from Correlation to Pearson correlation, not the other way around. JamesBWatson (talk) 12:10, 2 November 2009 (UTC)
I have changed the merge templates on both articles to indicate moving material to Pearson correlation, with the discussion pointer still pointing here. I have reverted the move already made of some stuff and I think much more should be moved. If this direction of change is what is wanted, it may be best to rename this article to something like "correlation and dependence" to give a better indication of its scope. Melcombe (talk) 16:58, 2 November 2009 (UTC)
- If Skbkekas is right in saying that previous discussion has resulted in a consensus for keeping both sections then I don't see that a merger is justified. I should also like to put it on record that I agree that it is better to keep both of them. JamesBWatson (talk) 08:32, 4 November 2009 (UTC)
- But the question is how much and what material should be in both. The use of the "Main" tag to point to the Pearson correlation article would mean that what should be here is only a summary plus whatever other stuff is required that is relevant to the main topic of the current article. Do we agree that the topic should be "pairwise association measures in general"? I think that topic deserves an article of its own; that is how the article starts, I think, and the direction in which it was being pushed. But there are a number of problems with the articles taken together that can hopefully be reduced by having an appropriate separation of topics. For example, in the case of the product-moment correlation there are three separate concepts: the population value, the "raw" estimate obtained by the usual formula, and other estimates of correlation derived from appropriate non-normal joint distributions. It is hardly made clear which of these is being thought of for the various points being discussed. Melcombe (talk) 10:32, 4 November 2009 (UTC)
- I don't think I said that the consensus of the earlier discussion was to keep both sections. The earlier discussion dealt with how to divide material between the two articles. I like Melcombe's proposal to retitle the correlation article as something along the lines of "correlation and dependence." As far as "sensitivity to the data distribution" goes, I think a section like that belongs in nearly every article about a summary statistic. However, the contents of the section would obviously differ. If the correlation article moves to "correlation and dependence," I'm not sure if there are any general statements that can be made that are applicable in general to correlation and dependence, whereas it is of course possible to say things specifically about Pearson correlation. Skbkekas (talk) 19:57, 4 November 2009 (UTC)
Correction of a misunderstanding
The following comment was placed in the article by 220.127.116.11, in the section Pearson's product-moment coefficient.
- It is important to appreciate that the above description applies to the population and not a small sample. It does not take into account the degrees of freedom. A simple test in Excel shows that the covariance of an array divided by the product of the standard deviations does not give the correct value. For example, when all x=y, r does not = 1. However, if the products of the z scores of each (x,y) pair are divided by (n-1) rather than n then the correct value is obtained.
Firstly, this comment belongs here, not in the article, so I have moved it. Secondly, I shall try to clear up the misunderstanding. The covariance of a sample is calculated by dividing by n, while dividing by n-1 is used to calculate an unbiased estimate of the covariance of the population. Exactly the same applies to calculating the variance of a sample and an unbiased estimate of a population variance. The standard definition, as given in the article, uses the sample covariance and the sample variances. Alternatively you can use unbiased estimates of population values in both cases: the result is exactly the same. However, the result is not the same if you mix the sample covariance and unbiased estimates of population variances: you have to be consistent. I do not normally use Excel, but to prepare for writing this I have looked at it. The function COVAR calculates the covariance of the numbers given (which may or may not be a sample: that is irrelevant). On the other hand the function VAR does not calculate the variance of the numbers given, but rather an estimate of a variance of a population which the numbers are assumed to be a sample from. In order to calculate the actual variance of the numbers given, you have to use the function VARP. Why VAR and COVAR work inconsistently is something only Microsoft programmers can explain. Unfortunately the Excel help files make things even more confusing: for example they say that VARP "Calculates variance based on the entire population", although the numbers are frequently not a population at all. I think it comes down to Microsoft programmers being programmers with a little knowledge of statistical techniques, rather than statisticians. JamesBWatson (talk) 20:41, 19 November 2009 (UTC)
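The consistency point can be illustrated in numpy, where the ddof argument plays the role of Excel's VARP-versus-VAR distinction (the data values below are made up):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 8.0])
y = np.array([2.0, 3.0, 7.0, 9.0])

def corr(ddof):
    # Divide the covariance and both standard deviations by the same n - ddof.
    cov = np.cov(x, y, ddof=ddof)[0, 1]
    return cov / (np.std(x, ddof=ddof) * np.std(y, ddof=ddof))

r_n = corr(0)          # everything divided by n (VARP-style)
r_n_minus_1 = corr(1)  # everything divided by n-1 (VAR-style); identical result

# Mixing the two conventions leaves an uncancelled (n-1)/n factor:
r_mixed = np.cov(x, y, ddof=0)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
```

Either convention gives the same correlation because the n (or n−1) factors cancel; only mixing them, as in the Excel experiment described above, produces a wrong value.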
- I think there is an (n-1) missing from the denominator of the second form of the 'sample correlation coefficient'. Shouldn't it be:
- Aaron McDaid (talk - contribs) 21:06, 24 October 2012 (UTC)
Which one is known as canonical correlation
- None of them: see Canonical correlation. However, this page is not for questions of this kind: it is for discussing editing of the article. JamesBWatson (talk) 19:34, 6 June 2010 (UTC)
Pearson correlation "mainly sensitive" to linear relationships??
In the second paragraph we have the sentence:
- [...] The Pearson correlation coefficient, [is] mainly sensitive to a linear relationship between two variables.
Shouldn't the word "mainly" be changed to "strictly"?
- I disagree. To the contrary, I think "mainly" is already too strict. I think it should be "somewhat more". Skbkekas (talk) 05:09, 12 September 2010 (UTC)
This example shows the linear relationship between x and log(x), which is present to a large extent. log(x) is not linear towards zero, but it is very linear out near 1. Thus the correlation coefficient is reduced by the former but not the latter. I coded up this short program in Python to demonstrate. The visual demonstration of the Pearson correlation is linear regression, as shown below (the blue line is log(x), the red line is the regression). Note that the correlation coefficient is actually near 0.787.
The code for this example is as follows:
import numpy as N
import matplotlib.pyplot as plt
from scipy.stats.stats import pearsonr, linregress

x = N.linspace(.00000001, 1, 1000)
y = N.log(x)
(a, b, r, tt, stderr) = linregress(x, y)
z = a*x + b
print r
plt.plot(x, z, 'r')
plt.plot(x, y)
plt.savefig('x_vs_logx.png')
r, p = pearsonr(x, y)
print r
it returns the above plot as well as this output of the Pearson correlation coefficient (calculated in two places independently):
This isn't a big deal, but the numerical calculation above is giving the wrong answer, since the numerical approximation to the definite integral is very sensitive to how the limiting behavior at zero is handled. Doing the calculation analytically, you get -1/4 for E(X*Y) (using integration by parts), and you get -1/2 for EX*EY (using the fact that Y follows a standard exponential distribution). Thus cov(X,Y) = 1/4. The variance of X is 1/12 and the variance of Y is 1. Thus the correlation is sqrt(12)/4 = 0.866.
The larger issue about whether the Pearson correlation is "strictly" sensitive to a linear relationship amounts to how you interpret the word "strictly". Many people would incorrectly interpret this as implying that the Pearson correlation is blind to relationships that aren't perfectly linear. I also would argue that the plot above exaggerates the approximate linearity of log(x), based on the very large range of the vertical axis. Skbkekas (talk) 02:12, 16 September 2010 (UTC)
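Skbkekas's analytic values can be checked numerically while sidestepping the singular endpoint, e.g. with a midpoint rule that never evaluates log at 0 (a plain-Python sketch; the grid size is my choice):

```python
from math import log, sqrt

# X ~ Uniform(0,1), Y = log(X).  The midpoint rule avoids x = 0 entirely.
n = 1_000_000
h = 1.0 / n
E_xy = h * sum((k + 0.5) * h * log((k + 0.5) * h) for k in range(n))  # approximates -1/4
E_x, E_y = 0.5, -1.0          # E[log X] = -1 because -log(U) ~ Exp(1)
cov = E_xy - E_x * E_y        # approximately -1/4 + 1/2 = 1/4
rho = cov / sqrt(1.0 / 12.0)  # Var(X) = 1/12, Var(Y) = 1, so rho -> sqrt(12)/4
```

This reproduces E(XY) = −1/4 and the correlation sqrt(12)/4 ≈ 0.866 without any delicate handling of the limit at zero.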
- Thanks for that correction, Skbkekas. I was playing fast and loose with my numerical approx and you're totally right about the limiting behavior at zero, considering log(x) shoots off to -inf. I reran my code with my function representation parameters maxed out, i.e. changing the line
x = N.linspace(.00000001,1,1000)
to
x = N.linspace(1e-150,1,6.5e7)
(the smallest interval start and the largest vector size, respectively, that Python running on my computer can handle)
- and I get the value
- Note that the correlation actually increases towards your analytic limit, because using a smaller discretization of the x-axis amounts to giving less weight in the calculation to the values of log(x) really close to and including x=1e-150.
- As to how correlation handles non-linear relationships, I think we're getting caught up on the word "relationship". Yes log(x) has an explicit non-linear relationship to x, but that's a different sense of the word "relationship" than what Pearson correlation measures. Pearson correlation measures the degree to which log(x) is "linear-ish" to use some colloquial language. That is, it measures what the relationship of log(x) is not to x, but to a linear approximation of itself along the specified interval of x. And, as the last paragraph points out, there is only a relatively small sub-interval that log(x) is not linear.
- To demonstrate this, I ran my code again, but now with the above line changed to
x = N.linspace(.2,1,6.5e7)
- Your comment about the log(x) axes is fair (I let Python choose them before), and I plotted again with the axis restricted to a minimum of -8. I'm including that figure here, and also a figure showing the regression on the sub-interval [.2,1] mentioned above.
@Watson The log(x) function on the unit interval (i.e., [0,1]) is not a suitable example for a numerical solution. By this means, the correlation coefficient for a linear fit is ill-defined. Actually, in view of your edit history, you should have come to this conclusion yourself. How can you seriously be changing the value of this correlation coefficient in the article without understanding that even the new value is not the true value, simply because you cannot find it with your tool? Remember, at first you thought 0.787 was close enough to the true value. Tomeasy T C 06:50, 23 September 2010 (UTC)
There is a citation given, near the bold term anticorrelation (ref number 5) to Dowdy, S. and Wearden, S. (1983). "Statistics for Research". Wiley. ISBN 0471086029 pp 230. This is the first edition and the latest is the 3rd (Detail and online subscription version)... can anyone say whether this term does (still) appear and so update the reference and page number? Melcombe (talk) 14:17, 21 September 2010 (UTC)
Mistakes in Formulas
Yesterday I changed some major mistakes in the correlation coefficient (look at the edits eliminating the (n-1) in the denominator). I think these kinds of mistakes are unacceptable and inexcusable, and that a warning should be added to the article saying that its reliability or quality is poor, at least until a couple of experts devote some time to verifying the quality of the info. 18.104.22.168 (talk) 16:08, 17 January 2011 (UTC)
- It was right before, with the n–1 terms in the divisors. However, these just serve to cancel out the n–1 in the formula for s given at Standard deviation#With sample standard deviation. I'll add another expression to the first display formula for rxy to make this a bit clearer. One quick way to see that the formulas you left had to be wrong is to consider what would happen if you computed the correlation for a sample with two copies of all the observations in the original sample. Clearly this should have no effect on the estimated correlation. --Qwfp (talk) 21:15, 17 January 2011 (UTC)
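Qwfp's duplication test is easy to run: doubling every observation must leave the correlation unchanged, and it does when the n−1 factors appear consistently, as in the standard formula (a numpy sketch with made-up data):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 8.0])
y = np.array([2.0, 3.0, 7.0, 9.0])

r = np.corrcoef(x, y)[0, 1]

# Two copies of every observation: the sample correlation must not change,
# which it wouldn't if a stray n-1 factor survived in the formula.
r_doubled = np.corrcoef(np.tile(x, 2), np.tile(y, 2))[0, 1]
```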
The alarm clock and the dawn
Why is the correlation between alarm clocks ringing and dawn such a bad example of correlation without causation? A ringing alarm clock does not cause the dawn, nor does the dawn cause the alarm clock to ring. The dawn (or anticipation of it) causes a person to set the alarm clock, which causes the alarm clock to ring, but this is not the same thing. Anyway, would the correlation of the forces on various objects and the time rate of change of their momenta be a better example? Probably not: it's a correlation resulting from the definition of force, an imposed correlation. Maybe two planets of similar mass rotating around two stars of similar mass - they would have a correlation in their position, but that smells like an imposed correlation too. I can't think of an example of unimposed correlation without some series of causative links. PAR (talk) 05:07, 10 June 2013 (UTC)
Correlation and linearity
When discussing the scatterplots of Anscombe's quartet, the comments about normality are debatable, especially:
The first one (top left) seems to be distributed normally
Based only on the 11 points, it's hard to say anything: the data could come from a normal distribution as well as from a Student, Laplace, Cauchy, or many other distributions. Even a uniform distribution seems hard to rule out.
The second one (top right) is not distributed normally
There is actually no more evidence for the normality of the data on the first graph than on the second graph. The difference is that the variance seems very low in the second graph. That does not mean that the data is not normally distributed.
Also it is not clear in the section what is/isn't normally distributed. I'm guessing that the contributor was referring to the variable Y, as in both top graphs, the X variable seems more uniformly distributed than normally distributed.
Overall, I think that we should improve this section. I'm willing to do so, for example by removing all the comments about normality. Has someone a better idea? --Nicooo (talk) 02:57, 22 February 2015 (UTC)
- From our article Anscombe's quartet, it appears that the comments about normality may appear in Anscombe's paper. Unfortunately I cannot access the paper. Can you? Maybe that paper makes it clearer what is intended. Loraof (talk) 19:34, 27 February 2016 (UTC)
Dependence does not demonstrate a causal relationship
The page says "dependence is not sufficient to demonstrate the presence of such a causal relationship". Is it? I thought that correlation doesn't imply causality, but that dependence does. If we consider all forms of causality and all forms of dependence, shouldn't dependence at least imply some sort of causality? 7804j (talk) 00:35, 7 June 2016 (UTC)
- I understand your point—"dependence" sounds like it means "causal dependence". But it does not mean that in the present context—paragraph 3 of the lead says "Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence." This is a feature of joint probability distributions, which are descriptions that do not address the issue of causation. Loraof (talk) 15:24, 11 June 2016 (UTC)
- So do you agree that the relationship between "dependence" and "causality" is not clearly explained in this article? Maybe this should be clarified? Because the page refers a lot to the fact that correlation is not an indication of causality, but never clearly addresses the case of dependence. And I have seen multiple people on other websites such as Quora having some very strong conflicting opinions about whether dependence should always be considered as implying causality or not.7804j (talk) 22:46, 11 June 2016 (UTC)
- I think the sentence I quoted above is quite clear. Also, the first sentence of the article is also clear: "In statistics, dependence is any statistical relationship between two random variables or two sets of data." But I'll add the phrase "whether causal or not" to this sentence.