Talk:Spearman's rank correlation coefficient

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale.
This article has been rated as High-importance on the importance scale.
 
WikiProject Mathematics (Rated C-class, Low-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating: C-Class, Low-importance. Field: Probability and statistics.
One of the 500 most frequently viewed mathematics articles.

Potential Copyright Violation:

I was reading "Quantitative Measurement of Scores by Ranks" by Gayatri and Prasad and noticed substantial similarities between its text describing Spearman's rank correlation coefficient and this Wikipedia entry. The text is a conference paper from the "2011 International Conference on Advancements in Information Technology" and can be accessed HERE (scroll down to the section on the Spearman CC for the text in question).

The following text from the Gayatri publication is a perfect match with the current Wikipedia text:

"In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other."

The text in question dates from 2011, and the wiki article text is present in the history prior to this date.

216.15.21.56 (talk) 21:10, 9 January 2016 (UTC)

Apparent Technical Error/Inconsistency

The method introduced at the beginning of the article and used during the bulk of the article is not appropriate for data with tied ranks, but the very first example includes tied ranks, along with a confusing note that for data without tied ranks a more convenient form exists (presumably the form just presented)? Only towards the end of the article is there a note that for data with tied ranks, Pearson's formula should be used. — Preceding unsigned comment added by 212.44.45.225 (talk) 17:15, 18 February 2015 (UTC)

This is indeed confusing. According to the Simple English Wikipedia on Spearman's rho, the formula as given can also be used for ties, citing statistics4u.info. However, the top Google result, Laerd Statistics, claims that this is not allowed and presents a square-root formula for this case. The Dutch Wikipedia on Spearman's rho makes a clear distinction and provides yet different (derived) formulas. The German Wikipedia claims that the formula, although strictly not applicable, results in only minor deviations if the number of ties is small, citing a surprisingly clear paper from 1942 on this topic. Nevertheless, the German page proceeds to provide a separate formula for ties, still different from the English and the Dutch ones. The Italian Wikipedia agrees with the German page on the negligibility of the problem and does not provide an alternative formula. Any suggestions?
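For what it's worth, the size of the discrepancy can be checked numerically. The sketch below is plain Python with helper names of my own choosing (they do not come from any of the cited pages), and the data are made up for illustration. It computes the coefficient two ways: as Pearson's r on average (fractional) ranks, and with the shortcut 1 − 6Σd²/(n(n² − 1)).

```python
from math import sqrt


def average_ranks(values):
    """Fractional ranks, 1-based: tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def spearman_pearson_on_ranks(x, y):
    """Pearson's r applied to the average ranks; usable with ties."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_shortcut(x, y):
    """Shortcut formula 1 - 6*sum(d^2) / (n(n^2 - 1)); exact only without ties."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Illustrative data (assumed, not from the article).
no_ties = ([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
one_tie = ([1, 2, 2, 3], [1, 2, 3, 4])
for x, y in (no_ties, one_tie):
    print(spearman_shortcut(x, y), spearman_pearson_on_ranks(x, y))
```

Without ties the two agree (0.8 for the first data set). With a single tie in x they differ slightly: 0.95 from the shortcut versus about 0.9487 from Pearson on ranks, which fits the German page's remark about minor deviations when ties are few.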

Move

This page was formerly at "Spearman's ρ" -- however, this breaks "move page" and is against the general naming principle that names should be the most common name, in English, where available. User:The Anome

Well, this may not be the most common name in English, but the change makes sense because it facilitates linking, especially now that the main alternative titles have redirect pages. I'll remember that next time. Thanks. User:Jfitzg
Would Spearman's rank correlation coefficient be nicer? -- Oliver P. 16:37 28 May 2003 (UTC)
That might definitely be better. Perhaps the main article should be there. John F.
So I moved it. Should have all the bases covered now. John F.

Clarifications

Schemilix (talk) 17:02, 12 October 2010 (UTC) Just a question on the graphic for the formula... I was under the impression it was n cubed, not n squared, that was part of the denominator for the equation. It's d^2 and n^3.

—Preceding unsigned comment added by Schemilix (talkcontribs) 17:01, 12 October 2010 (UTC)
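On the n squared versus n cubed point, both readings can be right, depending on how the denominator is written. Assuming the usual no-ties formula, where d_i is the difference between the two ranks of observation i and n is the number of pairs, expanding the denominator gives:

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
     = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}
```

So a graphic showing n(n² − 1) and a textbook showing n³ − n are the same expression.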

From the article:

The value of ρ is equivalent to the Pearson product-moment correlation coefficient for the correlation between the ranked data.

Is this an identity or an approximation? If it is an identity, is it by definition, by coincidence, or just for some particular family of distributions?

-- —Preceding unsigned comment added by 217.158.203.203 (talkcontribs)

Thanks. I'll clarify this. It's a special case of the Pearson. -- —Preceding unsigned comment added by Jfitzg (talkcontribs)
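A short derivation may help with the "identity or approximation" question. The notation is my own: R_i and S_i are the ranks of x_i and y_i, d_i = R_i − S_i, and there are assumed to be no ties, so each set of ranks is a permutation of 1, ..., n. Under that assumption, Pearson's r on the ranks reduces exactly to the shortcut formula:

```latex
\begin{align}
\bar R = \bar S &= \frac{n+1}{2}, \qquad
\sum_i (R_i - \bar R)^2 = \sum_i (S_i - \bar S)^2 = \frac{n(n^2-1)}{12} \\
\sum_i d_i^2 &= \sum_i \big[(R_i - \bar R) - (S_i - \bar S)\big]^2
             = \frac{n(n^2-1)}{6} - 2 \sum_i (R_i - \bar R)(S_i - \bar S) \\
r &= \frac{\sum_i (R_i - \bar R)(S_i - \bar S)}{n(n^2-1)/12}
   = 1 - \frac{6 \sum_i d_i^2}{n(n^2-1)}
\end{align}
```

So it is an identity for untied ranks, independent of the distribution of the data. With ties the rank variances are smaller than n(n² − 1)/12 and the last step no longer holds exactly.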

You say that "...Unlike the Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales; it can be used for variables measured at the ordinal level...." and "..However, Spearman's rho DOES ASSUME that subsequent ranks indicate equi-distant positions on the variable measured...". On the other hand, in the definition of interval scale they say: "..The numbers assigned to objects have all the features of ordinal measurements, and in addition equal differences between measurements represent equivalent intervals...". So, technically, Spearman's rank correlation DOES require the variables to be measured on interval scales?

The article states that the values are converted to rankings. From the example given, this looks like fractional rankings. Is this correct? 130.88.90.73 (talk) 14:55, 8 October 2008 (UTC)


In the section on determining significance, the z-score is defined with reference to a function F(r) - what is this mystery function? Flies 1 (talk) 20:13, 30 June 2010 (UTC)

Questions

Does it say anywhere in this article that ρ is always a number between −1 and 1, and what high and low values mean? -- —Preceding unsigned comment added by Matt me (talkcontribs)

Request for help

This is related to Spearman's rank correlation. The coefficient of correlation of two firms is −0.10714. Find the number of companies if the sum of squared differences is 62. -- —Preceding unsigned comment added by 61.2.189.36 (talkcontribs)
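One way to approach this, assuming the no-ties formula ρ = 1 − 6Σd²/(n(n² − 1)) is the one intended: rearranging gives n(n² − 1) = 6Σd²/(1 − ρ), and n can then be found by a small search. The function name below is hypothetical and the sketch assumes ρ ≠ 1.

```python
def pairs_from_rho(rho, sum_d2, max_n=1000):
    """Find the integer n whose n^3 - n is closest to 6*sum_d2 / (1 - rho).

    Assumes the no-ties shortcut formula and rho != 1.
    """
    target = 6 * sum_d2 / (1 - rho)
    return min(range(2, max_n + 1), key=lambda n: abs(n ** 3 - n - target))


print(pairs_from_rho(-0.10714, 62))
```

For the numbers in the question this gives n = 7, since 7³ − 7 = 336 and 1 − 6·62/336 ≈ −0.10714.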

Suggestion

It may be helpful to explain how the data is ranked, e.g. from highest to lowest... and apparently Spearman's does assume the direction of the relationship is constant, e.g. rising or falling. -- —Preceding unsigned comment added by 82.37.10.168 (talkcontribs)

  • I agree, I think maybe having an example question would be very helpful, and an explanation of the ranks as well. 82.198.250.66 08:20, 19 September 2006 (UTC)

One tailed, two tailed

The difference between a one-tailed and a two-tailed test isn't mentioned, not that I know what it is; neither is the need for a null hypothesis. Sam Hayes 09:20, 17 April 2006 (UTC)


The working out of this correlation is wrong; the answer is actually −0.28. This page needs to be reviewed and its sources' validity questioned. I am only 15 and have worked this out correctly. Lauren Campbell, Mon 26th February. By the way, I'm not meaning to brag or anything :)

help

It would be helpful if this was explained more simply, since I am only a GCSE student. I do not understand what is going on. :( -- —Preceding unsigned comment added by 86.7.149.45 (talkcontribs)

Yeah, it would help if it were made intelligible for those who can't understand maths. I don't understand the page and I've been doing this in college and uni. :'( —Preceding unsigned comment added by 86.1.198.198 (talkcontribs)
The reason you don't understand this is because the working is all wrong; the data has to be ranked from highest to lowest, not lowest to highest as has been shown. Lauren Campbell, Mon 26th February 2007. It's not actually that difficult.

And the formula is (sometimes) incorrect. I know this and I'm only 14. Someone needs to do something about this, I would but I don't know how to do the symbols. George bennett 09:01, 9 July 2007 (UTC)

Some suggestions for improving readability:

1. In the first table (immediately below the formula) the ranking is labelled as descending, although the text says it should be ascending and the ranking actually is ascending (cf. the comment by Lauren Campbell above). I think the writer has got in a muddle about which is which and which it should be. In any case this seems a needlessly confusing data set: I had to read it twice to make sense of it. Simple numbers like 8-12-12-23 and, say, 80 or even 30 (rather than 180) would do the job better than a set of tiny numbers to 1 d.p. contrasted with the huge 18 with no decimal places.
2. Something has been left out in the editing, because I cannot find the figure of 1.06 (referred to by dfrankow under "Calculating a confidence interval") anywhere.
3. The graphs are awkward to read: (a) the Y-axis has numbers oriented to be read in the same plane as those on the X-axis, but the actual label Y is placed upon its side and looks more like an arrow, particularly at step 4 of the worked example; and (b) the graphs are laid out to interrupt the text, and one graph is unnecessarily repeated.
4. An explanation, in the worked example, of how to use the P-value would be sensible, since this page is used by people of varying backgrounds.

— Preceding unsigned comment added by Ardj (talkcontribs) 10:39, 27 February 2012 (UTC) Ardj (talk) 10:44, 27 February 2012 (UTC)

Hypothesis test with the student t-distribution

The authors don't mention the number of degrees of freedom for the aforementioned t-distribution which makes the information useless in practical terms :-|

89.164.3.138 07:55, 22 December 2006 (UTC)

I've just added an example

That walks through the process of doing Spearman's rank by hand. I'm not sure if I got the right tone but hey. I just think this article was crying out for an example. The data is my own of course. --Grimboy 14:38, 10 February 2007 (UTC)

Thanks! The example was good but I've just changed it to remove all ties. It seems the formula cannot be used for ties. There is a tie-corrected formula and it seems the Japanese wikipedia page has it, but I cannot read it... --Rayjapan 05:53, 29 May 2007 (UTC)


The value of "d" in the example is lacking the correct sign. For clarity, it should be |d| or the minus sign should be added where appropriate. —Preceding unsigned comment added by 131.130.41.124 (talk) 13:39, 15 January 2008 (UTC)

Spearman's rho vs. Spearman's rank correlation

The article states that Spearman's rho is a case of Spearman's rank correlation. Really? I think these are two names for the same thing, and the formula named here "Spearman's rho" is just one of the estimators used to estimate Spearman's rank correlation/Spearman's rho in the population. Olaf m (talk) 00:49, 16 May 2008 (UTC)

Fixed. Olaf m (talk) 19:11, 28 May 2008 (UTC)

If anyone has a spare moment…

Over on Judgment of Paris (wine) there's a statement "The original rankings appear to have been valid. The original and the Ashenfelter and Quandt re-calculations demonstrate a very high Spearman rank order correlation coefficient of .923." I have no idea whether this is 100% accurate or 100% BS. If anyone has a spare moment, could they have a look at this? Nunquam Dormio (talk) 15:46, 14 July 2008 (UTC)

Pearson correlation

Pearson product-moment correlation coefficient, a similar correlation method that instead relies on the data being linearly correlated.

In what sense does it "rely" on the data being linearly correlated? It's legitimate for any bivariate distribution. 72.75.93.12 (talk) 22:33, 9 April 2009 (UTC)

joint frequency distribution of the variables

I am concerned that the statement "without making any assumptions about the joint frequency distribution of the variables" may be somewhat misleading to readers. In particular that it would be read to imply that this is something special to Spearman's rank correlation coefficient; however other common measures of correlation, e.g. Kendall's tau and Pearson product-moment do not make any assumptions about the joint frequency distributions either; maybe rephrasing to something like "and whose properties are robust to assumptions about the joint frequency distribution of the variables"? Aetheogamous (talk) 16:31, 21 April 2009 (UTC)

The intent is to say that it does not prejudge the form of the relationship between variables in the sense that the Pearson coefficient "looks for" a linear relationship. It is this aspect of the joint distribution that is being targeted in what is written, not the marginal distributions and not the distribution of the residuals as might be implied by considerations of robustness. Of course, there may also be separate considerations of robustness, but I am not sure that you can do/say anything about robustness unless you start making assumptions about the form of relationship. However, there may be something worth saying (if not already said): that the value of the coefficient is invariant to monotone transformations of either or both variables. Melcombe (talk) 16:57, 21 April 2009 (UTC)
Since there are several aspects of the joint distribution at play here and one in particular is being targeted, maybe the text could directly target the aspect that is intended. Something like "without making any assumptions about the nature of the monotonicity"? Aetheogamous (talk) 17:27, 21 April 2009 (UTC)
I have put in an attempt to do this, but the extra sentence added now needs further thought. Melcombe (talk) 08:59, 22 April 2009 (UTC)
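The invariance under monotone transformations mentioned in this thread is easy to illustrate. The sketch below is plain Python with made-up data and helper names of my own. It assumes distinct values (no ties) and strictly increasing transformations, here exp(x) and y³. Spearman's coefficient is unchanged because the ranks are unchanged, while Pearson's r on the raw values moves.

```python
from math import exp, sqrt


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def ranks(values):
    """1-based ascending ranks; assumes all values are distinct."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


def spearman(x, y):
    """Spearman's coefficient as Pearson's r on the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative data (assumed).
x = [0.5, 1.0, 2.0, 3.5, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

# Strictly increasing transformations of each variable.
x_t = [exp(v) for v in x]
y_t = [v ** 3 for v in y]

print("Spearman:", spearman(x, y), spearman(x_t, y_t))
print("Pearson: ", pearson(x, y), pearson(x_t, y_t))
```

A strictly decreasing transformation of one variable would flip the sign of the coefficient but leave its magnitude unchanged.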

MOSMATH

I just found the following in this article:

with degrees of freedom N-2.

I changed it to this:

with degrees of freedom N − 2.

Even today, some people don't know the norms of WP:MOSMATH. Michael Hardy (talk) 20:42, 10 June 2009 (UTC)

 Even today, some people don't know the difference between typesetting and knowledge.  —Preceding unsigned comment added by 24.62.203.42 (talk) 20:24, 28 July 2010 (UTC) 

Implementations of Spearman's rho

The MS Excel formulas

{=CORREL(RANK(x,x),RANK(y,y))}

or (for Excel 2010):

{=CORREL(RANK.EQ(x;x);RANK.EQ(y;y))}

work fine when the data set has no ties.


When the data set has ties, use:

{=CORREL(RANK(x;x)+(COUNTIF(x;x)-1)/2;RANK(y;y)+(COUNTIF(y;y)-1)/2)}

or (for Excel 2010)

{=CORREL(RANK.AVE(x;x);RANK.AVE(y;y))}


Reference: Spearman's rank correlation coefficient in Microsoft Excel (German) — Preceding unsigned comment added by Christian.sch (talkcontribs) 11:57, 21 July 2011 (UTC)
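For readers without Excel, the tie-handling formula above can be mirrored in a few lines of Python. This is only a sketch with a function name of my own. It relies on Excel's RANK ranking in descending order by default and giving tied values the best (smallest) rank of the group, so adding (COUNTIF − 1)/2 turns that into the average rank.

```python
def rank_avg_descending(values):
    """Average ranks, largest value = rank 1.

    Mirrors RANK(x; x) + (COUNTIF(x; x) - 1) / 2 element by element:
    1 + (number of strictly larger values) is Excel's descending RANK,
    and (count of equal values - 1) / 2 shifts ties to their mean rank.
    """
    return [
        1 + sum(v > x for v in values) + (values.count(x) - 1) / 2
        for x in values
    ]


print(rank_avg_descending([10, 20, 20, 30]))
```

Ranking both variables in descending rather than ascending order does not change the resulting correlation, as long as the same direction is used for both.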

And what is this doing in an encyclopaedic article? I strongly believe this should be removed. Tomcrocker (talk) 12:50, 10 October 2011 (UTC)
This is an article talk page. I agree that that information almost certainly should not be in the article, per WP:NOT#HOWTO, but the request does not seem absurd. — Arthur Rubin (talk) 13:46, 10 October 2011 (UTC)
Sorry, I didn't see it in the article, at first. I removed it there. — Arthur Rubin (talk) 13:48, 10 October 2011 (UTC)

Calculating a confidence interval

It would be nice to explicitly add how to calculate a confidence interval for the correlation.

StackExchange discusses how to turn the z-score in this section (Spearman's_rank_correlation_coefficient#Determining_significance) into a confidence interval. However, where does the 1.06 in the Wikipedia article come from? dfrankow (talk) 18:47, 26 November 2011 (UTC)

From one or both of the citations given at the end of the sentence where 1.06 appears. Melcombe (talk) 10:13, 28 November 2011 (UTC)
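In case it helps the discussion, here is a rough sketch of how such an interval might be computed. It assumes that F(r) in the article is the Fisher transformation artanh(r) and that artanh(r) is approximately normal with variance 1.06/(n − 3). Both are my reading of the section, not something confirmed in this thread, and the function name is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def spearman_ci(r, n, level=0.95):
    """Approximate confidence interval for Spearman's rho.

    Assumes artanh(r) is roughly normal with standard error
    sqrt(1.06 / (n - 3)); requires n > 3 and -1 < r < 1.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sqrt(1.06 / (n - 3))
    centre = atanh(r)
    # Transform the interval back from the Fisher scale to the rho scale.
    return tanh(centre - half_width), tanh(centre + half_width)


low, high = spearman_ci(0.5, 30)
print(low, high)
```

Because the interval is built on the transformed scale and mapped back with tanh, it always stays inside (−1, 1) and is not symmetric around r unless r = 0.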


It would also be nice to provide a description, evidence or link on how to get the p-value from the ρ-value; "(from the t-distribution)" does not explain how Spearman's rank helps determine a p-value. Template:Bladavier

According to the documentation for the test implemented in R, "For Spearman's test, p-values are computed using algorithm AS 89 for n < 1290 and exact = TRUE, otherwise via the asymptotic t approximation..." where AS 89 refers to the algorithm published in "D. J. Best & D. E. Roberts (1975), Algorithm AS 89: The Upper Tail Probabilities of Spearman's rho. Applied Statistics, 24, 377–379." The asymptotic t approximation is that the statistic

t = r √((N − 2) / (1 − r²))

has an approximate Student t distribution with N − 2 degrees of freedom. Mathstat (talk) 23:13, 8 March 2012 (UTC)

Actually, if we run a regression (with intercept) of y-ranks on x-ranks (or vice versa), then the above t-test is exactly the test for the slope coefficient (the regression coefficient) being equal to zero. (I changed the p-value to the correct one in the example.) [Lang 21:26 June 2014 (UTC)]
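The equivalence with the regression slope test can be checked numerically. The sketch below (plain Python, my own function names, made-up ranks) computes t = r √((N − 2)/(1 − r²)) from the rank correlation, and separately the usual t-statistic for the slope of an ordinary least-squares fit of the y-ranks on the x-ranks. It assumes the inputs are already ranks and that |r| < 1.

```python
from math import sqrt


def sums_of_squares(a, b):
    """Return (Sxx, Syy, Sxy) of centred sums of squares and cross-products."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sxx = sum((p - mean_a) ** 2 for p in a)
    syy = sum((q - mean_b) ** 2 for q in b)
    sxy = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    return sxx, syy, sxy


def t_from_correlation(rank_x, rank_y):
    """t = r * sqrt((N - 2) / (1 - r^2)) with r the Pearson r of the ranks."""
    n = len(rank_x)
    sxx, syy, sxy = sums_of_squares(rank_x, rank_y)
    r = sxy / sqrt(sxx * syy)
    return r * sqrt((n - 2) / (1 - r * r))


def t_from_slope(rank_x, rank_y):
    """t-statistic for the OLS slope of rank_y regressed on rank_x (with intercept)."""
    n = len(rank_x)
    sxx, syy, sxy = sums_of_squares(rank_x, rank_y)
    slope = sxy / sxx
    residual_ss = syy - slope * sxy
    slope_se = sqrt(residual_ss / (n - 2) / sxx)
    return slope / slope_se


# Illustrative ranks (assumed).
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
print(t_from_correlation(rank_x, rank_y), t_from_slope(rank_x, rank_y))
```

Both functions return the same value, which is then referred to a t distribution with N − 2 degrees of freedom.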

Standard Error

Is the formula given for the standard error in rho (Attributed to Pearson, 1907) really correct? It is basically independent of n (except for n very small) or anything else, and equal to 0.65. How can this be right? Vortimer (talk) 11:57, 3 September 2013 (UTC)

Null hypothesis, what null hypothesis ?

In classical/frequentist statistics, hypothesis testing is about the population distribution. Therefore, if one is interested in a given parameter θ (something such as the expectation of the distribution, or anything else, "parameter" being understood in a broad sense), the hypothesis under test should be about θ. Testing something about a sample characteristic such as the empirical correlation coefficient, or the empirical mean (average sample value), is meaningless.

Now, let's talk about the Spearman correlation coefficient and call it Γ. It is correctly defined in the article and it should be clear to everyone that its value is entirely known as soon as we observe the sample values of (X_i,Y_i), i=1,...,n. In other words, the Spearman coefficient cannot be a population parameter. As such, things such as "testing the value of the Spearman coefficient" or "testing whether it is zero" make no sense.

Taking things a bit further, one may think of the Spearman coefficient Γ as a random variable resulting from making n identical independent draws at random from the distribution of (X,Y). Now Γ has a distribution that depends on that of the pair (X,Y) and on the number n of pairs drawn to constitute the sample. This distribution certainly has an expectation and moments of any order (by construction, Γ is a bounded random variable). Now this distribution and its moments are population characteristics. If one talks about testing something on Γ, the hypothesis under test should be about the (population) distribution of Γ (or its parameters).

Now my question: when the article talks about "testing if rho is 0", what is the null hypothesis? In what respect does this null hypothesis restrict the possible population distribution of (X,Y)? Is the null hypothesis saying that X and Y are independent (P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B))? — Preceding unsigned comment added by 81.66.244.154 (talk) 17:57, 10 June 2015 (UTC)
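One concrete reading of the question, offered as a sketch rather than as what the article intends: if the null hypothesis is that X and Y are independent (with no ties), then every pairing of the y-ranks with the x-ranks is equally likely, and the exact null distribution of the coefficient follows by enumerating all n! permutations. The code below does this for small n. The function name is my own, and the integer statistic n(n² − 1) − 6Σd² is used in place of ρ (it is proportional to ρ) to avoid floating-point comparisons.

```python
from itertools import permutations


def exact_two_sided_p(rank_x, rank_y):
    """Exact two-sided p-value for Spearman's rho under independence.

    Assumes untied integer ranks and small n (the loop visits n! pairings).
    """
    n = len(rank_x)
    scale = n * (n * n - 1)

    def stat(candidate_y):
        # Integer quantity proportional to rho: n(n^2 - 1) - 6 * sum(d^2).
        sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, candidate_y))
        return scale - 6 * sum_d2

    observed = abs(stat(rank_y))
    total = 0
    extreme = 0
    for candidate in permutations(rank_y):
        total += 1
        if abs(stat(candidate)) >= observed:
            extreme += 1
    return extreme / total


print(exact_two_sided_p([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

For these five pairs (ρ = 0.8), 16 of the 120 pairings are at least as extreme, so the two-sided p-value is 16/120 ≈ 0.133. Note that this tests independence; a population in which X and Y are dependent can still have a rank correlation of zero.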

Related to document retrieval results ranking?

This article is in Category:Information retrieval evaluation but skimming it I don't see how it's related to ranking of document retrieval results, like for Internet search engines. If it is related, this should be explained in the article; otherwise, I think it should be removed from this category. -- Beland (talk) 16:54, 11 April 2016 (UTC)

pvrank

The article contains the following text about pvrank:

pvrank is a very recent R package that computes rank correlations and their p-values with various options for tied ranks. It is possible to compute exact Spearman coefficient test p-values for n ≤ 26. Note: Package ‘pvrank’ was removed from the CRAN repository on 2017-04-24 "as check errors were not corrected despite reminders." It is still available in the CRAN archive.

I would argue that we should remove the paragraph, given the (seemingly, as assessed by CRAN) lack of quality of the package.

Also, the package is not "very recent" at all. ~ Jotomicron 11:26, 27 July 2017 (UTC)

Desperately needs simplifying

There's no way a lay reader would understand the definition given at the start. It might be technically accurate but it's in no way sufficient to explain what a Spearman rank correlation is. — Preceding unsigned comment added by 82.12.149.135 (talk) 20:46, 23 October 2017 (UTC)