Spearman's rank correlation coefficient: Difference between revisions

From Wikipedia, the free encyclopedia
No edit summary
Gdupont (talk | contribs)
No edit summary
Line 1: Line 1:
In [[statistics]], '''Spearman's rank correlation coefficient''' or '''Spearman's rho''', named after [[Charles Spearman]] and often denoted by the Greek letter [[rho|<math>\rho</math>]] (rho) or as <math>r_s</math>, is a [[non-parametric statistics|non-parametric]] measure of [[correlation]] &ndash; that is, it assesses how well an arbitrary [[monotonic]] function could describe the relationship between two [[variable]]s, without making any assumptions about the [[frequency distribution]] of the [[variables]].

== Calculation ==
In principle, ρ is simply a special case of the [[Pearson product-moment correlation coefficient|Pearson product-moment coefficient]] in which two sets of data <math>X_i</math> and <math>Y_i</math> are converted to [[ranking]]s <math>x_i</math> and <math>y_i</math> before calculating the coefficient.<ref name="myers2003">{{cite book
| last = Myers
| first = Jerome L.
| coauthors = Arnold D. Well
| title = Research Design and Statistical Analysis
| publisher = Lawrence Erlbaum
| year = 2003
| edition = second edition
| isbn = 0805840370
| pages = p. 508
}}</ref> In practice, however, a simpler procedure is normally used to calculate ρ. The [[raw score]]s are converted to ranks, and the differences <math>d_i</math> between the ranks of each observation on the two variables are calculated.

If there are no tied ranks, i.e.
<math>\neg\exists_{i,j} (i\ne j \wedge (X_i=X_j \vee Y_i=Y_j))</math>

then ρ is given by:

:<math> \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}</math>

where:

:<math>d_i = x_i - y_i</math> = the difference between the ranks of corresponding values <math>X_i</math> and <math>Y_i</math>, and

:''n'' = the number of values in each data set (same for both sets).
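The shortcut formula above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names <code>rank_no_ties</code> and <code>spearman_rho_no_ties</code> are hypothetical, and ranks are assigned with 1 for the smallest value (ranking in the opposite direction gives the same ρ as long as both variables are ranked the same way).

```python
def rank_no_ties(values):
    """Return ranks 1..n (1 = smallest value); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho_no_ties(xs, ys):
    """Spearman's rho via 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); only valid without ties."""
    if len(xs) != len(ys):
        raise ValueError("both data sets must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are needed")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("tied values present; the shortcut formula does not apply")
    x_ranks, y_ranks = rank_no_ties(xs), rank_no_ties(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
```

The sketch refuses input with ties rather than silently returning a wrong value, since the formula is only valid in the untied case.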

If tied ranks exist, the classic Pearson [[correlation coefficient]] between the ranks has to be used instead of this formula:<ref name="myers2003"/>

:<math>
\rho=\frac{n(\sum x_iy_i)-(\sum x_i)(\sum y_i)}
{\sqrt{n(\sum x_i^2)-(\sum x_i)^2}~\sqrt{n(\sum y_i^2)-(\sum y_i)^2}}.
</math>
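As a hedged illustration, the product-moment formula above can be evaluated directly on two lists of ranks. The function name <code>pearson_from_ranks</code> is hypothetical, and the ranks (including any averaged ranks for ties) are assumed to have been computed already.

```python
import math


def pearson_from_ranks(x, y):
    """Pearson correlation of two equal-length rank lists, term by term as in the formula."""
    if len(x) != len(y):
        raise ValueError("both rank lists must have the same length")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("rho is undefined when all ranks of one variable are equal")
    return numerator / denominator
```

For untied ranks this should give the same value as the shortcut formula; for example, the ranks (1, 2, 3) and (2, 1, 3) give 0.5 either way.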

Equal values must all be assigned the same rank, namely the average of their positions in the ordered list of values (the table below uses descending order):

'''An example of averaging ranks'''

In the table below, notice how the rank of values that are the same is the mean of what their ranks would otherwise be.

{|class="wikitable"
!Variable <math>X_i</math> !! Position in the descending order !! Rank <math>x_i</math>
|-
|0.8||5||5
|-
|1.2||4||<math>\frac{4+3}{2}=3.5\ </math>
|-
|1.2||3||<math>\frac{4+3}{2}=3.5\ </math>
|-
|2.3||2||2
|-
|18||1||1
|}
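The averaging shown in the table can be sketched as follows. This is a minimal illustration, not a reference implementation; the function name <code>average_ranks</code> and its <code>descending</code> parameter are hypothetical, with the default chosen to match the descending order used in the table.

```python
def average_ranks(values, descending=True):
    """Rank values; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next value equals the first value of the group.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions are 1-based, so the group occupies start+1 .. end+1.
        mean_position = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


print(average_ranks([0.8, 1.2, 1.2, 2.3, 18]))  # [5.0, 3.5, 3.5, 2.0, 1.0]
```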

Spearman's rank correlation coefficient is equivalent to the Pearson correlation computed on ranks. The first formula above is a shortcut for the product-moment form that is valid only when there are no ties (i.e. no equal ranks in either column). The second, product-moment form can be used in both tied and untied cases.

== Example ==
The raw data used in this example is shown below.
The raw data used in this example is shown below, where we want to calculate the correlation between a person's [[IQ]] and the number of hours spent in front of the [[TV]] per week.
{| class="wikitable"
{| class="wikitable"
|-
|-
Line 182: Line 125:
:<math> \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}</math>


which evaluates to <math> \rho = -0.175758</math>. In the case of ties in the original values, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given ranks, as described above).
which evaluates to <math> \rho = -0.175758</math>, which shows that the correlation between IQ and hours spent in front of the TV is very low (almost no correlation). In the case of ties in the original values, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given ranks, as described above).

== Determining significance ==
The modern approach to testing whether an observed value of ρ is significantly different from zero (we will always have 1 ≥ ρ ≥ &minus;1) is to calculate the probability that it would be greater than or equal to the observed ρ, given the [[null hypothesis]], by using a [[Resampling (statistics)|permutation test]]. This approach is almost always superior to traditional methods, unless the [[data set]] is so large that computing power is not sufficient to generate permutations, or unless an algorithm for creating permutations that are logical under the null hypothesis is difficult to devise for the particular case (but usually these algorithms are straightforward).
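A permutation test of this kind might be sketched as below. Several details are assumptions made for the illustration rather than part of the description above: the permutations are sampled at random (a Monte Carlo approximation) instead of being enumerated, the ''p''-value uses the common "add one" correction, the test is one-sided (ρ greater than or equal to the observed value), and all function names are hypothetical.

```python
import math
import random


def _ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end + 2) / 2
        start = end + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_permutation_test(xs, ys, n_permutations=10000, seed=0):
    """Return (observed rho, one-sided Monte Carlo p-value for rho >= observed)."""
    x_ranks, y_ranks = _ranks(xs), _ranks(ys)
    observed = _pearson(x_ranks, y_ranks)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(y_ranks)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under the null hypothesis every pairing of the ranks is equally likely.
        rng.shuffle(shuffled)
        # Small tolerance so floating-point noise does not hide exact ties.
        if _pearson(x_ranks, shuffled) >= observed - 1e-12:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value
```

A two-sided version would compare absolute values of ρ instead. For small ''n'', all ''n''! permutations could be enumerated to obtain an exact value rather than an estimate.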

Although the permutation test is often trivial to perform for anyone with computing resources and programming experience, traditional methods for determining significance are still widely used. The most basic approach is to compare the observed ρ with published tables for various levels of significance. This is a simple solution if the significance only needs to be known within a certain range or less than a certain value, as long as tables are available that specify the desired ranges. A reference to such a table is given below. However, generating these tables is computationally intensive and complicated mathematical tricks have been used over the years to generate tables for larger and larger sample sizes, so it is not practical for most people to extend existing tables.

An alternative approach available for sufficiently large sample sizes is an approximation to the [[Student's t-distribution]]. For sample sizes above about 20, the variable
:<math>t = \frac{\rho}{\sqrt{(1-\rho^2)/(n-2)}}</math>
which can be inverted to give
:<math>\rho = \frac{t}{\sqrt{n-2+t^2}}</math>
approximately follows a Student's t-distribution with ''n'' &minus; 2 degrees of freedom in the null case (zero correlation). In the non-null case (i.e. to test whether an observed ρ is significantly different from a theoretical value, or whether two observed ρs differ significantly) tests are much less powerful, though the ''t''-distribution can again be used.
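The conversion between ρ and ''t'' can be sketched as follows; the function names are hypothetical. The sketch only computes the statistic and its inverse. Turning ''t'' into a ''p''-value would additionally require the cumulative distribution function of Student's ''t'' (from tables or a statistics library), which is not shown here.

```python
import math


def spearman_t_statistic(rho, n):
    """t = rho / sqrt((1 - rho^2) / (n - 2)); undefined for |rho| = 1 or n <= 2."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if abs(rho) >= 1:
        raise ValueError("t is undefined when |rho| equals 1")
    return rho / math.sqrt((1 - rho ** 2) / (n - 2))


def rho_from_t(t, n):
    """Inverse relation: rho = t / sqrt(n - 2 + t^2)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return t / math.sqrt(n - 2 + t ** 2)
```

As a quick check of the arithmetic, ρ = 0.5 with ''n'' = 5 gives ''t'' = 0.5 / √(0.75 / 3) = 1, although a sample that small is well below the size for which the approximation is recommended.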

A generalization of the Spearman coefficient is useful in the situation where there are three or more conditions, a number of subjects are all observed in each of them, and we predict that the observations will have a particular order. For example, a number of subjects might each be given three trials at the same task, and we predict that performance will improve from trial to trial. A test of the significance of the trend between conditions in this situation was developed by E. B. Page and is usually referred to as [[Page's trend test]] for ordered alternatives.

== Correspondence analysis based on Spearman's rho ==
Classic [[correspondence analysis]] is a statistical method which gives a score to every value of two nominal variables, in such a way that Pearson's [[correlation coefficient]] between them is maximized.

There exists an equivalent of this method, called [[grade correspondence analysis]], which maximizes Spearman's rho or [[Kendall's tau]].<ref>{{cite book|last=Kowalczyk|first=T.|coauthors=Pleszczyńska E., Ruland F. (eds.)| year=2004|title=Grade Models and Methods for Data Analysis with Applications for the Analysis of Data Populations|series=Studies in Fuzziness and Soft Computing vol. 151|publisher=Springer Verlag|location=Berlin Heidelberg New York|isbn=9783540211204}}</ref>

==See also==
* [[Kendall tau rank correlation coefficient]]
* [[Rank correlation]]
* [[Chebyshev's sum inequality]], [[rearrangement inequality]] (These two articles may shed light on the mathematical properties of Spearman's ρ.)
* [[Pearson product-moment correlation coefficient]], a similar correlation method that instead relies on the data being linearly correlated.

==External links==
*[http://www.sussex.ac.uk/Users/grahamh/RM1web/Rhotable.htm Table of critical values of ρ for significance with small samples]
*[http://www.wessa.net/rankcorr.wasp Online calculator]
*[http://faculty.vassar.edu/lowry/webtext.html Chapter 3 part 1 shows the formula to be used when there are ties]
*[http://udel.edu/~mcdonald/statspearman.html Spearman's rank correlation]: Simple notes for students with an example of usage by biologists and a spreadsheet for [[Microsoft Excel]] for calculating it (a part of materials for a ''Research Methods in Biology'' course).

== References ==
<div class="references-small">
<references />
* C. Spearman, "The proof and measurement of association between two things" Amer. J. Psychol. , 15 (1904) pp. 72–101
* M.G. Kendall, "Rank correlation methods" , Griffin (1962)
* M. Hollander, D.A. Wolfe, "Nonparametric statistical methods" , Wiley (1973)
</div>

{{Statistics}}

[[Category:Covariance and correlation]]
[[Category:Statistical dependence]]
[[Category:Statistical tests]]
[[Category:non-parametric statistics]]

[[de:Rangkorrelationskoeffizient]]
[[es:Coeficiente de correlación de Spearman]]
[[it:Coefficiente di correlazione per ranghi di Spearman]]
[[he:מתאם ספירמן]]
[[lv:Spīrmena rangu korelācijas koeficients]]
[[nl:Spearmans rangcorrelatiecoëfficiënt]]
[[ja:スピアマンの順位相関係数]]
[[pl:Współczynnik korelacji rangowej Spearmana]]
[[pt:Coeficiente de correlação de postos de Spearman]]

Revision as of 17:49, 20 July 2008

Example

The raw data used in this example is shown below, where we want to calculate the correlation between a person's IQ and the number of hours spent in front of the TV per week.

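{| class="wikitable"
|-
! IQ, <math>X_i</math> !! Hours of TV per week, <math>Y_i</math>
|-
| 106 || 7
|-
| 86 || 0
|-
| 100 || 27
|-
| 101 || 50
|-
| 99 || 28
|-
| 103 || 29
|-
| 97 || 20
|-
| 113 || 12
|-
| 112 || 6
|-
| 110 || 17
|}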

The first step is to sort this data by the second column. Next, two more columns are created (<math>x_i</math> and <math>y_i</math>). The last of these columns (<math>y_i</math>) is assigned 1,2,3,...''n'', and then the data is sorted by the first original column (<math>X_i</math>). The first of the newly created columns (<math>x_i</math>) is assigned 1,2,3,...''n''. Then a column <math>d_i</math> is created to hold the differences between the two rank columns (<math>x_i</math> and <math>y_i</math>). Finally another column <math>d^2_i</math> should be created. This is just column <math>d_i</math> squared.

After doing this process with the example data you should end up with something like:

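{| class="wikitable"
|-
! IQ, <math>X_i</math> !! Hours of TV per week, <math>Y_i</math> !! rank <math>x_i</math> !! rank <math>y_i</math> !! <math>d_i</math> !! <math>d^2_i</math>
|-
| 86 || 0 || 1 || 1 || 0 || 0
|-
| 97 || 20 || 2 || 6 || &minus;4 || 16
|-
| 99 || 28 || 3 || 8 || &minus;5 || 25
|-
| 100 || 27 || 4 || 7 || &minus;3 || 9
|-
| 101 || 50 || 5 || 10 || &minus;5 || 25
|-
| 103 || 29 || 6 || 9 || &minus;3 || 9
|-
| 106 || 7 || 7 || 3 || 4 || 16
|-
| 110 || 17 || 8 || 5 || 3 || 9
|-
| 112 || 6 || 9 || 2 || 7 || 49
|-
| 113 || 12 || 10 || 4 || 6 || 36
|}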

The values in the <math>d^2_i</math> column can now be added to find <math>\sum d_i^2 = 194</math>. The value of ''n'' is 10. So these values can now be substituted back into the equation,

:<math> \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}</math>

which evaluates to <math> \rho = -0.175758</math>, which shows that the correlation between IQ and hours spent in front of the TV is very low (almost no correlation). In the case of ties in the original values, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given ranks, as described above).
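The worked example can be checked with a short script. This is only a sketch under the assumptions of the example (no tied values, rank 1 given to the smallest value); the helper name <code>ranks</code> and the variable names are hypothetical.

```python
# Raw data from the example: IQ and hours of TV per week for ten people.
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Ranks 1..n with 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


x_rank = ranks(iq)
y_rank = ranks(tv_hours)

# Rank differences and their squares for each person.
d = [a - b for a, b in zip(x_rank, y_rank)]
sum_d2 = sum(value * value for value in d)

n = len(iq)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2)         # 194
print(round(rho, 6))  # -0.175758
```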