Cohen's kappa: Difference between revisions


Revision as of 22:12, 17 May 2010

Cohen's kappa coefficient is a statistical measure of inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance. Some researchers (e.g. Strijbos, Martens, Prins, & Jochems, 2006) have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, they consider κ an overly conservative measure of agreement.

Others (e.g., Uebersax, 1987) contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario.

Nevertheless, and despite potentially better alternatives^[1], Cohen's kappa enjoys continued popularity. A possible reason for this is that kappa is, under certain conditions, equivalent to the intraclass correlation coefficient.

Calculation

Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892); see Smeeton (1985).

The equation for κ is:

$\kappa = \frac{\Pr(a)-\Pr(e)}{1-\Pr(e)},$

where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters (other than what would be expected by chance) then κ ≤ 0.
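The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `cohen_kappa` and the convention that `table[i][j]` counts the items placed in category i by the first rater and category j by the second are assumptions made for this sketch.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts.

    Assumed layout (not from the article): table[i][j] is the number of
    items that rater A put in category i and rater B put in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Pr(a): proportion of items on the diagonal, where both raters agree.
    pr_a = sum(table[i][i] for i in range(k)) / n

    # Observed marginal proportions of each category for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Pr(e): probability that two independent raters with these
    # marginals would pick the same category.
    pr_e = sum(row_p[i] * col_p[i] for i in range(k))

    # Undefined (division by zero) when Pr(e) == 1, i.e. when both
    # raters put every item into the same single category.
    return (pr_a - pr_e) / (1 - pr_e)


print(cohen_kappa([[10, 0], [0, 10]]))  # complete agreement
print(cohen_kappa([[5, 5], [5, 5]]))    # agreement at chance level
```

The first call illustrates the κ = 1 case (complete agreement) and the second the κ = 0 case (observed agreement equal to chance agreement).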

The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960.

A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how Pr(e) is calculated.
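The difference in Pr(e) can be made concrete with a short sketch. Cohen's kappa multiplies each rater's own marginal proportions, while Scott's pi squares the proportions averaged over both raters. The function names and the counts below are illustrative assumptions, not taken from either paper.

```python
def chance_agreement(table, method="cohen"):
    """Pr(e) for a square table of counts (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]        # rater A's marginals
    col_p = [sum(col) / n for col in zip(*table)]  # rater B's marginals
    if method == "cohen":
        # Each rater keeps their own category distribution.
        return sum(r * c for r, c in zip(row_p, col_p))
    if method == "scott":
        # Both raters are assumed to share one pooled distribution.
        return sum(((r + c) / 2) ** 2 for r, c in zip(row_p, col_p))
    raise ValueError("method must be 'cohen' or 'scott'")


def agreement_coefficient(table, method="cohen"):
    """Chance-corrected agreement using the chosen Pr(e)."""
    n = sum(sum(row) for row in table)
    pr_a = sum(table[i][i] for i in range(len(table))) / n
    pr_e = chance_agreement(table, method)
    return (pr_a - pr_e) / (1 - pr_e)


counts = [[20, 5], [10, 15]]  # hypothetical counts for illustration
for method in ("cohen", "scott"):
    print(method,
          round(chance_agreement(counts, method), 4),
          round(agreement_coefficient(counts, method), 4))
```

When both raters have identical marginal distributions the two definitions of Pr(e) coincide, so the two coefficients differ only when the raters use the categories at different rates.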

Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement (Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa.

Example

Suppose that you were analyzing data related to people applying for a grant. Each grant proposal was read by two people and each reader either said "Yes" or "No" to the proposal. Suppose the data were as follows, where rows are reader A and columns are reader B:

	Yes	No
Yes	20	5
No	10	15

Note that there were 20 proposals that were granted by both reader A and reader B, and 15 proposals that were rejected by both readers. Thus, the observed percentage agreement is Pr(a)=(20+15)/50 = 0.70.

To calculate Pr(e) (the probability of random agreement) we note that:

Reader A said "Yes" to 25 proposals and "No" to 25 proposals. Thus reader A said "Yes" 50% of the time.
Reader B said "Yes" to 30 proposals and "No" to 20 proposals. Thus reader B said "Yes" 60% of the time.

Therefore the probability that both of them would say "Yes" randomly is 0.50*0.60=0.30 and the probability that both of them would say "No" is 0.50*0.40=0.20. Thus the overall probability of random agreement is Pr(e) = 0.3+0.2 = 0.5.

So now applying our formula for Cohen's Kappa we get:

$\kappa = \frac{\Pr(a)-\Pr(e)}{1-\Pr(e)} = \frac{0.70-0.50}{1-0.50} = 0.40$
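The same calculation can be checked starting from the individual ratings rather than the table. The sketch below is only an illustration: the function name `kappa_from_labels` is made up for this example, and the two lists are constructed so that they reproduce the counts in the table above (20 Yes/Yes, 5 Yes/No, 10 No/Yes, 15 No/No).

```python
from collections import Counter


def kappa_from_labels(ratings_a, ratings_b):
    """Return (Pr(a), Pr(e), kappa) for two equal-length lists of labels."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same items")
    n = len(ratings_a)

    # Pr(a): fraction of items on which the two ratings coincide.
    pr_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Pr(e): sum over categories of the product of the two raters'
    # observed proportions for that category.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    pr_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2

    return pr_a, pr_e, (pr_a - pr_e) / (1 - pr_e)


# Ratings paired item by item, matching the 2x2 table in the example.
reader_a = ["Yes"] * 25 + ["No"] * 25
reader_b = ["Yes"] * 20 + ["No"] * 5 + ["Yes"] * 10 + ["No"] * 15

pr_a, pr_e, kappa = kappa_from_labels(reader_a, reader_b)
print(round(pr_a, 2), round(pr_e, 2), round(kappa, 2))  # 0.7 0.5 0.4
```

The values are rounded before printing because floating-point subtraction does not give exactly 0.40.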

Inconsistent results

One of the problems with Cohen's Kappa is that it does not always produce the expected answer.^[1] For instance, in the following two cases there is equal agreement between A and B (60 out of 100 in both cases), so we would expect the two values of Cohen's Kappa to reflect this. However, calculating Cohen's Kappa for each:

	Yes	No
Yes	45	15
No	25	15

$\kappa ={\frac {0.60-0.54}{1-0.54}}=0.1304$

	Yes	No
Yes	25	35
No	5	35

$\kappa ={\frac {0.60-0.46}{1-0.46}}=0.2593$

we find that κ indicates greater agreement between A and B in the second case than in the first, even though the observed agreement is identical.
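The two cases can be reproduced with a short sketch, which also shows where the difference comes from: Pr(a) is 0.60 in both tables, but the marginal totals give a different Pr(e). The helper name `kappa_parts` is an assumption made for this illustration.

```python
def kappa_parts(table):
    """Return (Pr(a), Pr(e), kappa) for a square table of counts."""
    n = sum(sum(row) for row in table)
    pr_a = sum(table[i][i] for i in range(len(table))) / n
    row_p = [sum(row) / n for row in table]        # rater A's marginals
    col_p = [sum(col) / n for col in zip(*table)]  # rater B's marginals
    pr_e = sum(r * c for r, c in zip(row_p, col_p))
    return pr_a, pr_e, (pr_a - pr_e) / (1 - pr_e)


case_1 = [[45, 15], [25, 15]]
case_2 = [[25, 35], [5, 35]]

for name, table in (("case 1", case_1), ("case 2", case_2)):
    pr_a, pr_e, kappa = kappa_parts(table)
    print(f"{name}: Pr(a)={pr_a:.2f}  Pr(e)={pr_e:.2f}  kappa={kappa:.4f}")
# case 1: Pr(a)=0.60  Pr(e)=0.54  kappa=0.1304
# case 2: Pr(a)=0.60  Pr(e)=0.46  kappa=0.2593
```

In the first table both raters favour "Yes" (60% and 70%), which raises the chance-agreement term; in the second table rater B favours "No" (70%), which lowers it, so the same observed agreement yields a larger κ.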

Significance

Landis and Koch^[1] gave the following table for interpreting $\kappa$ values. This table is, however, by no means universally accepted; Landis and Koch supplied no evidence to support it, basing it instead on personal opinion. It has been noted that these guidelines may be more harmful than helpful^[2], as the number of categories and subjects will affect the magnitude of the value. Kappa will be higher when there are fewer categories.^[3]

$\kappa$	Interpretation
< 0	No agreement
0.00 – 0.20	Slight agreement
0.21 – 0.40	Fair agreement
0.41 – 0.60	Moderate agreement
0.61 – 0.80	Substantial agreement
0.81 – 1.00	Almost perfect agreement
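For convenience, the table can be written as a small lookup. This is only a sketch of the bands as listed, with the caveats noted above about their limited support; the function name is hypothetical, and treating each upper bound as inclusive (so that a value such as 0.205 falls in "Fair agreement") is an assumption, since the table leaves gaps between bands.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the verbal label in the table above."""
    if kappa < 0:
        return "No agreement"
    # (inclusive upper bound, label); bounds taken from the table.
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


for value in (-0.10, 0.13, 0.40, 0.75, 1.00):
    print(value, landis_koch_label(value))
```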

Notes

^ Landis, J. R. and Koch, G. G. (1977) pp. 159–174
^ Gwet, K. (2010). "Handbook of Inter-Rater Reliability (Second Edition)" ISBN 978-0970806222
^ Sim, J. and Wright, C. C. (2005) pp. 257–268

References

^ ^a ^b Kilem Gwet (2002). "Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity". Statistical Methods For Inter-Rater Reliability Assessment. 2: ???. http://www.stataxis.com/files/articles/inter_rater_reliability_dependency.pdf

Banerjee, M. et al. (1999). "Beyond Kappa: A Review of Interrater Agreement Measures" The Canadian Journal of Statistics / La Revue Canadienne de Statistique, Vol. 27, No. 1, pp. 3-23 <http://www.jstor.org/stable/3315487>
Brennan, R. L. and Prediger, D. J. (1981) "Coefficient Kappa: Some Uses, Misuses, and Alternatives" Educational and Psychological Measurement, 41, 687-699.
Cohen, Jacob (1960), A coefficient of agreement for nominal scales, Educational and Psychological Measurement Vol.20, No.1, pp. 37–46.
Fleiss, J.L. (1971) "Measuring nominal scale agreement among many raters." Psychological Bulletin, Vol. 76, No. 5 pp. 378–382
Fleiss, J. L. (1981) Statistical methods for rates and proportions. 2nd ed. (New York: John Wiley) pp. 38–46
Fleiss, J.L. and Cohen, J. (1973) "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability" in Educational and Psychological Measurement, Vol. 33 pp. 613–619
Galton, F. (1892). Finger Prints Macmillan, London.
Gwet, K. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." British Journal of Mathematical and Statistical Psychology, Vol. 61, pp. 29-48 <http://agreestat.com/yahoo_site_admin/assets/docs/Reprint139.11442829.pdf>
Gwet, K. (2008). "Variance estimation of nominal-scale inter-rater reliability with random selection of raters." Psychometrika, Vol. 73, No. 3, pp. 407-430 <http://agreestat.com/yahoo_site_admin/assets/docs/2008_Variance_Estimation_of_Nominal-Scale_Inter-Rater.11443415.pdf>
Gwet, K. (2008). "Intrarater reliability." Wiley Encyclopedia of Clinical Trials, John Wiley & Sons, Inc. <http://agreestat.com/yahoo_site_admin/assets/docs/INTRARATER_RELIABILITYeoct631.11443740.pdf>
Landis, J.R. and Koch, G. G. (1977) "The measurement of observer agreement for categorical data" in Biometrics. Vol. 33, pp. 159–174
Scott, W. (1955). "Reliability of content analysis: The case of nominal scale coding." Public Opinion Quarterly, 17, 321-325.
Sim, J. and Wright, C. C. (2005) "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements" in Physical Therapy. Vol. 85, pp. 257–268
Smeeton, N.C. (1985) "Early History of the Kappa Statistic" in Biometrics. Vol. 41, p.795.
Strijbos, J., Martens, R., Prins, F., & Jochems, W. (2006). Content analysis: What are they talking about? Computers & Education, 46, 29-48.
Uebersax JS. Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin, 1987, 101, 140-146.


@@ Line 144: / Line 144: @@
 * Fleiss, J.L. and Cohen, J. (1973) "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability" in ''Educational and Psychological Measurement'', Vol. 33 pp.&nbsp;613—619
 * Galton, F. (1892). ''Finger Prints'' Macmillan, London.
+* Gwet, K. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." ''British Journal of Mathematical and Statistical Psychology'', Vol. 61, pp.&nbsp;29-48 <http://agreestat.com/yahoo_site_admin/assets/docs/Reprint139.11442829.pdf>
-* Gwet, K. (2001) ''Statistical Tables for Inter-Rater Agreement''. Gaithersburg : StatAxis Publishing)
+* Gwet, K. (2008). "VARIANCE ESTIMATION OF NOMINAL-SCALE INTER-RATER RELIABILITY WITH RANDOM SELECTION OF RATERS." ''Psychometrika'', Vol. 73, No. 3, pp.&nbsp;407-430 <http://agreestat.com/yahoo_site_admin/assets/docs/2008_Variance_Estimation_of_Nominal-Scale_Inter-Rater.11443415.pdf>
+* Gwet, K. (2008). "INTRARATER RELIABILITY." ''Wiley Encyclopedia of Clinical Trials, Copyright 2008 John Wiley & Sons, Inc.'', <http://agreestat.com/yahoo_site_admin/assets/docs/INTRARATER_RELIABILITYeoct631.11443740.pdf>
 * Landis, J.R. and Koch, G. G. (1977) "The measurement of observer agreement for categorical data" in ''Biometrics''. Vol. 33, pp.&nbsp;159—174
 * Scott, W. (1955). "Reliability of content analysis: The case of nominal scale coding." ''Public Opinion Quarterly'', 17, 321-325.