Talk:Sensitivity and specificity

This article is of interest to the following WikiProjects:

WikiProject Statistics (Rated C-class, Mid-importance)
This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.
This article has been rated as C-Class on the quality scale.
This article has been rated as Mid-importance on the importance scale.

WikiProject Medicine (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Medicine, which recommends that medicine-related articles follow the Manual of Style for medicine-related articles and that biomedical information in any article use high-quality medical sources. Please visit the project page for details or ask questions at Wikipedia talk:WikiProject Medicine.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Mid-importance on the project's importance scale.


Request merge of "negative predictive value", "positive predictive value" and "Sensitivity and specificity". These terms are intimately related and should be in one place, possibly with a discussion of ROC. Further, suggest modeling on <>> this is a great exposition of this complicated stuff. Cinnamon colbert (talk) 22:25, 16 November 2008 (UTC) PS: the three articles are a great start

I also think that the entries on precision and recall should be somehow linked to this page. Recall is the same thing as sensitivity; furthermore, specificity is merely your recall with regard to negative data points. (Wyatt) —Preceding unsigned comment added by (talk) 15:17, 5 November 2010 (UTC)
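Wyatt's observation can be checked with a few lines of Python. This is only a minimal sketch: the `recall` helper and the toy labels below are made up for illustration and do not come from the article. It computes recall once with the positive class as the target and once with the negative class as the target.

```python
def recall(y_true, y_pred, positive):
    """Fraction of items whose true label is `positive` that were predicted `positive`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not relevant:
        raise ValueError("no items with the requested true label")
    return sum(p == positive for p in relevant) / len(relevant)

# Hypothetical toy data: 1 = condition present, 0 = condition absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sensitivity = recall(y_true, y_pred, positive=1)  # recall on the positive class
specificity = recall(y_true, y_pred, positive=0)  # recall on the negative class

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

Precision, by contrast, conditions on the predicted label rather than the true one, so it corresponds to positive predictive value, not to specificity.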

action-required: new-image

OK, I give up. Adding Images to Wiki is a nightmare.

I made a new image for this page that I think is more intuitive.

Someone with more edits to their credit (most of mine have been made anon due to laziness) should add that image to the page. Cheers, -- Dave — Preceding unsigned comment added by Ddopson (talk | contribs) 18:12, 6 September 2013 (UTC)

Merger proposal

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
Decision was merge

I don't believe I can be the only person who thinks it would be better to have a single page for sensitivity and specificity than separate pages for Sensitivity (tests) and Specificity (tests). At present Sensitivity and specificity is a redirect to Binary classification. One section of that, Binary classification#Evaluation of binary classifiers is covering the same ground again. One possibility would be to locate the merged page here at Sensitivity and specificity, replacing the redirect. Binary classification#Evaluation of binary classifiers could then have a "main article on this topic: ..." link to here too. Thoughts? --Qwfp (talk) 08:28, 28 February 2008 (UTC)

I've just realised (thanks to WhatamIdoing) that Test sensitivity should also be included in this discussion. (There's no corresponding test specificity article thank goodness.) --Qwfp (talk) 08:55, 28 February 2008 (UTC)

I agree with merging the two articles as the interpretation of one of them needs the other one also. —Preceding unsigned comment added by (talk) 04:36, 11 March 2008 (UTC)

I also agree with merging the discussed articles; I think one should know things about both values. —Preceding unsigned comment added by (talk) 15:28, 20 March 2008 (UTC)

I also agree with the proposal. Most of the time these two notions are taught, calculated and used as a pair.--[16:36, 21 March 2008 (UTC)

I concur. —Preceding unsigned comment added by (talk) 17:54, 29 March 2008 (UTC)

When describing medical diagnostic tests, sensitivity and specificity always appear as a pair. I am all for merging the two articles. —Preceding unsigned comment added by (talk) 01:04, 7 April 2008 (UTC)

I'm going to go one step further and say that it doesn't make sense to talk about either sensitivity or specificity without the other. Why make it hard on the user by cross-referencing them instead of just putting them together? -- Jon Miller —Preceding comment added by Sighthndman (talk) 16:53, 2 May 2008 (UTC)

I absolutely agree with the previous comments. It does not make sense to talk about one without the other. —Preceding unsigned comment added by (talk) 18:07, 9 May 2008 (UTC)

Make it so. —Preceding unsigned comment added by (talk) 18:37, 23 May 2008 (UTC)

Yes I completely agree —Preceding unsigned comment added by (talk) 15:03, 31 May 2008 (UTC)

Okay, merging done! Still needs a bit of work though. For the old talk pages, see Talk:Sensitivity (tests) and Talk:Specificity (tests). -3mta3 (talk) 11:55, 12 June 2008 (UTC)

There is an error in the text. A test that is high in sensitivity has a low type I error, not type II, and vice versa. It was recorded wrong, but I caught it after much thought. —Preceding unsigned comment added by (talk) 15:26, 14 June 2010 (UTC)

The above discussion is closed. Please do not modify it. Subsequent comments should be made in a new section.


Common usage should play no role in the article name, as normally-present concerns (e.g. unfindability) are compensated for by redirects. The same holds true with respect to layman's terminology (gingiva vs. gums). By reversing the order, the article inverts intuitiveness by presenting and elucidating that which relates to type 2 errors prior to that which relates to type 1 errors. DRosenbach (Talk | Contribs) 12:40, 1 October 2009 (UTC)

The Wikipedia:Naming conventions policy, specifically the section Wikipedia:Naming conventions#Use common names, says otherwise: "Articles are normally titled using the most common English-language name of a person or thing that is the subject of the article". 'Gingiva' is different as 'Gums' is ambiguous (see Gum) so the next section Wikipedia:Naming conventions#Be precise when necessary comes into play, but in the case of this article neither order is more or less ambiguous or precise than the other. Presenting specificity first because it relates to Type I error is more logical only to those who already know about Type I and Type II errors. To most non-statisticians, these terms are more confusing and less familiar than sensitivity and specificity. Qwfp (talk) 07:22, 3 October 2009 (UTC)

Specificity/Sensitivity in Bioinformatics

In bioinformatics literature, the terms 'specificity' and 'sensitivity' are used, but are different from what is shown in this article. The terms are used as synonyms for precision and recall. In other words, sensitivity is used correctly, but specificity seems to be used for positive predictive value. I haven't been able to find a source explicitly exposing this difference, though, so I haven't edited the article to include this caveat. —Preceding unsigned comment added by Kernco (talk | contribs) 17:53, 9 April 2010 (UTC)

In fact, the terms 'specificity' and 'sensitivity' have the same meaning in the bioinformatics literature. Maybe some authors used the wrong formula in a bioinformatics paper. Someone should point out the error in a comment.

I suspect that in being virtual, informatics has issues with what is the real and correct (ground) truth when operationally defining what are "actual positives" and "actual negatives". Perhaps a section on informatics definitions would be beneficial in this article. Zulu Papa 5 * (talk) 02:11, 6 November 2010 (UTC)

In the year since I posted this and continued my PhD research in bioinformatics, it has become apparent to me that there's no consensus in the community on what the formula for specificity should be, though my impression is still that the most common usage of specificity in the literature is as a synonym for positive predictive value. It's probably just a case of a wrong usage in the past propagating forward, since you must use the same evaluation metrics to compare your own results with previous ones, so I'm not pushing for this page to be changed in any way. I think it's an interesting anomaly, though. Kernco (talk) 20:34, 19 April 2011 (UTC)

Suggested edits to be made under 'Specificity'

Suggested edits to be made under 'Specificity':

Hi, I'm new to editing here and don't know how to do it properly. However, I hope that somebody who cares can see if the following edits help. Because I'm unfamiliar with the 'code' used by Wikipedia, please accept my improvised 'formatting', which I've devised as follows.

  • What is in italics is the original text which should be deleted.
  • What is in bold is what I think should be added.

Rgds email kingnept(at)

Specificity: "...A specificity of 100% means that the test recognizes all actual negatives - for example in a test for a certain disease, all healthy disease free people will be recognized as healthy disease free. Because 100% specificity means no positives disease free persons are erroneously tagged as diseased. , a positive result in a A high specificity test is normally used to confirm the disease. The maximum can trivially be achieved by a test that claims everybody healthy regardless of the true condition. Unfortunately a 100%-specific test standard can also be ascribed to a 'bogus' test kit whereby nobody, not even those who are truly diseased, ever get tagged as diseased. Therefore, the specificity alone does not tell us how well the test recognizes positive cases. We also need to know the sensitivity of the test. A test with a high specificity has a low type I error rate." —Preceding unsigned comment added by (talk) 21:49, 9 July 2010 (UTC)


Postscript: IMHO: 'sensitivity' and 'specificity' are both VERY BAD 'misnomers' that serve confusion rather than clarity. The more accurate description of each would be 'True-Positive-Rate' and 'True-Neg-Rate' respectively; alas, the terms 'sensitivity' and 'specificity' seem poor colloquialisms that have long served to confuse and mislead. Too bad that they now seem to be convention, but perhaps someone with clout and commitment should nonetheless clarify this ambiguity. (talk) 00:48, 10 July 2010 (UTC) Rgds, Kingnept 119.74.145.154 (talk) 00:48, 10 July 2010 (UTC)

Coming to this article by searching for False positive I too wonder if there is scope for describing the simplest concepts in terms that will make obvious sense to an interested layman? False positive, false negative, are both important and if grasped do allow the layperson to understand quite a lot of the more important issues in real life. I would hope to see them in the lede. Richard Keatinge (talk) 18:59, 6 August 2012 (UTC)

Simple summary or interpretation

I tried to summarise with

"==Medical example==
Eg. a medical diagnostic criteria quoted as having sensitivity = 43% and specificity = 96%
means that 43% of the people with the criteria have the disease, and 96% of the people without
the criteria do not have the disease. Hence 'sensitive' tests help confirm a diagnosis whilst
'specific' tests help exclude a diagnosis."

but it seems to conflict with the Worked Example. Can anyone confirm or correct my summary above, please? Does it mean 43% of the people with the disease have the criteria and 96% of the people without the disease do not have the criteria? If that's the case, wouldn't we more usefully characterise tests by PPV and NPV rather than sensitivity and specificity? Rod57 (talk) 13:33, 17 August 2010 (UTC)

I believe the point is in the context of how the test is to be applied. Sensitivity and specificity are the standard route to Receiver operating characteristics, which are great for optimizing diagnostic performance but say little about how they are applied to benefit a decision. The predictive values are focused on the test outcome, while the Sens. and Spec. are a description of the quality of the test as to its performance as a standard. The example given means to illustrate that a test may be good at "confirming" or "excluding" a diagnosis. In this example, 96% specificity means the test is better at excluding. How the diagnosis is framed to benefit sets the next question with this given test. Zulu Papa 5 * (talk) 02:23, 6 November 2010 (UTC)

[First-time user] The above summary confuses Sensitivity/Specificity with Positive Predictive Value (PPV)/Negative Predictive Value (NPV). Correction of the above summary: A medical test is 34% sensitive; if we are given a person with the condition, then the test has a 34% chance of being positive. A medical test is 69% specific; if we are given a person without the condition, then the test has a 69% chance of being negative.

A test has a 43% PPV; if we are given a positive test, then there is a 43% chance that the person has the condition. If we are given 100 positive tests (one per person), it is likely that 43% of the people have the condition. A test has a 96% NPV; if we are given a negative test, then there is a 96% chance that the person does not have the condition. If we are given 100 negative tests, it is likely that 96% of the people do not have the condition.

The original page also confuses Specificity with PPV, with the following: "If a test has high specificity, a positive result from the test means a high probability of the presence of disease.[1]" It should read, "For a test with a high specificity, if we are given a person without the condition then there is a high chance that the test is negative." The original page must be corrected, by an experienced editor. No sources to cite. (talk) 09:25, 2 November 2011 (UTC)
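The distinction drawn above, between conditioning on the true condition (sensitivity, specificity) and conditioning on the test result (PPV, NPV), can be made concrete with a small Python sketch. The counts used here are an assumption: they are back-calculated from the worked-example figures quoted elsewhere on this page (30 of 2030 people with the disease, sensitivity 66.7%, specificity 91%), and the function name is purely illustrative.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Compute the four headline measures from a 2x2 table of counts."""
    return {
        # Conditioned on the TRUE condition:
        "sensitivity": tp / (tp + fn),  # P(test positive | condition present)
        "specificity": tn / (tn + fp),  # P(test negative | condition absent)
        # Conditioned on the TEST result:
        "ppv": tp / (tp + fp),          # P(condition present | test positive)
        "npv": tn / (tn + fn),          # P(condition absent | test negative)
    }

# Assumed counts, inferred from the figures quoted on this talk page.
summary = diagnostic_summary(tp=20, fp=180, fn=10, tn=1820)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these counts the same test has a specificity of 91% but a PPV of only 10%, which is why the two should not be used interchangeably.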

Sufficient sample size for sensitivity?

Responding to this comment here:

Hence with large numbers of false positives and few false negatives, a positive FOB screen test is in itself poor at confirming cancer (PPV = 10%) and further investigations must be undertaken, it will, however, pick up 66.7% of all cancers (the sensitivity).

In: the worked example.

Only three people were tested, so if the test were done on, let's say, 100 people with bowel cancer, then maybe there would be a different proportion than 66.7%.

So is it correct to say "the sensitivity of the test is 66.7%"? Wouldn't we need to test it on more people who have bowel cancer?

Although perhaps we could have said something like "the sample sensitivity is 66.7%" as contrasted with the theoretical sensitivity.

At least Wolfram's MathWorld calls "sensitivity" the probability that a positive value tests positive -- so we may not have enough samples to get an estimate of the probability.

MathWorld entry for sensitivity Jjjjjjjjjj (talk) 08:24, 8 March 2011 (UTC)

Although there were only three people who tested positive, the test was done on all two hundred and three people. A larger number would give you a better estimate of the probability - I think the most you can say is that although the sensitivity appears to be 66.7%, the confidence limits on that figure would necessarily be very wide. A larger sample may or may not show a changed figure, but the more cases, the narrower the confidence interval should be. I guess when talking about a test, the numbers come from samples of the whole population and are always an estimate. Egmason (talk) 23:47, 30 March 2011 (UTC)
I removed the clarify tag -- it looks like somebody changed the numbers around so that it is now thirty people who have bowel cancer rather than only three people. I just changed the wording around a little bit consistent with the idea that one study does not necessarily determine the performance of the particular test. There may be further uncertainty about the performance.
Jjjjjjjjjj (talk) 05:20, 10 May 2011 (UTC)
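The point about interval width can be illustrated with a short Python sketch. It uses the Wilson score interval, which is only one of several possible binomial intervals and is chosen here as an assumption, not because the thread specifies it. The counts 2 of 3 and 20 of 30 are inferred from the 66.7% figure and the sample sizes mentioned above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Same observed sensitivity (66.7%), very different amounts of evidence.
small = wilson_interval(2, 3)     # 2 of 3 cancers detected
larger = wilson_interval(20, 30)  # 20 of 30 cancers detected

print("n = 3:  (%.3f, %.3f)" % small)
print("n = 30: (%.3f, %.3f)" % larger)
```

Both samples give the same point estimate, but the interval from three cases spans most of the range from 0 to 1, so "the sample sensitivity is 66.7%" is about all that can be said from it.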

Confidence intervals

Would it be useful to have something about confidence intervals (or credible intervals for Bayesians) here? You would presumably calculate the Binomial proportion confidence interval, or would find the posterior distribution given an uninformative Beta(1,1) prior. In the given example you would compute a 95% confidence interval for the sensitivity of (0.5060, 0.8271) and a 95% credible interval of (0.4863, 0.8077). You would compute a 95% confidence interval for the specificity of (0.8971, 0.9222) and 95% credible interval of (0.8967, 0.9218). These calculations assume that prevalence is exactly that in the data. Ts4079 (talk) 14:42, 1 November 2011 (UTC)
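As a rough illustration of the Bayesian calculation described above, the following Python sketch approximates an equal-tailed 95% credible interval for the sensitivity under a Beta(1,1) prior by sampling from the posterior. The counts (20 detected, 10 missed) are assumed from the worked-example figures quoted on this page, and the Monte Carlo approach, function name, and seed are illustrative choices; an exact answer would use the Beta quantile function instead.

```python
import random

def beta_credible_interval(successes, failures, level=0.95, draws=100_000, seed=0):
    """Monte Carlo estimate of an equal-tailed credible interval.

    With a Beta(1, 1) prior and binomial data, the posterior is
    Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + successes, 1 + failures) for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper

# Assumed counts: 20 of 30 cancers detected.
lower, upper = beta_credible_interval(successes=20, failures=10)
print(f"approx. 95% credible interval for sensitivity: ({lower:.3f}, {upper:.3f})")
```

Because this is a simulation, the endpoints vary slightly with the seed and the number of draws.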

Such confidence intervals could be easily misinterpreted since sensitivity and specificity are often closely related, as demonstrated in the ROC curve. Something more advanced is probably necessary. Ts4079 (talk) 15:14, 1 November 2011 (UTC)

Maybe if it had sources? Zulu Papa 5 * (talk) 02:10, 2 November 2011 (UTC)

Denominators of definitions

The section Sensitivity provides the following definition:

I do not understand how the denominator "number of positives" is related to the denominator that precedes it: "number of true positives + number of false negatives". I thought the number of positives would instead equal the number of true positives + number of false positives. Thus, I believe this to be a typo and that "number of positives" should be replaced with something like "number of ill people", which is number of true positives + number of false negatives.

As the page currently stands, it appears that the first and last expressions in this three part equation correspond to "sensitivity" as defined here. The middle expression, which I am questioning, instead appears to correspond to the Positive predictive value as defined on that page. (The above talk section on Sensitivity in Bioinformatics suggests that sometimes the word "sensitivity" is used for positive predictive value. While that might be true, I think we should not switch between the two definitions mid equation.)

I have a similar concern about the definition of "specificity" in the section Specificity. There I believe that the denominator "number of negatives" should be something like "number of well people".

Mikitikiwiki (talk) 00:27, 18 June 2013 (UTC)

I think this is a valid point; both the sensitivity and specificity equations use "positive" to mean either a positive test result (which is true or false relative to the population ground truth) or a positive (i.e., actual) occurrence of illness in a member of the population, without sufficiently distinguishing between the two meanings. I agree that changing "number of positives" to something like "number of ill people" or "number of actual positives" (and making an analogous change for the specificity equation) would clarify matters. (talk) 19:12, 18 June 2013 (UTC)
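For reference, the distinction discussed in this section can be written out as follows. This is a sketch of the standard definitions using the conventional abbreviations TP, FP, FN and TN, not a quotation of the article's wording at the time.

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{\text{number of actual positives}},
\qquad
\text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{\text{number of actual negatives}}
\]
\[
\text{PPV} = \frac{TP}{TP + FP} = \frac{TP}{\text{number of positive test results}}
\]
```

Reading "number of positives" as positive test results turns the sensitivity denominator into the PPV denominator, which is the ambiguity raised above.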

How "highly" sensitive or specific does a test have to be before "SNOUT" and "SPIN" apply?

The article states (under the "Sensitivity" and "Specificity" sections respectively) that "negative results in a high sensitivity test are used to rule out the disease" (referred to by the mnemonic "SNOUT" later in the article) and that "a positive result from a test with high specificity means a high probability of the presence of disease" (described by the mnemonic "SPIN"). However, the example calculation (for a test with a specificity of 91% and a sensitivity of 67%) demonstrates a case in which a positive result from a high specificity test (SPIN) clearly does not correspond to a high probability of the presence of disease (PPV is 10% in the example). Although this depends on prevalence, it seems to indicate that the numerous SNOUT/SPIN-type statements throughout the article are inaccurate as written. (Another such statement is in the "Medical Examples" section, which states that "[a] highly specific test is unlikely to give a false positive result: a positive result should thus be regarded as a true positive".)

Accordingly, I think these statements should be modified to give some idea of exactly how high sensitivity/specificity have to be before SNOUT/SPIN apply (as noted above, the worked example uses a specificity of 91%, which is fairly high, yet a positive test result clearly does not correspond to a high probability of the disease being present) and note the effect of disease prevalence on these assertions or be removed entirely. (talk) —Preceding undated comment added 19:40, 18 June 2013 (UTC)
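The dependence on prevalence noted above follows directly from Bayes' theorem, as this short Python sketch shows. The sensitivity (20/30) and specificity (0.91) are taken from the worked-example figures cited in this section; the prevalence values other than 30/2030 are arbitrary illustrations.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

sens, spec = 20 / 30, 0.91

# 30/2030 is the prevalence in the worked example; the others are hypothetical.
for prevalence in (0.001, 30 / 2030, 0.10, 0.50):
    print(f"prevalence {prevalence:6.3f} -> PPV {ppv(sens, spec, prevalence):.3f}")
```

At the worked-example prevalence the PPV is 10%, while at a prevalence of 50% the same test gives a PPV of about 88%, so a "SPIN" statement cannot be made from specificity alone.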

Type I/II errors, and FP and FN's

It seems to me that the big table near the top of the page has Type I and Type II errors reversed. According to other sources (both my stats book and the explicit links to Type I and Type II errors in the table itself), a Type I error is "test rejects when hypothesis is actually true", in other words a false negative; and a Type II error is "accept when actually false", i.e., a false positive. This is exactly backwards from where Type I and Type II are located. I think the table should be corrected, but I'll leave it to the experts.

-- Wayne Hayes Associate Professor of Computer Science, UC Irvine.

PS: This is exactly why I *HATE* the terms "type I" and "type II". Let's just call them false positives and false negatives, for Pete's sake!! (Yeah, I'm trying to change decades of nomenclature here... :-)

You are wrong and the article is right. In the phrase "test rejects when hypothesis is actually true", the hypothesis refers to the null hypothesis, i.e. that there is nothing special going on. Rejecting the null hypothesis means triggering an alarm, i.e. saying that something special is going on. Therefore this is the false positive (false alarm) case.
These are the quirks of statistical hypothesis testing. They all have things kind of "backwards" as they use frequentist concepts as opposed to Bayesian ones. Qorilla (talk) 16:13, 6 May 2015 (UTC)
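Qorilla's mapping can be spelled out in a small Python sketch. It assumes the usual diagnostic-testing convention that the null hypothesis is "condition absent"; the counts are again the assumed worked-example figures, and the function name is illustrative.

```python
def error_rates(tp, fp, fn, tn):
    """Return (type I rate, type II rate), taking the null hypothesis as 'condition absent'."""
    # Type I error: null is true (no condition) but the test rejects it -> false positive.
    type_i_rate = fp / (fp + tn)   # equals 1 - specificity
    # Type II error: null is false (condition present) but the test does not reject it -> false negative.
    type_ii_rate = fn / (fn + tp)  # equals 1 - sensitivity
    return type_i_rate, type_ii_rate

type_i, type_ii = error_rates(tp=20, fp=180, fn=10, tn=1820)
print(f"type I (false positive) rate:  {type_i:.3f}")
print(f"type II (false negative) rate: {type_ii:.3f}")
```

Under this convention a highly specific test has a low type I error rate and a highly sensitive test has a low type II error rate.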

complement or dual? TruePos vs FalseNeg

The primary def'ns currently say "Sensitivity (also called the true positive rate) is complementary to the false negative rate." -- I take complement to mean "sums to completion (i.e., 1)" which is "not" in probability. I'm not sure if True Positive is a dual to False Negative, but that sounds much more plausible. not-just-yeti (talk) 05:32, 14 July 2015 (UTC)

Update: hearing no dissent, I've removed those comments. Presumably 'dual' is meant, and if so that's a detail that's not appropriate for the opening, defining sentences. not-just-yeti (talk) 18:35, 19 July 2015 (UTC)

I just added to the lead a comment tying in FPs and FNs, but without reference to either complementarity or duality.—PaulTanenbaum (talk) 21:21, 1 September 2015 (UTC)
Interesting, maybe best to look for sources about ROC Exclusive_or, possibility a question frame and/or relative/ultimate truth issue to clarify the classifier. Sounds like a subject object issue between receiver/operator nor operator/receiver. N={0,1} Zulu Papa 5 * (talk) 21:48, 1 September 2015 (UTC)

Circularity about selectivity

I came to Wikipedia to learn about a notion of selectivity that seems to be related to sensitivity and specificity. Sadly, nothing I've found in Wikipedia on the matter is at all helpful. The disambiguation page points to articles on binding selectivity and functional selectivity, but neither of them clearly defines the (presumably more fundamental) term selectivity. The former article does point to the article on the reactivity-selectivity principle, but it too is fairly obtuse on exactly what selectivity itself is.

My last hope from the disambiguation page was the link to this article on sensitivity and specificity. But, wouldn't you know it, the only occurrence here of the term selectivity is under the "See also" and links to (you guessed it) the disambiguation page!

Have the evil gremlins carefully crafted this self-referential knot of definitional vacuity?

I'd cut the Gordian knot if I knew the topic well enough to do so, but recall that what led me to discover the knot in the first place was my attempt to learn something about selectivity.

Would somebody please mend this tiny corner of Wikipedia.

PaulTanenbaum (talk) 21:02, 1 September 2015 (UTC)

Is Feature selection relevant here? Typically wiki just wants to educate folks in practicing to edit. Articles always require improvement with sources, so look at sources (existing and new) to find your real education, then contribute. Zulu Papa 5 * (talk) 21:56, 1 September 2015 (UTC)

Re. Worked Example

"However as a screening test, a negative result is very good at reassuring that a patient does not have the disorder (NPV = 99.5%)"

There are only 30 out of 2030 who have the disease. If I use a test whereby I tell everyone that they are negative and do not have the disease, I will be correct 98.5% of the time. How then is "a negative result ... very good at reassuring that a patient does not have the disorder (NPV=99.5%)", when I can do just about as well by just telling everyone they are negative? — Preceding unsigned comment added by (talk) 01:02, 10 December 2015 (UTC)

  • That's the point of the test. In reality you don't know whether people who don't have the disease actually don't, and that's the point of the test. Maybe you're confusing PPV with specificity? In other words, having a negative result is "trustworthy", but this tells you nothing on how the test performs with e.g. diseased people, or what you say to the patient if the result is positive. --Cpt ricard (talk) 07:48, 23 October 2016 (UTC)
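The comparison raised in this thread can be worked through numerically. The sketch below contrasts a rule that calls everyone negative with the screening test, using counts assumed from the figures quoted above (30 of 2030 diseased, sensitivity 66.7%, specificity 91%); the helper name is illustrative.

```python
def rates(tp, fp, fn, tn):
    """Return (accuracy, sensitivity, NPV) for a 2x2 table of counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    npv = tn / (tn + fn)
    return accuracy, sensitivity, npv

# Rule that tells everyone they are negative: every diseased person is missed.
always_negative = rates(tp=0, fp=0, fn=30, tn=2000)
# Screening test, counts assumed from the worked-example figures.
screening_test = rates(tp=20, fp=180, fn=10, tn=1820)

for label, (acc, sens, npv) in (("always negative", always_negative),
                                ("screening test", screening_test)):
    print(f"{label}: accuracy {acc:.4f}, sensitivity {sens:.3f}, "
          f"disease rate among negatives {1 - npv:.4f}")
```

The always-negative rule is right about 98.5% of the time but detects no cases, whereas a negative screening result lowers the chance of disease from about 1.5% to about 0.5%.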

Order of "True Condition" and "Predicted Condition" On Confusion Matrix

I would suggest reversing the order of the "True Condition" and "Predicted Condition" on the confusion matrix diagram. While there is certainly no standard or formal way to create a confusion matrix, in my experience such a matrix significantly more often has the "Predicted Condition" on the left and the "True Condition" on the top. See, for instance, James Jekel, Epidemiology, Biostatistics, and Preventative Medicine, pp. 108-109. The current diagram has the reverse, with the "Predicted Condition" on the top and the "True Condition" on the left. In my opinion, the current arrangement is the minority view, and may confuse some readers. Edit: Upon further research, there seem to be differing conventions between the computer science/machine learning ordering of the confusion matrix and the orderings coming out of biology, statistics, and medicine. Computer science seems to put the predicted class on top, whereas the others tend to put the true state on top. — Preceding unsigned comment added by (talk) 20:19, 4 November 2016 (UTC)