Gold standard (test)

From Wikipedia, the free encyclopedia

In medicine and statistics, a gold standard test is usually the diagnostic test or benchmark that is the best available under reasonable conditions.[1] Other times, a gold standard is the most accurate test possible, without restrictions.

The two meanings can differ. In medicine, for example, some conditions can be diagnosed with certainty only at autopsy; in that case the gold standard test in the first sense is the best test that can be performed on a living patient, not the autopsy.

In medicine

"Gold standard" can refer to the criteria by which scientific evidence is evaluated. For example, in resuscitation research, the "gold standard" test of a medication or procedure is whether it increases the number of neurologically intact survivors who walk out of the hospital.[2] Other types of medical research might regard a significant decrease in 30-day mortality as the gold standard.

The AMA Style Guide has preferred the phrase "criterion standard" instead of "gold standard". Other journals have issued similar mandates in their instructions for contributors; for instance, Archives of Physical Medicine and Rehabilitation specifies this usage.[3] When the criterion is a whole clinical testing procedure, it is usually referred to as a clinical case definition.

In practice, however, uptake of this term by authors, as well as enforcement by editorial staff, is notably poor, at least for AMA journals.[4]

A hypothetical ideal "gold standard" test has a sensitivity of 100% with respect to the presence of the disease (it identifies all individuals with a well-defined disease process, so it has no false-negative results) and a specificity of 100% (it never identifies someone who does not have the condition as having it, so it has no false-positive results). In practice, a true gold standard test sometimes does not exist.[5]
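Both percentages can be computed from paired results once each individual's true status is known from a reference standard. The following Python sketch assumes boolean test results and boolean reference labels; the names `ConfusionCounts`, `count_outcomes`, `sensitivity` and `specificity` are illustrative, not taken from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing a test against a reference (assumed gold standard)."""
    true_positive: int   # test positive, condition present
    false_negative: int  # test negative, condition present
    true_negative: int   # test negative, condition absent
    false_positive: int  # test positive, condition absent


def sensitivity(c: ConfusionCounts) -> float:
    """Share of people with the condition whom the test flags: TP / (TP + FN)."""
    return c.true_positive / (c.true_positive + c.false_negative)


def specificity(c: ConfusionCounts) -> float:
    """Share of people without the condition whom the test clears: TN / (TN + FP)."""
    return c.true_negative / (c.true_negative + c.false_positive)


def count_outcomes(test_results: list[bool], condition_present: list[bool]) -> ConfusionCounts:
    """Tally paired results; condition_present comes from the reference standard."""
    if len(test_results) != len(condition_present):
        raise ValueError("inputs must be the same length")
    tp = fn = tn = fp = 0
    for positive, present in zip(test_results, condition_present):
        if present and positive:
            tp += 1
        elif present:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fn, tn, fp)


# Hypothetical data: an ideal test agrees with the reference on every case.
reference = [True, True, False, False, False]
ideal = list(reference)
counts = count_outcomes(ideal, reference)
print(sensitivity(counts), specificity(counts))  # prints: 1.0 1.0
```

An ideal test scores 1.0 on both measures; every false negative lowers sensitivity and every false positive lowers specificity. Both functions divide by zero if the reference contains no positive (or no negative) cases, so a real evaluation would need to guard against that.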

As new diagnostic methods become available, the "gold standard" test may change over time. For instance, for the diagnosis of aortic dissection, the gold standard test used to be the aortogram, which had a sensitivity as low as 83% and a specificity as low as 87%. With advances in magnetic resonance imaging, the magnetic resonance angiogram (MRA) has become the new gold standard test for aortic dissection, with a sensitivity of 95% and a specificity of 92%. Until a new test gains widespread acceptance, the former test retains its status as the "gold standard".
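To make the quoted figures concrete, the Python sketch below converts them into expected numbers of missed cases and false alarms. The cohort of 100 patients with dissection and 100 without is hypothetical, and the aortogram values are the lower bounds given above, so its row is a worst case rather than a typical result. The helper `expected_errors` is illustrative.

```python
def expected_errors(sens: float, spec: float, n_with: int, n_without: int) -> tuple[float, float]:
    """Expected (missed cases, false alarms) for a test with the given accuracy."""
    missed = n_with * (1 - sens)           # expected false negatives
    false_alarms = n_without * (1 - spec)  # expected false positives
    return missed, false_alarms


# Sensitivity and specificity as quoted in the text.
tests = {
    "aortogram": (0.83, 0.87),
    "MRA": (0.95, 0.92),
}

# Hypothetical cohort: 100 patients with dissection, 100 without.
for name, (sens, spec) in tests.items():
    missed, false_alarms = expected_errors(sens, spec, 100, 100)
    print(f"{name}: ~{missed:.0f} missed, ~{false_alarms:.0f} false positives")
# aortogram: ~17 missed, ~13 false positives
# MRA: ~5 missed, ~8 false positives
```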

Test calibration

Because tests can be incorrect (yielding a false-negative or a false-positive result), results should be interpreted in the context of the history, physical findings, and other test results of the individual being tested. It is within this context that the sensitivity and specificity of the "gold standard" test are determined.
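The text does not prescribe a method for combining clinical context with test accuracy. One standard approach is Bayes' theorem: the history, examination and other results supply a pre-test probability, and sensitivity and specificity update it. In this hedged Python sketch, `post_test_probability` is a hypothetical helper, the pre-test probabilities are invented, and the accuracy figures are the MRA values quoted for aortic dissection.

```python
def post_test_probability(pre_test: float, sens: float, spec: float, positive: bool) -> float:
    """Probability of the condition after a test result, via Bayes' theorem.

    pre_test is the probability estimated from the clinical context.
    """
    if positive:
        true_pos = pre_test * sens               # has condition, tests positive
        false_pos = (1 - pre_test) * (1 - spec)  # lacks condition, tests positive
        return true_pos / (true_pos + false_pos)
    false_neg = pre_test * (1 - sens)            # has condition, tests negative
    true_neg = (1 - pre_test) * spec             # lacks condition, tests negative
    return false_neg / (false_neg + true_neg)


# Same positive result, two hypothetical patients with different pre-test risk.
for pre in (0.5, 0.05):
    p = post_test_probability(pre, sens=0.95, spec=0.92, positive=True)
    print(f"pre-test {pre:.2f} -> post-test {p:.2f}")
# pre-test 0.50 -> post-test 0.92
# pre-test 0.05 -> post-test 0.38
```

The same positive result carries very different weight in a high-risk and a low-risk patient, which is one way to see why results are read in context rather than in isolation.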

When the gold standard is imperfect, its sensitivity and specificity must be calibrated against more accurate tests or against the definition of the condition.[6] This calibration is especially important when a perfect test is available only at autopsy. A test must also achieve adequate interobserver agreement, to avoid bias introduced by the study itself.[7]
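The text does not name a measure of interobserver agreement. Cohen's kappa is one widely used statistic for two raters: it compares observed agreement with the agreement expected by chance. The Python sketch below is a minimal implementation under that assumption; the helper name, the "pos"/"neg" labels and the readings are all hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap if each rater labelled at random
    # with their own observed label frequencies (Counter returns 0 if absent).
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical readings of ten scans by two observers.
reader_1 = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg", "neg"]
reader_2 = ["pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos", "neg", "pos"]
print(f"kappa = {cohens_kappa(reader_1, reader_2):.2f}")  # kappa = 0.58
```

Here the readers agree on 80% of scans, but half of that agreement would be expected by chance alone, so kappa is only about 0.58. What counts as adequate agreement depends on the field and is not specified in the text.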

Calibration errors can lead to misdiagnosis.[8]


Sometimes "gold standard test" refers to the best-performing test available. In these cases, there is no other criterion against which it can be compared, and it is equivalent to a definition. In this sense, gold standard tests are normally not performed at all, because they may be difficult to perform or impossible to perform on a living person (e.g. the test is performed as part of an autopsy, or its results take too long to become available to be clinically useful).

Other times, "gold standard" does not refer to the best performing test available, but the best available under reasonable conditions. For example, in this sense, an MRI is the gold standard for brain tumour diagnosis, though it is not as good as a biopsy. In this case the sensitivity and specificity of the gold standard are not 100% and it is said to be an "imperfect gold standard" or "alloyed gold standard".[6]
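An imperfect reference also distorts the apparent accuracy of any new test scored against it. The Python sketch below computes the expected apparent sensitivity and specificity under one simplifying assumption, stated in the code: the new test and the reference err independently given the true disease status. It only illustrates the direction of the bias; it is not the measurement-error correction method of the cited work,[6] and the function name and numbers are hypothetical.

```python
def apparent_accuracy(prevalence: float,
                      test_sens: float, test_spec: float,
                      ref_sens: float, ref_spec: float) -> tuple[float, float]:
    """Sensitivity and specificity a new test appears to have when scored
    against an imperfect reference.

    Assumes conditional independence: given the true disease status, the
    new test's errors are unrelated to the reference's errors.
    """
    p, q = prevalence, 1 - prevalence
    # P(reference positive) and P(test positive and reference positive)
    ref_pos = p * ref_sens + q * (1 - ref_spec)
    both_pos = p * test_sens * ref_sens + q * (1 - test_spec) * (1 - ref_spec)
    # P(reference negative) and P(test negative and reference negative)
    ref_neg = p * (1 - ref_sens) + q * ref_spec
    both_neg = p * (1 - test_sens) * (1 - ref_sens) + q * test_spec * ref_spec
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical numbers: a flawless new test, a reference with 90% sensitivity
# and 95% specificity, and 20% prevalence.
sens, spec = apparent_accuracy(0.2, 1.0, 1.0, 0.9, 0.95)
print(f"apparent sensitivity {sens:.2f}, apparent specificity {spec:.2f}")
# apparent sensitivity 0.82, apparent specificity 0.97
```

Even a perfect new test looks imperfect when the reference is "alloyed", because every reference error is counted against it. If the two tests tend to make the same mistakes, the conditional-independence assumption fails and the bias can run the other way.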

The term ground truth refers to the underlying absolute state of information; the gold standard strives to represent the ground truth as closely as possible. While the gold standard is a best effort to obtain the truth, ground truth is typically collected by direct observations. In machine learning and information retrieval, "ground truth" is the preferred term even when classifications may be imperfect; the gold standard is assumed to be the ground truth.[citation needed]

References


  1. Versi E (July 1992). "'Gold standard' is an appropriate term". BMJ. 305 (6846): 187. doi:10.1136/bmj.305.6846.187-b. PMC 1883235. PMID 1515860.
  2. ACLS: Principles and Practice (2003). Dallas: American Heart Association. p. 62. ISBN 0-87493-341-2.
  3. "Archives of Physical Medicine and Rehabilitation: Guide for Authors". Elsevier. February 2007. Retrieved 2007-09-11.
  4. "Criterion Standard - AMA Style Insider". Retrieved 2021-05-18.
  5. Troy LM, Michels KB, Hunter DJ, Spiegelman D (1996). "Self-reported birthweight and history of having been breastfed among younger women: an assessment of validity". Int J Epidemiol. 25 (1): 122–127. doi:10.1093/ije/25.1.122. PMID 8666479.
  6. Spiegelman D, Schneeweiss S, McDermott A (1997). "Measurement Error Correction for Logistic Regression Models with an 'Alloyed Gold Standard'". American Journal of Epidemiology. 145 (2): 184–196. doi:10.1093/oxfordjournals.aje.a009089. PMID 9006315.
  7. Stein PD, Athanasoulis C, Alavi A, Greenspan RH, Hales CA, Saltzman HA, Vreim CE, Terrin ML, Weg JG (1992). "Complications and Validity of Pulmonary Angiography in Acute Pulmonary Embolism". Circulation. 85 (2): 462–468. doi:10.1161/01.CIR.85.2.462. PMID 1735144.
  8. "The Impact of Calibration Error in Medical Decision Making".