False discovery rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of findings (i.e. studies where the null-hypotheses are rejected), FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses ("false discoveries").[1] FDR controlling procedures exert a less stringent control over false discovery compared to familywise error rate (FWER) procedures (such as the Bonferroni correction), which seek to reduce the probability of even one false discovery, as opposed to the expected proportion of false discoveries. Thus FDR procedures have greater power at the cost of increased rates of type I errors, i.e., rejecting the null hypothesis of no effect when it should be accepted.[2]

History

Technological motivations

The modern widespread use of the FDR is believed to stem from, and be motivated by, the development in technologies that allowed the collection and analysis of a large number of distinct variables in several individuals (e.g., the expression level of each of 10,000 different genes in 100 different persons).[3] By the late 1980s and 1990s, the development of "high-throughput" sciences, such as genomics, allowed for rapid data acquisition. This, coupled with the growth in computing power, made it possible to seamlessly perform hundreds and thousands of statistical tests on a given data set. The technology of microarrays was a prototypical example, as it enabled thousands of genes to be tested simultaneously for differential expression between two biological conditions.[4]

As high-throughput technologies became common, technological and financial constraints led researchers to collect datasets with relatively small sample sizes (e.g. few individuals being tested) and large numbers of variables measured per sample (e.g. thousands of gene expression levels). In these datasets, too few of the measured variables reached statistical significance after classic correction for multiple tests with standard multiple comparison procedures. This created a need within many scientific communities to abandon FWER and unadjusted multiple hypothesis testing in favor of other ways to highlight and rank, in publications, those variables showing marked effects across individuals or treatments that would otherwise be dismissed as non-significant after standard correction for multiple tests. In response, a variety of error rates have been proposed, and have become commonly used in publications, that are less conservative than FWER in flagging possibly noteworthy observations. As a side effect, standard correction for multiple tests has all but disappeared from publications other than those presenting results with very large sample sizes.

The false discovery rate concept was formally described by Yoav Benjamini and Yosef Hochberg in 1995[1] as a less conservative and arguably more appropriate approach for identifying the important few from the trivial many effects tested. The FDR has been particularly influential, as it was the first alternative to the FWER to gain broad acceptance in many scientific fields (especially in the life sciences, from genetics to biochemistry, from oncology to plant sciences).[3] In 2005, the Benjamini and Hochberg paper from 1995 was identified as one of the 25 most-cited statistical papers.[5]

Prior to the 1995 introduction of the FDR concept, various precursor ideas had been considered in the statistics literature. In 1979, Holm proposed the Holm procedure,[6] a stepwise algorithm for controlling the FWER that is at least as powerful as the well-known Bonferroni adjustment. This stepwise algorithm sorts the p-values and sequentially rejects the hypotheses starting from the smallest p-value.

Benjamini (2010)[3] said that the false discovery rate, and the paper Benjamini and Hochberg (1995), had their origins in two papers concerned with multiple testing:

  • The first paper is by Schweder and Spjotvoll (1982)[7] who suggested plotting the ranked p-values and assessing the number of true null hypotheses (m_0) via an eye-fitted line starting from the largest p-values. The p-values that deviate from this straight line then should correspond to the false null hypotheses. This idea was later developed into an algorithm and incorporated the estimation of m_0 into procedures such as Bonferroni, Holm or Hochberg.[8] This idea is closely related to the graphical interpretation of the BH procedure.
  • The second paper is by Branko Soric (1989)[9] which introduced the terminology of "discovery" in the multiple hypothesis testing context. Soric used the expected number of false discoveries divided by the number of discoveries as a warning that "a large part of statistical discoveries may be wrong". This led Benjamini and Hochberg to the idea that a similar error rate, rather than being merely a warning, can serve as a worthy goal to control.

The q-value quantity (defined below) was first proposed by John Storey.[10]

Definitions

Classification of m hypothesis tests

The following table gives the numbers of errors committed when testing m null hypotheses. It defines some random variables that are related to the m hypothesis tests.

                           Null hypothesis is true (H0)   Alternative hypothesis is true (H1)   Total
Declared significant       V                              S                                      R
Declared non-significant   U                              T                                      m − R
Total                      m_0                            m − m_0                                m
  • m is the total number of hypotheses tested
  • m_0 is the number of true null hypotheses
  • m − m_0 is the number of true alternative hypotheses
  • V is the number of false positives (Type I error) (also called "false discoveries")
  • S is the number of true positives (also called "true discoveries")
  • T is the number of false negatives (Type II error)
  • U is the number of true negatives
  • R = V + S is the number of rejected null hypotheses (also called "discoveries")
  • In m hypothesis tests of which m_0 are true null hypotheses, R is an observable random variable, and S, T, U, and V are unobservable random variables.

The FDR

Based on previous definitions we can define Q = V/R as the proportion of false discoveries among the discoveries. And the false discovery rate is given by:[1]

FDR = Q_e = E[Q] = E[V/R],   where V/R is defined to be 0 when R = 0.

And one wants to keep this value below a threshold α (or q).

q-value

The q-value is defined to be the FDR analogue of the p-value. The q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant. One approach is to directly estimate q-values rather than fixing a level at which to control the FDR.[10]
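
The following is a minimal sketch, in Python, of one common way to obtain q-value-like quantities from a list of p-values: for each p-value, take the smallest BH-type FDR level at which it would be rejected. The function name is illustrative, and the sketch conservatively takes the proportion of true nulls as 1 (Storey's direct approach plugs in an estimate of that proportion instead).

    import numpy as np

    def estimate_q_values(p_values):
        """Conservative q-value estimates (true-null proportion taken as 1)."""
        p = np.asarray(p_values, dtype=float)
        m = p.size
        order = np.argsort(p)                            # indices sorting p-values ascending
        fdr_levels = p[order] * m / np.arange(1, m + 1)  # BH-type FDR level at each cutoff
        # The q-value of a test is the minimum FDR level over this cutoff
        # and every larger cutoff, which enforces monotonicity.
        q_sorted = np.minimum.accumulate(fdr_levels[::-1])[::-1]
        q = np.empty(m)
        q[order] = np.clip(q_sorted, 0.0, 1.0)
        return q

A test is then declared significant at FDR level q simply when its estimated q-value is at most q.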

Properties

The FDR is the expected proportion of false positives among all discoveries (rejected null hypotheses); for example, if the null hypotheses of 1000 hypothesis tests were experimentally rejected, and a maximum FDR level (q-value) for these tests was 0.10, then no more than 100 of these rejections would be expected to be false positives.

Adaptive and scalable

A multiplicity procedure that controls the FDR criterion is adaptive and scalable, meaning that controlling the FDR can be very permissive (if the data justify it) or conservative (acting close to FWER control for sparse problems), all depending on the number of hypotheses tested and the level of significance.[3]

The FDR criterion adapts so that the same number of false discoveries (V) will mean different things, depending on the total number of discoveries (R). This contrasts with the family-wise error rate criterion. For example, if inspecting 100 hypotheses (say, 100 genetic mutations or SNPs for association with some phenotype in some population):

  • If we make 4 discoveries (R), having 2 of them be false discoveries (V) is often unbearable. Whereas,
  • If we make 50 discoveries (R), having 2 of them be false discoveries (V) is often bearable.

The FDR criterion is scalable in that the same proportion of false discoveries out of the total number of discoveries (Q) remains sensible for different numbers of total discoveries (R). For example:

  • If we make 100 discoveries (R), having 5 of them be false discoveries (Q = 5%) can be bearable.
  • Similarly, if we make 1000 discoveries (R), having 50 of them be false discoveries (as before, Q = 5%) can still be bearable.

The FDR criterion is also scalable in the sense that, when a correction is made on a set of hypotheses, or two corrections are made after splitting that set of hypotheses into two, the discoveries in the combined analysis are (about) the same as when the two halves are analyzed separately. For this to hold, the sub-studies should be large, with some discoveries in them.[citation needed]

Dependency in the test statistics

Controlling the FDR using the linear step-up BH procedure, at level q, has several properties related to the dependency structure between the test statistics of the m null hypotheses that are being corrected for. If the test statistics are:

  • Independent:[11] FDR ≤ (m_0/m)q
  • Independent and continuous:[1] FDR = (m_0/m)q
  • Positively dependent:[11] FDR ≤ (m_0/m)q
  • In the general case:[11] FDR ≤ (m_0/m)q · (1 + 1/2 + 1/3 + ... + 1/m)

Proportion of true hypotheses

If all of the null hypotheses are true (m_0 = m), then controlling the FDR at level q guarantees control over the FWER (this is also called "weak control of the FWER"): FWER = P(V ≥ 1) = E[V/R] = FDR ≤ q, because when m_0 = m the event of rejecting at least one true null hypothesis {V ≥ 1} is exactly the event {V/R = 1}, and the event {V = 0} is exactly the event {V/R = 0}.[1] But if there are some true discoveries to be made (m_0 < m) then FWER ≥ FDR. In that case there will be room for improving detection power. It also means that any procedure that controls the FWER will also control the FDR.

Bayesian approaches

Connections have been made between the FDR and Bayesian approaches (including empirical Bayes methods),[12][13][14] thresholding wavelet coefficients and model selection,[15][16][17][18] and generalizing the confidence interval into the false coverage statement rate (FCR).[19]

Controlling procedures

The setting for many procedures is such that we have m null hypotheses tested and their corresponding p-values. We order these p-values in increasing order and denote them by P_(1), ..., P_(m). A small p-value often corresponds to a high test statistic. A procedure that goes from a small p-value to a large one will be called a step-up procedure. In a similar way, in a "step-down" procedure we move from a small corresponding test statistic to larger ones.

Benjamini–Hochberg procedure

The Benjamini–Hochberg procedure (BH step-up procedure) controls the false discovery rate (at level α).[1] The procedure works as follows:

  1. For a given α, find the largest k such that P_(k) ≤ (k/m)α.
  2. Then reject (i.e. declare positive discoveries) all H_(i) for i = 1, ..., k.

The BH procedure is valid when the tests are independent, and also in various scenarios of dependence.[11] It also satisfies the inequality: E[Q] ≤ (m_0/m)α ≤ α.

If an estimator of m_0 is inserted into the BH procedure, it is no longer guaranteed to achieve FDR control at the desired level.[3] Adjustments may be needed in the estimator and several modifications have been proposed.[20][21][22][23]

The BH procedure was proven to control the FDR in 1995 by Benjamini and Hochberg.[1] In 1986, R. J. Simes offered the same procedure as the "Simes procedure", in order to control the FWER in the weak sense (under the intersection null hypothesis).[24] In 1988, G. Hommel showed that it does not control the FWER in the strong sense.[25] Based on the Simes procedure, Yosef Hochberg discovered Hochberg's step-up procedure (1988) which does control the FWER in the strong sense.[26]

Note that the mean α for these m tests is α(m+1)/(2m), the Mean(FDR α) or MFDR, α adjusted for m independent (or positively correlated, see below) tests. The MFDR calculation shown here is for a single value of α and is not part of the Benjamini and Hochberg method; see AFDR below.
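
The step-up rule described above is straightforward to implement. The following is a minimal sketch in Python (the function name and the example p-values are illustrative, not from the source):

    import numpy as np

    def benjamini_hochberg(p_values, alpha=0.05):
        """Return a boolean mask of hypotheses rejected by the BH step-up rule."""
        p = np.asarray(p_values, dtype=float)
        m = p.size
        order = np.argsort(p)                          # sort p-values in increasing order
        thresholds = alpha * np.arange(1, m + 1) / m   # (k/m) * alpha for k = 1..m
        below = p[order] <= thresholds
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()             # largest k with P_(k) <= (k/m) alpha
            reject[order[:k + 1]] = True               # reject the k smallest p-values
        return reject

    # Example: five tests at FDR level alpha = 0.05
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.09]))

With these illustrative p-values the rule rejects the two smallest ones, even though 0.039 and 0.041 would each be significant at the unadjusted 0.05 level.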

Benjamini–Hochberg–Yekutieli procedure

The Benjamini–Hochberg–Yekutieli procedure controls the false discovery rate under arbitrary dependence assumptions.[11] This refinement modifies the threshold and finds the largest k such that:

  • If the tests are independent or positively correlated: P_(k) ≤ (k/m)α
  • Under arbitrary dependence: P_(k) ≤ (k / (m · c(m))) α,   where c(m) = 1 + 1/2 + 1/3 + ... + 1/m

In the case of negative correlation, c(m) can be approximated by using the Euler–Mascheroni constant: c(m) ≈ ln(m) + γ, with γ ≈ 0.57721.

Using MFDR and the formulas above, an adjusted MFDR, or AFDR, is the min(mean α) for m dependent tests = MFDR / c(m).
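
As a minimal sketch of the adjustment in Python (reusing the benjamini_hochberg function sketched in the previous subsection; the function name is illustrative), the arbitrary-dependence case simply runs the same step-up rule at the level α / c(m):

    import numpy as np

    def benjamini_yekutieli(p_values, alpha=0.05):
        """BH step-up rule at level alpha / c(m), where c(m) = 1 + 1/2 + ... + 1/m.

        Valid under arbitrary dependence of the test statistics."""
        m = len(p_values)
        c_m = np.sum(1.0 / np.arange(1, m + 1))   # harmonic sum, roughly ln(m) + 0.577
        return benjamini_hochberg(p_values, alpha=alpha / c_m)

For example, with m = 10,000 tests the harmonic sum is about ln(10000) + 0.577 ≈ 9.8, so the nominal level is reduced roughly tenfold relative to the plain BH procedure.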

Dependence among the test statistics can also be addressed by bootstrapping and rerandomization.[4][27][28]

Estimating the FDR

Let π_0 be the proportion of true null hypotheses, and π_1 = 1 − π_0 be the proportion of true alternative hypotheses.[10] Then m·π_0 times the average p-value of rejected effects divided by the number of rejected effects gives an estimate of the FDR.
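
One concrete version of such a plug-in estimate is the Storey-type estimator, sketched below in Python (the function name and the tuning parameter lam are illustrative; this is one common estimator rather than necessarily the exact formula described above): the true-null proportion is estimated from the p-values above a cutoff lam, where true nulls dominate, and the expected number of false discoveries at a rejection threshold is compared with the observed number of discoveries.

    import numpy as np

    def estimate_fdr(p_values, threshold, lam=0.5):
        """Storey-type plug-in estimate of the FDR for rejecting all p-values <= threshold."""
        p = np.asarray(p_values, dtype=float)
        m = p.size
        # Estimate pi0, the proportion of true nulls, from the p-values above lam
        # (null p-values are assumed uniform, so about pi0 * m * (1 - lam) of them land there).
        pi0 = min(np.sum(p > lam) / (m * (1.0 - lam)), 1.0)
        rejected = np.sum(p <= threshold)
        if rejected == 0:
            return 0.0
        # Expected false discoveries (pi0 * m * threshold) over observed discoveries.
        return min(pi0 * m * threshold / rejected, 1.0)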

False coverage rate

The false coverage rate (FCR) is the FDR analogue for confidence intervals. The FCR indicates the average rate of false coverage, namely, of not covering the true parameters, among the selected intervals. The FCR gives simultaneous coverage at a 1 − q level for all of the parameters considered in the problem: intervals with simultaneous coverage probability 1 − q can control the FCR to be bounded by q. There are many FCR procedures such as: Bonferroni-Selected–Bonferroni-Adjusted,[citation needed] Adjusted BH-Selected CIs (Benjamini and Yekutieli (2005)),[19] Bayes FCR (Yekutieli (2008)),[citation needed] and other Bayes methods.[29] The incentive for choosing one procedure over another is the length of the CIs: we want them to be as narrow as possible while controlling the FCR.

Related error rates

The introduction of the FDR concept was preceded and followed by many other types of error rates. These include:

  • PCER (per-comparison error rate) is defined as: PCER = E[V/m]. Testing individually each hypothesis at level α guarantees that PCER ≤ α (this is testing without any correction for multiplicity).
  • FWER (the family-wise error rate, in the weak sense) is defined as: FWER = P(V ≥ 1), with control required only when all null hypotheses are true. There are numerous procedures that control the FWER.
  • FWER (the family-wise error rate, in the strong sense) is defined as: FWER = P(V ≥ 1), with control required under every configuration of true and false null hypotheses. There are numerous procedures that control the FWER.
  • The tail probability of the false discovery proportion (FDP), suggested by Lehmann and Romano, van der Laan et al.,[citation needed] is defined as: P(V/R > q).
  • (Suggested by Sarkar[citation needed]) is defined as: .
  • Q' = E[V]/R, the expected number of false discoveries divided by the number of discoveries, was suggested by Soric in 1989.[9] This is a mixture of expectations and realizations, and has the problem of control for m_0 = m.[1]
  • E[V]/E[R] was used by Benjamini and Hochberg,[3] and later called "Fdr" by Efron (2008) and earlier.[14] It is defined as: Fdr = E[V]/E[R]. Controlling this error rate does not provide weak control of the FWER.
  • E[V/R | R > 0] was used by Benjamini and Hochberg,[3] and later called "pFDR" by Storey (2002).[10] It is defined as: pFDR = E[V/R | R > 0]. Controlling this error rate does not provide weak control of the FWER.
  • False exceedance rate (the tail probability of FDP), defined as:[30] P(V/R > q).
  • W-FDR (weighted FDR): associated with each hypothesis i is a weight w_i; the weights capture importance/price. The W-FDR is defined as: W-FDR = E[Σ w_i V_i / Σ w_i R_i], where V_i and R_i indicate, respectively, a false rejection and any rejection of hypothesis i.
  • FDCR (false discovery cost rate). Stemming from statistical process control: associated with each hypothesis i is a cost c_i and with the intersection hypothesis a cost c_0. The motivation is that stopping a production process may incur a fixed cost. It is defined as:
  • PFER (per-family error rate), at level α, is defined as: E[V] ≤ α.
  • FNR (false non-discovery rates) by Sarkar; Genovese and Wasserman[citation needed] is defined as: FNR = E[T/(m − R)] = E[T/(T + U)].
  • is defined as:
  • The local fdr is defined as: fdr(z) = π_0 f_0(z) / f(z), the posterior probability that a hypothesis with test statistic z is null, where f_0 is the null density of the test statistic and f is its overall (mixture) density.

References

  1. ^ a b c d e f g h Benjamini, Yoav; Hochberg, Yosef (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing" (PDF). Journal of the Royal Statistical Society, Series B. 57 (1): 289–300. MR 1325392.
  2. ^ Shaffer J.P. (1995) Multiple hypothesis testing, Annual Review of Psychology 46:561-584, Annual Reviews
  3. ^ a b c d e f g Benjamini, Yoav (2010). "Discovering the false discovery rate". Journal of the Royal Statistical Society, Series B. doi:10.1111/j.1467-9868.2010.00746.x.
  4. ^ a b Storey, John D.; Tibshirani, Robert (2003). "Statistical significance for genome-wide studies" (PDF). Proceedings of the National Academy of Sciences. 100 (16): 9440–9445. Bibcode:2003PNAS..100.9440S. doi:10.1073/pnas.1530509100. PMC 170937. PMID 12883005.
  5. ^ Ryan, T. P.; Woodall, W. H. (2005). "The most-cited statistical papers". Journal of Applied Statistics. doi:10.1080/02664760500079373.
  6. ^ Holm, S. (1979). "A simple sequentially rejective multiple test procedure". Scandinavian Journal of Statistics. 6 (2): 65–70. JSTOR 4615733. MR 0538597.
  7. ^ Schweder, T.; Spjøtvoll, E. (1982). "Plots of P-values to evaluate many tests simultaneously". Biometrika. 69 (3): 493–502. doi:10.1093/biomet/69.3.493.
  8. ^ Hochberg, Y.; Benjamini, Y. (1990). "More powerful procedures for multiple significance testing". Statistics in Medicine. 9 (7): 811–818. doi:10.1002/sim.4780090710.
  9. ^ a b Soric, Branko (June 1989). "Statistical "Discoveries" and Effect-Size Estimation". Journal of the American Statistical Association. 84 (406): 608–610. JSTOR 2289950.
  10. ^ a b c d Storey, John D. (2002). "A direct approach to false discovery rates" (PDF). Journal of the Royal Statistical Society, Series B. 64 (3): 479–498. doi:10.1111/1467-9868.00346.
  11. ^ a b c d e Benjamini, Yoav; Yekutieli, Daniel (2001). "The control of the false discovery rate in multiple testing under dependency" (PDF). Annals of Statistics. 29 (4): 1165–1188. doi:10.1214/aos/1013699998. MR 1869245.
  12. ^ Storey, John D. (2003). "The positive false discovery rate: A Bayesian interpretation and the q-value" (PDF). Annals of Statistics. 31 (6): 2013–2035. doi:10.1214/aos/1074290335.
  13. ^ Efron, Bradley (2010). Large-Scale Inference. Cambridge University Press. ISBN 978-0-521-19249-1.
  14. ^ a b Efron B (2008). "Microarrays, empirical Bayes and the two groups model". Statistical Science. 23: 1–22. doi:10.1214/07-STS236.
  15. ^ Abramovich F, Benjamini Y, Donoho D, Johnstone IM (2006). "Adapting to unknown sparsity by controlling the false discovery rate". Annals of Statistics. 34 (2): 584–653. arXiv:math/0505374. Bibcode:2005math......5374A. doi:10.1214/009053606000000074.
  16. ^ Donoho D, Jin J (2006). "Asymptotic minimaxity of false discovery rate thresholding for sparse exponential data". Annals of Statistics. 34 (6): 2980–3018. arXiv:math/0602311. Bibcode:2006math......2311D. doi:10.1214/009053606000000920.
  17. ^ Benjamini Y, Gavrilov Y (2009). "A simple forward selection procedure based on false discovery rate control". The Annals of Applied Statistics. 3 (1): 179–198. arXiv:0905.2819. Bibcode:2009arXiv0905.2819B. doi:10.1214/08-AOAS194.
  18. ^ Donoho D, Jin JS (2004). "Higher criticism for detecting sparse heterogeneous mixtures". Annals of Statistics. 32 (3): 962–994. arXiv:math/0410072. Bibcode:2004math.....10072D. doi:10.1214/009053604000000265.
  19. ^ a b Benjamini Y, Yekutieli Y (2005). "False discovery rate controlling confidence intervals for selected parameters". Journal of the American Statistical Association. 100 (469): 71–80. doi:10.1198/016214504000001907.
  20. ^ doi:10.1111/j.1467-9868.2004.00439.x.
  21. ^ doi:10.1093/biomet/93.3.491.
  22. ^ doi:10.1214/07-AOS586.
  23. ^ doi:10.1214/08-EJS180.
  24. ^ Simes, R. J. (1986). "An improved Bonferroni procedure for multiple tests of significance". Biometrika. 73 (3): 751–754. doi:10.1093/biomet/73.3.751.
  25. ^ Hommel, G. (1988). "A stagewise rejective multiple test procedure based on a modified Bonferroni test". Biometrika. 75 (2): 383–386. doi:10.1093/biomet/75.2.383.
  26. ^ Hochberg, Yosef (1988). "A Sharper Bonferroni Procedure for Multiple Tests of Significance" (PDF). Biometrika. 75 (4): 800–802. doi:10.1093/biomet/75.4.800.
  27. ^ Yekutieli D, Benjamini Y (1999). "Resampling based False Discovery Rate controlling procedure for dependent test statistics". Journal of Statistical Planning and Inference. 82: 171–196. doi:10.1016/S0378-3758(99)00041-5.
  28. ^ van der Laan, M. J. and Dudoit, S. (2007). Multiple Testing Procedures with Applications to Genomics. New York: Springer.
  29. ^ doi:10.1111/j.1467-9868.2012.01033.x.
  30. ^ doi:10.1002/bimj.200900299.