Misunderstandings of p-values

Misunderstandings of p-values are an important problem in scientific research and scientific education.

Dividing results into significant and non-significant effects can be highly misleading.[1][2] For instance, analysis of nearly identical datasets can result in p-values that differ greatly in significance.[1] In medical research, p-values were a considerable improvement over previous approaches, but misunderstandings of p-values have become more important for reasons such as the increased statistical complexity of published research.[2] It has been suggested that in fields such as psychology, where studies typically have low statistical power, using significance testing can lead to increased error rates.[1][3]
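The fragility of the significant/non-significant dichotomy can be illustrated with a small sketch. The following uses a hypothetical one-sample z-test with known standard deviation (the sample means 0.19 and 0.20 are invented for illustration): two nearly identical results land on opposite sides of the 0.05 threshold.

```python
import math

def z_test_p_value(sample_mean, sigma, n):
    """Two-sided p-value for H0: population mean = 0, with known sigma."""
    z = sample_mean * math.sqrt(n) / sigma
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# Two nearly identical results (sigma = 1, n = 100): sample means 0.19 and 0.20
p1 = z_test_p_value(0.19, sigma=1.0, n=100)  # ~0.057, "non-significant"
p2 = z_test_p_value(0.20, sigma=1.0, n=100)  # ~0.046, "significant"
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
```

A difference of 0.01 in the sample mean moves the result across the conventional threshold, even though the two datasets support essentially the same conclusion.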

Comparing the p-value to a significance level yields one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which does not imply that the null hypothesis is true). In Fisher's formulation, a low p-value presents a disjunction: either the null hypothesis is true and a highly improbable event has occurred, or the null hypothesis is false. However, people interpret the p-value in many incorrect ways and try to draw conclusions from it that do not follow.

The p-value does not in itself support reasoning about the probabilities of hypotheses; that requires multiple hypotheses or a range of hypotheses, with a prior distribution over them, as in Bayesian statistics. There, one uses a likelihood function over all possible hypotheses, combined with the prior, instead of a p-value for a single null hypothesis.

The p-value refers only to a single hypothesis, called the null hypothesis, and does not make reference to or allow conclusions about any other hypotheses, such as the alternative hypothesis in Neyman–Pearson statistical hypothesis testing. In that approach, one instead has a decision function between two alternatives, often based on a test statistic, and computes the rate of type I and type II errors as α and β. However, the p-value of a test statistic cannot be directly compared to these error rates α and β. Instead, it is fed into a decision function.

1. The p-value is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false; it says nothing about either. In fact, frequentist statistics does not and cannot attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null hypothesis is very close to one (if no alternative hypothesis with a large enough prior probability would explain the results more easily); this is Lindley's paradox. There are also prior probability distributions for which the posterior probability and the p-value have similar or equal values.[7]
2. The p-value is not the probability that a finding is "merely a fluke." Calculating the p-value is based on the assumption that every finding is a fluke, the product of chance alone. The phrase "the results are due to chance" is used to mean that the null hypothesis is probably correct. However, that is merely a restatement of the inverse probability fallacy since the p-value cannot be used to figure out the probability of a hypothesis being true.
3. The p-value is not the probability of falsely rejecting the null hypothesis. That error is a version of the so-called prosecutor's fallacy.
4. The p-value is not the probability that replicating the experiment would yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of p-rep.
5. The significance level, such as 0.05, is not determined by the p-value. Rather, the significance level is decided by the person conducting the experiment (with the value 0.05 widely used by the scientific community) before the data are viewed, and it is compared against the calculated p-value after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level and allows readers to decide for themselves whether to consider the results significant.)
6. The p-value does not indicate the size or importance of the observed effect. The two vary together, however, and the larger the effect, the smaller the sample size that will be required to get a significant p-value (see effect size).
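Point 6 can be made concrete with another short sketch (again a hypothetical z-test with known standard deviation; the numbers are chosen for illustration): a fixed, negligible effect becomes arbitrarily "significant" as the sample size grows, so a small p-value by itself says nothing about the size of the effect.

```python
import math

def z_test_p_value(sample_mean, sigma, n):
    """Two-sided p-value for H0: population mean = 0, with known sigma."""
    z = sample_mean * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny effect (a mean shift of 0.01 standard deviations) at
# increasing sample sizes: the p-value shrinks, but the effect does not grow.
for n in (1_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: p = {z_test_p_value(0.01, sigma=1.0, n=n):.3g}")
```

At n = 1,000 the effect is far from significant (p ≈ 0.75); at n = 1,000,000 the same effect yields a vanishingly small p-value.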

The p-value fallacy

The p-value fallacy is a common misinterpretation of the meaning of a p-value whereby a binary classification of experimental results as true or false is made, based on whether or not they are statistically significant. It derives from the assumption that a p-value can be used to summarize an experiment's results, rather than being a heuristic that is not always useful.[1][2] The term "p-value fallacy" was coined in 1999 by Steven N. Goodman.[2][8]

In the p-value fallacy, a single number is used to represent both the false positive rate under the null hypothesis H0 and also the strength of the evidence against H0. However, there is a trade-off between these factors, and it is not logically possible to do both at once.[2] Neyman and Pearson described the trade-off as between being able to control error rates over the long term and being able to evaluate conclusions of specific experiments in the short term, but a common misinterpretation of p-values is that the trade-off can be avoided.[2] Another way to view the error is that studies in medical research are often designed using a Neyman–Pearson statistical approach but analyzed with a Fisherian approach.[9] However, this is not a contradiction between frequentist and Bayesian reasoning, but a basic property of p-values that applies in both cases.[8]

This fallacy is contrary to the intent of the statisticians who originally supported the use of p-values in research.[2][4] As described by Sterne and Smith, "An arbitrary division of results, into 'significant' or 'non-significant' according to the P value, was not the intention of the founders of statistical inference."[4] In contrast, common interpretations of p-values discourage the ability to distinguish statistical results from scientific conclusions, and discourage the consideration of background knowledge such as previous experimental results.[2] The correct use of p-values is to guide behavior, not to classify results;[1] that is, to inform a researcher's choice of which hypothesis to accept, not provide an inference about which hypothesis is true.[2]

False discovery rate

The misinterpretation of p-values as frequentist error rates occurs because the false discovery rate is not taken into account.[10] The false discovery rate (FDR) refers to the probability that a result flagged as significant is actually a false positive, i.e. that the null hypothesis has been incorrectly rejected (a type I error). The FDR increases with the number of tests performed.[10][11] In general, for n independent hypotheses tested with the criterion p < 0.05, the probability of obtaining at least one false positive is given as

${\displaystyle P(\mathrm {false~positive} )=1-(1-0.05)^{n}}$

Webcomic artist and science popularizer Randall Munroe of xkcd parodied the mainstream media's general unawareness of the p-value fallacy in a strip portraying scientists who investigate the claim that eating jellybeans causes acne.[10][11][12][13] The scientists test the claim and find no link between the consumption of jellybeans and the prevalence of acne at the usual p < 0.05 threshold. Then, when a new claim is made that only jellybeans of certain colors cause acne, they proceed to investigate 20 different colors of jellybeans, one of which (green) is found to correlate with acne with p < 0.05. The general media then runs the sensationalistic headline "Green jellybeans linked to acne! 95% confidence! Only 5% chance of coincidence!", ignoring that this is just the roughly 1-in-20 false positive one would expect among 20 tests at the p < 0.05 criterion even when no real effect exists.

When performing 20 tests with a criterion of p < 0.05, as in the xkcd comic, there is a 64% chance of at least one false positive result (assuming there are no real effects). If the number of tests is increased to 100, the chance of at least one false positive rises to 99%.[11]
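These figures follow directly from the formula above; a minimal check (the function name is chosen here for illustration):

```python
def family_false_positive_rate(n, alpha=0.05):
    """Probability of at least one false positive among n independent tests,
    each using significance threshold alpha, assuming no real effects."""
    return 1 - (1 - alpha) ** n

print(family_false_positive_rate(20))   # ~0.64
print(family_false_positive_rate(100))  # ~0.99
```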

Alternatives to significance testing

The use of significance testing as the basis for decisions has also been criticized as a whole, because of the p-value fallacy and other widespread misunderstandings about the process.[2][4][5] For example, p-values do not address the probability of the null hypothesis being true or false, and the choice of significance threshold should not be arbitrary but instead informed by the consequences of a false positive.[1] It is possible to use Bayes factors for calibration, which allows the use of p-values while reducing the impact of the p-value fallacy, although these approaches introduce other biases as well.[8]
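One published calibration (due to Sellke, Bayarri, and Berger) bounds the Bayes factor in favor of the null hypothesis from below by −e·p·ln(p) for p < 1/e. A short sketch of the computation:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor for H0,
    valid for 0 < p < 1/e: B >= -e * p * ln(p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

# A "significant" p = 0.05 corresponds to a Bayes factor of at least ~0.41:
# the data lower the odds in favor of H0 by at most a factor of about 2.5,
# far weaker evidence than the nominal 1-in-20 might suggest.
print(min_bayes_factor(0.05))
```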

References

1. Dixon P (2003). "The p-value fallacy and how to avoid it". Canadian Journal of Experimental Psychology. 57 (3): 189–202. PMID 14596477.
2. Goodman SN (1999). "Toward evidence-based medical statistics. 1: The P value fallacy". Annals of Internal Medicine. 130 (12): 995–1004. PMID 10383371.
3. Hunter JE (1997). "Needed: A Ban on the Significance Test". Psychological Science. 8 (1): 3–7. doi:10.1111/j.1467-9280.1997.tb00534.x.
4. Sterne JA, Smith GD (2001). "Sifting the evidence–what's wrong with significance tests?". BMJ. 322 (7280): 226–231. doi:10.1136/bmj.322.7280.226. PMC 1119478. PMID 11159626.
5. Schervish MJ (1996). "P Values: What They Are and What They Are Not". The American Statistician. 50 (3): 203. doi:10.2307/2684655. JSTOR 2684655.
6. Wasserstein RL, Lazar NA (2016). "The ASA's statement on p-values: context, process, and purpose". The American Statistician. doi:10.1080/00031305.2016.1154108.
7. Casella G, Berger RL (1987). "Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem". Journal of the American Statistical Association. 82 (397): 106–111. doi:10.1080/01621459.1987.10478396.
8. Sellke T, Bayarri M, Berger JO (2001). "Calibration of p values for testing precise null hypotheses". The American Statistician. 55 (1): 62–71. doi:10.1198/000313001300339950.
9. de Moraes AC, Cassenote AJ, Moreno LA, Carvalho HB (2014). "Potential biases in the classification, analysis and interpretations in cross-sectional study: commentaries - surrounding the article "resting heart rate: its correlations and potential for screening metabolic dysfunctions in adolescents"". BMC Pediatrics. 14: 117. doi:10.1186/1471-2431-14-117. PMC 4012522. PMID 24885992.
10. Colquhoun D (19 November 2014). "An investigation of the false discovery rate and the misinterpretation of p-values". Royal Society Open Science. 1 (3): 140216. doi:10.1098/rsos.140216.
11. Reinhart A (2015). Statistics Done Wrong: The Woefully Complete Guide. No Starch Press. pp. 47–48. ISBN 9781593276201.
12. Munroe R. "Significant". xkcd. Retrieved 2016-02-22.
13. Barsalou M (2 June 2014). "Hypothesis Testing and P Values". Minitab blog. Retrieved 2016-02-22.