Talk:Statistical significance

Comment on Dubious

The citation (USEPA December 1992) contains numerous statistical tests, some presented as p-values and some as confidence intervals. Figures 5-1 through 5-4 show some of the test statistics used in the citation. In two of the figures the statistics are for individual studies and can be assumed to be prospective. In the other two, the statistics are for pooled studies and can be assumed to be retrospective. Table 5-9 includes the results of the test of the hypothesis RR = 1 versus RR > 1 for individual studies and for pooled studies. In the cited report, no distinction between prospective tests and retrospective tests was made. This is a departure from the traditional scientific method, which makes a strict distinction between predictions of the future and explanations of the past. Gjsis (talk)

Sources showing that the leading definition "due to chance" of statistical significance is wrong

Because it is difficult to follow the discussion about the leading sentence above, I have made a new section. Some people have requested that I produce exact sentences from the sources I mentioned that reject the definition of statistical significance as "low probability that an observed effect would have occurred due to chance" and embrace the definition that "the probability of observing at least as extreme results, given the truth of the null hypothesis, is low". Here they are:

"The p value is the probability of getting our observed results, or a more extreme results, if the null hypothesis is true. So p is defined in relation to a stated null hypothesis, and requires as the basis for calculation that we assume the null is true. It's a common error to think p gives the probability that the null is true: That's the inverse probability fallacy." (p. 27)

"Note carefully that .011 was the probability of particular extreme results if the null hypothesis is true. Surprising results may reasonably lead you to doubt the null, but p is not the probability that the null is true. Some statistics textbooks say that p measures the probability that "the results are due to chance" -- in other words, the probability that the null hypothesis is correct. However, that's merely a restatement of the inverse probability fallacy. It is completely wrong to say that the p value is the probability that the results are due to chance." (p. 28)

Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

"Misinterpretations of p Values

Next we consider common misunderstandings about the probabilities generated by statistical tests, p values. Let us first review their correct interpretation. Recall that statistical tests measure the discrepancy between a sample statistic and the value of the population parameter specified in the null hypothesis, H0, taking account of sampling error. The empirical test statistic is converted to a probability within the appropriate central test distribution. This probability is the conditional probability of the statistic assuming H0 is true." (p. 63)

"Fallacy Number 1

A p value is the probability that the result is a result of sampling error; thus, p < 0.05 says that there is less than a 5% likelihood that the results happened by chance. This false belief is the odds-against-chance fantasy. It is wrong because p values are computed under the assumption that sampling error is what causes sample statistics to depart from the null hypothesis. That is, the likelihood of sampling error is already taken to be 1.00 when a statistical test is conducted. It is thus illogical to view p values as measuring the probability of sampling error." (p. 63)

Kline, R. B. (2005). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, DC: American Psychological Association.

"In the process of calculating the P value, we assumed that H0 was true and that x was drawn from H0. Thus, a small P value (for example, P = 0.05) merely tells us that an improbable event has occurred in the context of this assumption."

Krzywinski, M., & Altman, N. (2013). Points of significance: Significance, P values and t-tests. Nat Meth, 10(11), 1041-1042.

"The P value, which was introduced earlier by Fisher in the context of significance testing, is defined as the probability of obtaining — among the values of T generated when H0 is true — a value that is at least as extreme as that of the actual sample (denoted as t). This can be represented as P = P(T ≥ t | H0)."

Sham, P. C., & Purcell, S. M. (2014). Statistical power and significance testing in large-scale genetic studies. Nat Rev Genet, 15(5), 335-346.

"The P value from a classical test is the maximum probability of observing a test statistic as extreme, or more extreme, than the value that was actually observed, given that the null hypothesis is true."

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313-19317.

"The p-value can be interpreted as the probability of obtaining the observed differences, or one more extreme, if the null hypothesis is true." (p. 103)

Machin, D., Campbell, M. J., Walters, S. J. (). Medical Statistics. A textbook for the health sciences. 4th edition. West Sussex, England: Wiley.

"The P value is the probability of having our observed data (or more extreme data) when the null hypothesis is true." (p. 167)

"A common misinterpretation of the P value is that it is the probability of the data having arisen by chance, or equivalently, that P is the probability is the observed effect is not a real one. The distinction between this incorrect definition and the true definition given earlier is the absence of the phrase when the null hypothesis is true." (p. 170)

Altman, D. G. (1999). Practical Statistics for Medical Research. New York: Chapman & Hall/CRC.

Based on these references, the leading definition should be changed to something like:

A result is statistically significant if the probability of observing the results, or more extreme results, given the truth of the null hypothesis, is low.
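
To make the conditional nature of this definition concrete, here is a minimal simulation sketch in Python (entirely hypothetical numbers, not taken from any of the sources above): the p-value is estimated as the fraction of test statistics, generated under the assumption that the null hypothesis is true, that are at least as extreme as the observed one.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sample; H0: population mean = 0, one-sided alternative: mean > 0.
    observed = np.array([0.4, 1.1, -0.2, 0.9, 0.6, 1.3, 0.2, 0.8])
    n = len(observed)
    t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(n))

    # Distribution of the test statistic *given that the null hypothesis is true*.
    # (Toy assumption: normally distributed data; the scale choice does not affect
    # the t-statistic's null distribution.)
    sims = rng.normal(loc=0.0, scale=1.0, size=(100_000, n))
    t_null = sims.mean(axis=1) / (sims.std(axis=1, ddof=1) / np.sqrt(n))

    p_value = np.mean(t_null >= t_obs)  # P(T >= t | H0): observed or more extreme results under H0
    print(p_value, p_value < 0.05)      # "statistically significant" only if p falls below the threshold

Note that nothing in this calculation refers to the probability that the null hypothesis itself is true; treating the result that way is exactly the inverse probability fallacy described in the sources.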

EmilKarlsson (talk) 13:09, 20 June 2014 (UTC)

EmilKarlsson, thanks for producing these sources. Point taken and I think your proposed lead definition is acceptable. Right now, I have only two requests. Could you:
  • begin your lead definition as "statistical significance is XXXX"? (WP:BEGINNING)
  • Take a look at the first lead paragraph to make sure that your proposed lead definition "flows" with the rest of the paragraph. If not, feel free to propose additional suggested edits.
I know I have been a pain in the rear about demanding sources, etc. But believe me, for the contents of this page to endure, sources and consensus are critical. Thanks again for your time. danielkueh (talk) 17:52, 20 June 2014 (UTC)
Here are some candidate suggestions that may fulfill the requirements above:
(1) "Statistical significance obtains when the probability of an observed result (or a more extreme result) is low, given the truth of the null hypothesis."
(2) "Statistical significance obtains when the probability of an observed result (or a more extreme result) is low, had the null hypothesis been true."
(3) "Statistical significance obtains when the probability of an at least as extreme result is low, given the truth of the null hypothesis."
(4) "Statistical significance obtains when the probability of an at least as extreme result is low, had the null hypothesis been true."
Constructing a good sentence that captures both "probability of the results or more extreme results" and "given the null hypothesis" is a bit tricky. Does any of the candidates above seem to have more clarity and be less convoluted than the others? I tend to make things too bulky, so I would be more than grateful for input. EmilKarlsson (talk) 14:47, 21 June 2014 (UTC)
If I were a naive reader reading these sentences for the first time, I would get an impression of how to achieve statistical significance, but I don't think I would understand what statistical significance itself is. It is like knowing the process of achieving a gold medal (getting first place) without actually knowing what a gold medal really is. Much of the definition (1-4) is focused on conditional probability, which I have no problems with, but the emphasis should be on what statistical significance is. It is when the probability is "low," i.e., below a predefined threshold, like p < .05. How can we capture that in the lead sentence without overwhelming a naive reader? Here's a definition from another source that I think does just that, which is reasonably broad and is consistent with the definitions you put forward:
  • Statistical significance refers to whether or not the value of a statistical test exceeds some prespecified level. ([1]) p. 35 of Redmond and Colton. 2001. Biostatistics in Clinical Trials.
We should strive for something that is just as simple and succinct and consistent with available sources. If we need to add more information, we will have plenty of space to do so in the first lead paragraph. danielkueh (talk) 02:36, 22 June 2014 (UTC)
Here is another suggestion that attempts to incorporate some of the points that you made above:
(5) Statistical significance occurs when the probability of getting a result or a more extreme result (given that the null hypothesis is true) is lower than a certain pre-specified cutoff.
It is still a little bit bulky. How can we improve this? EmilKarlsson (talk) 19:13, 22 June 2014 (UTC)
Better. But I think we can simplify it a little further. We both agree that for there to be statistical significance, the p-value has to be less than a predefined value like p<.05. Without qualifying terms like "low" or "below a predefined value," we're really just defining p-values. So how about something like this:
(6) Statistical significance is the low probability of getting at least as extreme results given that the null hypothesis is true.
The key term here is "low," which could be explained further in subsequent sentences. danielkueh (talk) 19:27, 22 June 2014 (UTC)
I think that (6) is a really good definition and it has my support. Perhaps the leading paragraph needs to be modified, both to explain what "low probability" means in this context and to fix some of the subsequent sentences that assume the "due to chance" definition. Here is one potential suggestion:
(a) Statistical significance is the low probability of getting at least as extreme results given that the null hypothesis is true. It is an integral part of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected. In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect could have occurred due to sampling error alone. But if the probability of obtaining at least as extreme result (given the null hypothesis) is lower than a pre-determined threshold (such as 5%), then an investigator can conclude that it is unlikely that the observed effect could only be explained by sampling error.
How can we make this better? Is there anything that is obviously missing for a good leading paragraph? Does it feel too repetitive? EmilKarlsson (talk) 10:34, 29 June 2014 (UTC)
I think this looks good. Nice work. My only comment is whether "extreme results" needs explaining. But overall, I think the first lead paragraph is decent, though the second and third may need some tuning. We should now start finalizing the references for the first lead paragraph. I recommend that we start copying and pasting all the reference tags into the proposed lead paragraph and just edit out the ones that are not relevant and insert the ones that are. danielkueh (talk) 22:30, 29 June 2014 (UTC)
Here's the newly proposed lead with the all the original references:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result, given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
I have "edited out" the references that do not directly support the new definition and replaced them with the Redmond and Colton (2001) reference, which provides the closest definition. If you know of another source or two, please add them. Aside from that, I think it is also important to keep the point at the end that an "effect actually reflects the characteristics of the population," which is the whole point of inferential statistics. Plus, it also allows readers to better understand what the alternative conclusion to sampling error is. danielkueh (talk) 14:42, 30 June 2014 (UTC)
I am fine with keeping the "effect actually.." addition. If we can use some of the sources I listed in the first paragraph of this subsection, we get something like this:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result, given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
Some of the references I added might look a bit weird as I am not a frequent contributor to Wikipedia (although I have read the template pages for citing references). For instance, the Johnson reference (currently 8) is missing a page number and other stuff, but it is an early release at this point so it has not been assigned any. Any suggestions for improving the completeness of the added reference? EmilKarlsson (talk) 18:33, 3 July 2014 (UTC)
I have looked at each of the references that you inserted and have found them to be relevant and of high quality. They are more than sufficient for now. That said, I still think there should be one sentence or less that quickly explains what an "extreme" result is. Maybe such a sentence is better placed in the main body of this article? Either way, I think this newly revised first lead paragraph is ready for prime time. danielkueh (talk) 21:35, 6 July 2014 (UTC)
How about this?
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result, given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4] In this context, more extreme results usually refers to a larger difference between the experimental and control group than was observed.
This is a simplified explanation of "more extreme" that does not go into the details about distribution tails or other scenarios besides the two-group design (although I assume this is the most common set-up, at least in biology, medicine and the social sciences). Is this sufficient, or can it be phrased in such a way that it encompasses more complexity without becoming too bulky? EmilKarlsson (talk) 12:37, 7 July 2014 (UTC)
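
To put a concrete (and entirely hypothetical) picture on the two-group case described above, where "more extreme" means a larger difference between the sample means than the one actually observed, a minimal sketch:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    control = rng.normal(loc=10.0, scale=2.0, size=30)    # hypothetical control group
    treatment = rng.normal(loc=11.5, scale=2.0, size=30)  # hypothetical experimental group

    # Two-sample t-test: p is the probability, assuming H0 (equal population means),
    # of a difference between the sample means at least as large as the one observed.
    t_stat, p_value = stats.ttest_ind(treatment, control)
    print(p_value, p_value < 0.05)  # significant only if p is below the pre-specified threshold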
I moved the new sentence to a different location:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result (large difference between two or more sample means), given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
I also replaced "control and experimental" with "two or more sample means". danielkueh (talk) 15:20, 7 July 2014 (UTC)

──────────────────────────────────────────────────────────────────────────────────────────────────── Please remember that the lead of the article needs to give the reader enough context to understand the precise definition, too. In other words, the first sentence must begin with "Statistical significance is a concept in statistics that describes..." or something similar. ElKevbo (talk) 16:28, 7 July 2014 (UTC)

ElKevbo, fair point. We could just restore the second sentence from the present version to the new lead as follows:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] It is an integral part of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected.[4][10] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result (large difference between two or more sample means), given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
danielkueh (talk) 19:45, 7 July 2014 (UTC)
I am in full agreement. Any final issues before we deploy the new lead paragraph? EmilKarlsson (talk) 10:38, 8 July 2014 (UTC)
Yes, I agree it is ready as well. danielkueh (talk) 16:37, 8 July 2014 (UTC)
On a side note, I just noticed that the article itself uses the "given the null hypothesis" definition with a reference (17). I have not read that reference, so I do not know the specifics of what it is stating, but we could either subsume it in the lead sentence or leave it. EmilKarlsson (talk) 14:25, 8 July 2014 (UTC)
Here's the link [2] and quote from p. 329 of Devore (reference 17):
"The P-value is the probability, calculated assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as contradictory to Ho as the value calculated from the available sample."
I agree it can be included in the lead paragraph. danielkueh (talk) 16:37, 8 July 2014 (UTC)

Here's the lead with the Devore reference inserted:

Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9][11] It is an integral part of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected.[4][10] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result (large difference between two or more sample means), given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]

I believe we are set to go. Does anyone else have any other comments before we replace the existing lead? danielkueh (talk) 16:19, 9 July 2014 (UTC)

It is always possible to quibble over details when it comes to definitions in statistics. For instance, it is possible to have overpowered studies that can obtain statistical significance although the two groups have essentially identical sample means (so there is technically a third option besides sampling error and reflecting true characteristics of populations). But I am satisfied with the lead paragraph as it is written above. I think we are all pretty much in agreement at this point, so I have gone ahead, been bold, and deployed our new leading paragraph. Good work everyone. EmilKarlsson (talk) 22:32, 9 July 2014 (UTC)
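
As a purely hypothetical back-of-the-envelope illustration of the overpowered-study caveat above: with a very large sample, even a practically negligible difference between sample means can yield a very small p-value.

    import numpy as np
    from scipy import stats

    n = 1_000_000          # hypothetical sample size per group
    observed_diff = 0.01   # tiny observed difference between the two sample means
    sd = 1.0               # assumed common standard deviation

    se = sd * np.sqrt(2.0 / n)                      # standard error of the difference
    t_stat = observed_diff / se                     # roughly 7.1
    p_value = 2 * stats.t.sf(t_stat, df=2 * n - 2)  # two-sided p-value
    print(p_value, p_value < 0.05)                  # far below 0.05 despite a trivial effect size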
I agree. Great work. danielkueh (talk) 14:52, 10 July 2014 (UTC)
  1. Redmond, Carol; Colton, Theodore (2001). "Clinical significance versus statistical significance". Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics (3rd ed.). West Sussex, United Kingdom: John Wiley & Sons Ltd. pp. 35–36. ISBN 0-471-82211-6.
  2. Babbie, Earl R. (2013). "The logic of sampling". The Practice of Social Research (13th ed.). Belmont, CA: Cengage Learning. pp. 185–226. ISBN 1-133-04979-6.
  3. Faherty, Vincent (2008). "Probability and statistical significance". Compassionate Statistics: Applied Quantitative Analysis for Social Services (With exercises and instructions in SPSS) (1st ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 127–138. ISBN 1-412-93982-8.
  4. Sirkin, R. Mark (2005). "Two-sample t tests". Statistics for the Social Sciences (3rd ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 271–316. ISBN 1-412-90546-X.
  5. Cumming, Geoff (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, NY: Routledge. pp. 27–28.
  6. Krzywinski, Martin; Altman, Naomi (30 October 2013). "Points of significance: Significance, P values and t-tests". Nature Methods (Nature Publishing Group) 10 (11): 1041–1042. doi:10.1038/nmeth.2698. Retrieved 3 July 2014.
  7. Sham, Pak C.; Purcell, Shaun M. (17 April 2014). "Statistical power and significance testing in large-scale genetic studies". Nature Reviews Genetics (Nature Publishing Group) 15 (5): 335–346. doi:10.1038/nrg3706. Retrieved 3 July 2014.
  8. Johnson, Valen E. (9 October 2013). "Revised standards for statistical evidence". Proceedings of the National Academy of Sciences (National Academy of Sciences). doi:10.1073/pnas.1313476110. Retrieved 3 July 2014.
  9. Altman, Douglas G. (1999). Practical Statistics for Medical Research. New York, NY: Chapman & Hall/CRC. p. 167. ISBN 978-0412276309.
  10. Borror, Connie M. (2009). "Statistical decision making". The Certified Quality Engineer Handbook (3rd ed.). Milwaukee, WI: ASQ Quality Press. pp. 418–472. ISBN 0-873-89745-5.
  11. Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 300–344. ISBN 0-538-73352-7.

Ban "Statistically Significant"?

William Briggs, a consulting statistician who blogs as "Statistician to the stars!", writes, re Statistically Significant:

If I were emperor, besides having my subjects lay me in an amply supply of duck tongue, I’d forever banish this term. Anybody found using it would be exiled to Brussels or to any building that won an architectural award since 2000. I’d also ban the theory that gave rise to the term. More harm has been done to scientific thought with this phrase than with any other. It breeds scientism. --Source [3]

Probably too informal for here, but he's so right. --Pete Tillman (talk) 18:10, 3 August 2014 (UTC)

Merge in "Statistical threshold"

The proposal has been made by editor Animalparty to merge in the content from Statistical threshold. Since the concepts are very different, and a "statistical threshold" is an arbitrarily determined limit, I do not think that such a merge would be provident. --Bejnar (talk) 15:55, 12 August 2014 (UTC)

On 17 August 2014, after discussion at Afd, Statistical threshold was redirected to Statistical hypothesis testing. Any further discussion should be on that talk page. --Bejnar (talk) 13:37, 17 August 2014 (UTC)