# Talk:Statistical significance

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Mathematics (Rated C-class, Mid-importance)

This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

Field: Probability and statistics. One of the 500 most frequently viewed mathematics articles.

## Comment on Dubious

The citation (USEPA December 1992) contains numerous statistical tests, some presented as p-values and some as confidence intervals. Figures 5-1 through 5-4 show some of the test statistics used in the citation. In two of the figures the statistics are for individual studies and can be assumed to be prospective. In the other two, the statistics are for pooled studies and can be assumed to be retrospective. Table 5-9 includes the results of the test of the hypothesis RR = 1 versus RR > 1 for individual studies and for pooled studies. In the cited report, no distinction between prospective tests and retrospective tests was made. This is a departure from the traditional scientific method, which makes a strict distinction between predictions of the future and explanations of the past. Gjsis (talk)

## The History section

The existing section may slightly overstate Fisher's role. A search on "statistical significance" in Google Scholar will show that the term was not unknown before Fisher's book was published. Three citations from Google Scholar:

Wilson, Edwin Bidwell. "The statistical significance of experimental data." Science 58.1493 (1923): 93-100.

Boring, Edwin G. "The number of observations upon which a limen may be based." The American Journal of Psychology 27.3 (1916): 315-319.

Boring, Edwin G. "Mathematical vs. scientific significance." Psychological Bulletin 16.10 (1919): 335. 172.250.105.20 (talk) 20:57, 17 January 2014 (UTC)

That is true. The term did precede Fisher. But the "modern" (for lack of a better word) practice of tying p-values to statistical significance did originate from him. The question now is how best to write a text on the pre-Fisherian use of the term. It's great that you found these primary sources. But it would be better if we had one or two secondary sources that have done all the work for us. danielkueh (talk) 22:53, 17 January 2014 (UTC)
Can we please discuss what it is that should be included in the History section? I have no problems expanding the history section but these two newly inserted sentences appear to be original research. Plus, it would be helpful to specify the exact page numbers of those two citations. Again, I have no problems expanding the section. In fact, I enthusiastically support it. But I think we need to craft something that is consistent and faithful to the sources. danielkueh (talk) 22:49, 20 January 2014 (UTC)
The two cited books on the history of statistics completely support the sentences that you removed. As supplemental evidence I have both primary and secondary sources (none of which need be cited in the text because the books are better histories):
Arbuthnot, John. 1710. An Argument for Divine Providence, taken from the Constant Regularity observ'd in the Births of Both Sexes. Philosophical Transactions of the Royal Society, 27: 186-90. An early example of a statistical significance test. The null hypothesis was that the number of males and females born were equal. The data showed a consistent excess of males. The odds of that were so low as to cast doubt on the null hypothesis.
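Arbuthnot's argument, as summarized above, amounts to a short calculation that can be sketched as follows (the 82-year figure is the run of male excesses reported in the 1710 paper; everything else is illustrative):

```python
# Illustrative sketch of Arbuthnot's sign test: under the null hypothesis
# that male and female births are equally likely, the chance of a male
# excess in all 82 consecutive years is (1/2)^82.
from fractions import Fraction

years = 82  # consecutive years with a male excess, per Arbuthnot (1710)
p_null = Fraction(1, 2) ** years  # probability of the data under the null

print(float(p_null))  # astronomically small, casting doubt on the null
```

The resulting probability is on the order of 10^-25, which is why the data were taken to cast doubt on the null hypothesis.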
Can you provide a specific quote that uses the word "significance" or "statistical significance?" Excess of males is a criterion, but that doesn't mean it is an example of statistical significance. Can you also provide a secondary source that cites this study as an early pre-Fisherian example of statistical significance? Otherwise, this is an example of original research. danielkueh (talk) 21:55, 24 January 2014 (UTC)
"[Fisher] was the first to investigate the problem of making sound deductions from small collections of measurements. This was a new topic in statistics. Recall that Pearson had used large data sets." Probability and Statistics: The Science of Uncertainty. John Tabak. ISBN 9780816049561, page 144. "Although it is no longer used in quite the way that Pearson preferred, the [chi-squared] test is still one of the most widely used statistical techniques for testing the reasonableness of a hypothesis." page 141. The first person discussed in a chapter "The Birth of Modern Statistics" was Karl Pearson. The second was Ronald Fisher.
Yes, Pearson had used large data sets and preferred the chi square, but where is the term "statistical significance"? Another example of original research danielkueh (talk) 21:55, 24 January 2014 (UTC)
Fisher described Pearson's Chi-squared distribution as "the measure of discrepancy between observation and hypothesis" in the Introductory to the 11th Edition of Fisher's Statistical Methods for Research Workers. Pearson's work was repeatedly cited by Fisher.
Just because you cite someone repeatedly doesn't mean they came up with the idea that you are proposing. It could just mean that you are building on what they have discovered or formulated. In any event, did Fisher specifically refer to Pearson's work as tests of significance? danielkueh (talk) 21:55, 24 January 2014 (UTC)
"[F]rom the first edition it has been one of the chief purposes of this book to make better known the effect of [Gosset's] researches...", from the Introductory to the 11th Edition of Fisher's Statistical Methods for Research Workers. Gosset's work was repeatedly cited by Fisher. Gosset's work provided the foundation for the t-test.
Where is the word "significance?" danielkueh (talk) 21:55, 24 January 2014 (UTC)
"The history of Student [Gosset] and the history of Fisher are inextricably linked. Fisher not only championed Student’s work, but Student exerted a profound influence on the nature and direction of Fisher’s research for nearly two decades." On Student’s 1908 Article “The Probable Error of a Mean”. S. L. Zabell. Journal of the American Statistical Association March 2008, Vol. 103, No. 481 DOI 10.1198/016214508000000030 — Preceding unsigned comment added by 172.250.105.20 (talk) 20:25, 24 January 2014 (UTC)
I'm afraid this doesn't tell us anything about the origins of statistical significance. danielkueh (talk) 21:55, 24 January 2014 (UTC)
I now have enough data to reject my null hypothesis of good faith. 172.250.105.20 (talk) 19:16, 25 January 2014 (UTC)
Sorry to see you feel that way. But you really need to review WP's policy on original research, WP:OR. Cheers. danielkueh (talk) 21:25, 25 January 2014 (UTC)
"Wikipedia articles must not contain original research. The phrase "original research" (OR) is used on Wikipedia to refer to material—such as facts, allegations, and ideas—for which no reliable, published sources exist. This includes any analysis or synthesis of published material that serves to advance a position not advanced by the sources. To demonstrate that you are not adding OR, you must be able to cite reliable, published sources that are directly related to the topic of the article, and directly support the material being presented." No problem.
Historical Origins of Statistical Testing Practices: The Treatment of Fisher Versus Neyman-Pearson Views in Textbooks, Carl J. Huberty, Journal of Experimental Education, 61(4), pages 317-333, 1993. Table 1 of Statistical Testing Applications lists Arbuthnot, LaPlace, K Pearson and Gosset with dates well before Fisher in 1925. Huberty notes that the logic of (significance) testing was present even if the modern terminology was not. Huberty's sources for the table are 4 histories of probability and statistics.
What are your remaining objections to my proposed edit (in enough detail to address them)? You have objected to two sentences while I see nothing wrong with either. "While antecedents extend centuries into the past, statistical significance is largely a development of the early twentieth century." Given that p-values were computed centuries ago (P-value#History), the first sentence is largely immune from attack. "Major contributors include Karl Pearson, William Sealy Gosset, Ronald Fisher, Jerzy Neyman and Egon Pearson." Which names do you object to and why? 172.250.105.20 (talk) 20:01, 27 January 2014 (UTC)
I have already given a point-by-point reply to each of your statements above. If you read them carefully and think it through, you will see why your interpretations and conclusions of the sources are inconsistent with these two WP policies: WP:OR and WP:SYNTH. If you go over WP:OR carefully, you will see that it clearly states that "any interpretation of primary source material requires a reliable secondary source for that interpretation." And if you look at WP:SYNTH, you will see that it states "Do not combine material from multiple sources to reach or imply a conclusion not explicitly stated by any of the sources." danielkueh (talk) 21:05, 27 January 2014 (UTC)
Did you look at Huberty? "Statistical testing was applied by the English scholar John Arbuthnot nearly 300 years ago." p. 317. 172.250.105.20 (talk) 19:31, 28 January 2014 (UTC)
Where is the word "significance"? danielkueh (talk) 20:31, 28 January 2014 (UTC)
Huberty does not use the adjective. Look at the title of the paper, or better yet, read it. 172.250.105.20 (talk) 00:52, 30 January 2014 (UTC)
If he doesn't use it, then we can't use it as well. This article is specifically about statistical significance. Not hypothesis testing. You should take your own advice and learn the policies. Better yet, read them. danielkueh (talk) 01:06, 30 January 2014 (UTC)
Ah! A difficult distinction to make and one that usually is not made. Thanks for the clarification. I was eventually planning to ask the reason for two articles. 172.250.105.20 (talk) 01:17, 30 January 2014 (UTC)

## Reverted recent lead edit

I know I am supposed to be retired and should not be editing anymore but I could not help myself when I saw this recent edit to the lead by Ehrenkater, which I reverted. I did so for two reasons:

• The recent edits (the definition and the coupling of alpha and p-values) are inconsistent with the cited sources. In WP, we go with the sources, not with personal opinions, expertise, or interpretations (see WP:V and WP:RS).
• Statistical significance cannot be a test. A t-test is a test. An ANOVA is a test. But statistical significance is not a test. When we see p-values that are less than .05, we would normally say they are statistically significant. p-values are probabilities, not tests. Plus, whether significant p-values help us form a view is entirely dependent on the individual researcher. And statistical significance can, in principle, be applied to all data, not just statistical ones.

Those are my two cents. danielkueh (talk) 20:14, 11 April 2014 (UTC)

The lead currently starts with the sentence: "Statistical significance is the probability that an effect is not due to just chance alone." That is simply not true. For a start, "statistical significance" is not even a number, it is just the name for the general concept of significance testing. Second -- and this is the point which is most likely to confuse readers -- if a result is stated to be significant at the 2.5% level (say), that does not mean that it has a 2.5% probability of being true. In order to work out the probability of the result being true, you would have to come up with an a priori probability and then apply Bayes' theorem.----Ehrenkater (talk) 22:03, 13 April 2014 (UTC)
Of course statistical significance is not just probability. It is the "probability that an effect is not due to just chance alone" (See p. 19 of the first citation [1] and p. 58 of the second citation [2]). You need to look at the entire sentence, and not just that one word alone. danielkueh (talk) 03:03, 14 April 2014 (UTC)
The sentence is still wrong. "The probability that an effect is not due to just chance alone" means "The probability that the null hypothesis is false", which is wrong and commits the inverse probability fallacy. I suggest something like "The probability that an effect would be observed due to chance alone" meaning "The probability of seeing the effect GIVEN that the null hypothesis is true". Bjfar (talk) 06:18, 15 May 2014 (UTC)
Uhmmm, the current sentence does not imply what you think it implies. It merely states that whatever mean differences you see are not due to just chance alone. It does not commit the inverse fallacy. And the alternative proposal makes no sense. If the mean differences are due to chance "alone," they are not significant. Why on earth would anyone care if it is all just due to chance?!?! Plus, the current sentence is consistent with the sources (see above), which means it is also consistent with WP policy (see WP:V and WP:RS). danielkueh (talk) 20:30, 15 May 2014 (UTC)
The definition of statistical significance is extremely confused. A p-value is the probability of obtaining at least as extreme results as you got, given the truth of the null hypothesis. A result is statistically significant if the p-value is lower than alpha. Statistical significance is not by itself a probability. It is not even a probability given some other factors. A p-value is not the probability that the results are due to chance. This is because a p-value is calculated under the assumption that all deviations from the null hypothesis are due to chance. Clearly, a p-value cannot be a measure of the probability that results are due to chance if you already assume that probability is 100%. As another example, suppose you achieved p = 0.04 for an experiment testing if copper bracelets are an effective treatment for diabetes. Clearly, the probability that that result is due to chance is 100%, not 4% (since copper bracelets do not work for diabetes). Instead, it is better to write something like "A difference is statistically significant if the probability of obtaining at least as extreme results, given the truth of the null hypothesis, is very low". "By chance alone" should be banned from the conversation. It is a classic misunderstanding of statistical significance testing. EmilKarlsson (talk) 11:18, 23 May 2014 (UTC)
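The definition in the comment above can be sketched as a short calculation (all numbers are hypothetical; the helper function is an illustration, not anyone's cited method):

```python
# Sketch of the definition above: the p-value is the probability of
# results at least as extreme as those observed, computed under the
# null hypothesis; the result is called "statistically significant"
# when the p-value falls below alpha.
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided exact p-value: Pr(X >= successes) under the null."""
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))

alpha = 0.05
p = binomial_p_value(60, 100)  # hypothetical: 60 heads in 100 fair-coin tosses
significant = p < alpha

print(p, significant)
```

With these hypothetical numbers the one-sided p-value comes out just under 0.03, so the result would be declared significant at the 0.05 level; nothing in the calculation refers to the probability that the null itself is true.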
Do you have a reference for your proposed definition, preferably a secondary source? danielkueh (talk) 14:19, 23 May 2014 (UTC)
You do not seem to be addressing the arguments. Yes, there are many sources, such as Kline (2004), Gigerenzer (2004), Goodman (2008), Sterne (2001), or even the classic Cohen (1994). You can also read the Wikipedia article on p-values: https://en.wikipedia.org/wiki/P-value as it has additional sources. EmilKarlsson (talk) 20:27, 25 May 2014 (UTC)
Please specify exact page numbers, book/chapter titles, and specific sentences that support your proposed definition. Also, provide specific links. If you want to propose a change to the lead definition, then you need to do a bit of work. There are already sources for the present definition (see above). Also, please review WP policy pages such as WP:V, WP:RS, WP:lead, and WP:OR. They are really helpful. danielkueh (talk) 23:07, 25 May 2014 (UTC)
I find it incredibly embarrassing for wikipedia to have an incorrect definition of statistical significance as the lead definition. Starting off the definition of the page with a well known fallacy is terrible. This definition is, in fact, the very first misconception listed on the entry for p-values - the probability of being due "to chance alone" having the most straightforward interpretation as being the probability that the null hypothesis is true. In support of EmilKarlsson's definition, see Agresti and Franklin (2013), Chapter 9, page 413 for a definition of the significance level as the p-value needed to reject the null and a result is significant if the null is rejected (i.e. p < significance level is observed). --24.136.34.125 (talk) 18:07, 5 June 2014 (UTC)
24.136.34.125, Thank you for the reference but parts of the book by Agresti and Franklin (2013) are not accessible online. Could you just type out the relevant sentences or provide a link to a screenshot of the page? That would be helpful. Also, bear in mind that the definition of the current article is on statistical significance, not significance level. A subtle but very important difference. And if you already have a WP account, please log into that account so that we can keep track of the comments on this thread. Thanks. danielkueh (talk) 19:01, 5 June 2014 (UTC)

──────── In an attempt to address some of the criticisms of the current definition on this thread, I wrote out the sentences from the two sources that support the current definition (see p. 19 of [3] and p. 58 of [4]). Here they are:

• Coolidge FL. (2012). Statistics: A gentle introduction. ISBN: 1412991714
p. 19
“Significance refers to an effect that has occurred that is not likely due to chance. In statistics, the opposite of significant is nonsignificant, and this means that an effect is likely due to chance. It will be important for you to remember that the word insignificant is a value judgment, and it typically has no place in your statistical language.”
• Norman GR, Streiner DL. (2007). Biostatistics: The bare essentials. ISBN: 1550093479
p. 58.
“Statistical significance, if you read the fine print once again, is simply an issue of the probability or likelihood that there was a difference—any difference of any size.”

Also, I found three more sources that describe statistical significance as follows:

• Surgical Research (2001), edited by Souba and Wilmore (Eds) ISBN: 0126553300 [5])
p. 121
"Statistical significance is the degree of certainty with which an inference can be drawn. It is not a threshold level beyond which the inference clearly represents truth."
• Gad and Chengelis (2009) Acute Toxicology Testing: Perspectives and Horizons. ISBN: 0936923067 [6]
p. 347
"Statistical significance is expressed in terms of probability (or level of confidence in a result) and has associated with it two kinds of possible outcomes."
• Sheskin (2004). Handbook of Parametric and Nonparametric Statistical Procedures. ISBN: 1584884401 [7]
p. 56
"The term statistical significance implies that one is determining whether or not an obtained difference in an experiment is due to chance or is likely to be due to the presence of a genuine experimental effect."
p. 58
"Statistical significance only implies that the outcome of a study is highly unlikely to have occurred as a result of chance."

If you believe that the above cited sources are not representative of mainstream statistics, and that more reputable and reliable sources (See WP:RS) are available, then by all means, provide them. I would be delighted if the definition of this article can be improved much further. That said, please note that Wikipedia is not a forum (see WP:forum) and that every statement needs to be verified (WP:V). We cannot change definitions or write whole articles based on arguments alone (See WP:OR). danielkueh (talk) 19:41, 5 June 2014 (UTC)

With respect to Agresti, the relevant definitions are
"Definition: significance level is a number such that we reject H_0 if the P-value is less than or equal to that number."
and
"Definition: When we reject H_0, we say the results are statistically significant."
hence a result being "significant" simply means that we observed a p-value smaller than some prescribed cutoff, and that as a result we rejected the null hypothesis.
The following quotation from Sir David Cox supports defining a result as statistically significant at a given level if we observe a p-value of a test statistic (chosen a priori) that is less than that level. The whole paper is available here http://www.phil.vt.edu/dmayo/PhilStatistics/a%20Cox%201982%20Statistical%20Significance%20Tests.pdf
"The value of P is called the (achieved) significance level. If P < 0.05, we say the departure is significant at the 0.05 level, if P < 0.01 that the departure is significant at the 0.01 level, and so on"
I would consider anything Cox says to be essentially canonical. This is a slightly different definition than the definition used in Agresti, although it is clear from context that they are both saying the same thing. But I agree with the original phrasing (from months ago) that statistical significance can refer to many things and isn't subject to one strict definition. But neither of these two definitions (both coming from famous pure statisticians) support the notion that statistical significance is a single number, let alone the probability of something having occurred not due to chance (the only reasonable interpretation of the current lead definition is that the statistical significance is the posterior probability that the alternative is true, and this is false and is effectively contradicted in the p-value article).
With respect to the definitions you've provided, a few things should be borne in mind - these aren't intended to be direct arguments for supporting a change, but are more to frame how we should interpret the passages you quoted.

(1) Statisticians are apt to use the word "likely" even when they are not (and, by assumption, cannot be) talking about a probability. This usage of "likely" and "unlikely" goes all the way back to R.A. Fisher, the father of the p-value, significance testing, and the likelihood function. In frequentist statistics, for example, we do not assign the null and alternative hypotheses probabilities - the null is either true, or false. Nevertheless, we might talk about the null being "unlikely" given the evidence. In fact, Jim Berger (another famous statistician) does so here at the beginning of Section 2: http://www.stat.duke.edu/~berger/papers/02-01.pdf but it is very clear from the article that he is not interpreting "unlikely" to mean "with low probability" - see in particular the Criticisms of p-values section. In fact, people jump through these semantic hoops specifically to avoid making probabilistic statements. In light of this, the citations you gave of Coolidge; Souba and Wilmore; and Sheskin do not necessarily buttress the definition currently in the article. I suspect the second quotation from Sheskin is being made with the intent to remind the reader precisely that statistical significance does not imply that the null is true with low probability, but I don't have the book with me for context.

(2) p-values are of course linked with probability, but not the probability of the null; it is the probability of something occurring assuming the null is true. So a simple reference to p-values representing probabilities does not support the current definition - in particular, I don't see how Gad and Chengelis are supporting the current definition.
Finally, the quote provided from Norman and Streiner lacks enough context for me to infer what they are claiming.--Thetrickstermage (talk) 01:27, 6 June 2014 (UTC)
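The distinction drawn in the comment above - between Pr(data at least this extreme | null) and Pr(null | data) - can be illustrated numerically. This is a hedged sketch: the data, the specific alternative hypothesis, and the equal prior weights are all assumptions made purely for the example.

```python
# Sketch: a small p-value does not mean the null has a small posterior
# probability. Hypothetical data: 60 heads in 100 tosses.
from math import comb

n, k = 100, 60
# One-sided p-value under H0: p = 0.5 (Pr of data at least this extreme | H0).
p_value = sum(comb(n, j) * 0.5**n for j in range(k, n + 1))

# Posterior for H0 against a specific alternative H1: p = 0.6,
# with equal prior weight on each hypothesis (Bayes' theorem).
like_h0 = comb(n, k) * 0.5**k * 0.5**(n - k)
like_h1 = comb(n, k) * 0.6**k * 0.4**(n - k)
posterior_h0 = like_h0 / (like_h0 + like_h1)

print(p_value, posterior_h0)  # the two numbers differ markedly
```

Under these assumptions the p-value is about 0.03 while the posterior probability of the null is roughly 0.12 - four times larger - which is exactly why the p-value cannot be read as "the probability the result is due to chance".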
Thetrickstermage, thank you for your reply. I am going to assume you are ip 24.136.34.125? I wish to clarify a couple of the points you made:
• I provided links (numbers in brackets) to the quotes above. If you want more context, you can click on those links and they will take you to Google Books where you can read whole passages.
• These quotes were provided not just to support the current definition but to address some of the comments made by other contributors as well. For instance, you don't have to "interpret" or overthink the sentence by Gad and Chengelis. In fact, you just have to pay attention to the first part of the sentence, "statistical significance is expressed in terms of probabilities...." This part of the statement alone contradicts the claim made by Ehrenkater, one of the editors on this thread, that "statistical significance is not even a number."
• I am deeply perplexed when you say "citations you gave of Coolidge; Souba and Wilmore; and Sheskin do not necessarily buttress the definition currently in the article." The current lead definition is essentially a paraphrase of those definitions. Anyone who puts them up side by side will notice the obvious similarities. However, if you believe the lead definition could be tweaked to closely reflect those definitions, then we can certainly work on that. For example, I am open to changing "probability" to "likelihood."
• With respect to the definition by Agresti, statistical significance is not synonymous with significance level. In fact, the definition of significance level in this article and the way it is used are consistent with the Agresti definitions (or more correctly, description of the method to identify significance). But it doesn't contradict the lead definition. Yes, we would reject the null if the p-value is less than a predetermined significance level. This procedure is also described in this article. No contradictions there. However, in light of this discussion, I do believe the second paragraph of the lead could be clarified and improved.
• Based on your response and those of other editors, there appears to be confusion with respect to the method used to identify statistical significance and statistical significance itself. A simple but crude analogy would be to confuse science for the scientific method.
• The Cox source does not contradict the lead definition at all. If anything, it is very consistent with it. In fact, take a look at p. 327 where Cox describes statistical significance as follows:
"Statistical significance is concerned with whether, for instance, the direction of such and such an effect is reasonably firmly established by the data under analysis." Much of his article is devoted to describing mechanics of significance tests, not the concept of statistical significance itself.
Thus, I still haven't seen any source that explicitly describes statistical significance in a way that challenges the current lead definition. If anything, I see more sources that are very consistent with the lead definition. danielkueh (talk) 02:39, 6 June 2014 (UTC)
Thetrickstermage, after much thought, I am starting to see your point about the way probability is used in the lead sentence. We have two options. We could delete "not" from the lead definition and change probability to low probability such that:
• "Statistical significance is the low probability that an effect is due to just chance alone."
or we could change probability to likelihood like so:
• "Statistical significance is the likelihood that an effect is not due to just chance alone."
I generally prefer the former as I find the term, likelihood, to be wishy-washy. So I went ahead and deleted "not" from the lead sentence. But if you prefer, we can change it to the latter. Thank you for your efforts. danielkueh (talk) 13:12, 6 June 2014 (UTC)
Okay, let me try to troubleshoot our disagreement. There is no shortage in the literature of sources we can appeal to, so lets set that aside for the moment and if we agree on the content we can address proper citations later.
1. Do you agree that "the probability that an effect is due to chance alone" has the most natural interpretation as Pr(random chance generated the observed effect | the data)?
Not necessarily. The problem is the lack of a qualifier. In this case, changing "probability" to "high probability" would have clarified the sentence. But that is not what the current lead sentence says now, so we can move on. danielkueh (talk) 19:40, 6 June 2014 (UTC)
• If you do, do you recognize that this is a Bayesian posterior probability (it is equivalent to Pr(the null hypothesis | the data)) and that if statistical significance is defined in these terms then frequentists cannot talk about statistical significance since for a frequentist Pr(the null | the data) is either 0 or 1?
NA. danielkueh (talk) 19:40, 6 June 2014 (UTC)
• If you do not, don't you think that the fact that several editors have pointed out that they believe that this is the most natural interpretation suggests that - whether they are right or wrong - something more precise/better could be used instead so that it is clear that this is not the interpretation the article intends? I think, especially for issues as subtle as this, that we err on the side of more detail than less.
Yes, hence, my proposals and recent revisions to this article. danielkueh (talk) 19:40, 6 June 2014 (UTC)
2. Do you agree that the probabilities associated with "statistical significance" are probabilities such as Pr(the test statistic happened to be at least as extreme as we actually observed | random chance generated the data)? If so, do you recognize that this is different than the previous probability?
Yes, I agree. It is different because there is a lack of a qualifier. See above. danielkueh (talk) 19:40, 6 June 2014 (UTC)
3. Do you insist that statistical significance should itself be a number? Agresti gives an explicit definition of statistically significant - the only explicit definition I've seen in any of the sources - that merely assesses a result as statistically significant if the null hypothesis is rejected. I find this very agreeable.
I am not insisting anything. That is what the sources say. You must have overlooked them. For instance, take a look at p. 306 of Statistics for the Social Science by Sirkin [8] where he describes statistical significance as follows:
"Statistical significance: The high probability that the difference between two means or other finding based on a random sample is not the result of sampling error but reflects the characteristics of the population from which the sample was drawn."
Here's another passage on p. 66 of Statistics Explained: An Introductory Guide for Life Scientists by McKillup (2012) [9]
"Many statistics texts do not mention this and students often ask "What if you get a probability of exactly 0.05?" Here the result would be considered not significant since significance has been defined as a probability of less than 0.05 (<0.05). Some texts define a significant result as one where the probability is 5% or less (≤ 0.05). In practice, this makes very little difference, but because Fisher proposed the 'less than 0.05' definition, which is also used by most scientific publications, it will be used here."
There are numerous other sources (including the ones mentioned above) that make similar points as well. danielkueh (talk) 19:41, 6 June 2014 (UTC)
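The boundary case McKillup raises in the passage quoted above - what happens at exactly p = 0.05 - reduces to the choice between a strict and an inclusive inequality, which can be sketched as:

```python
# Sketch of the two conventions McKillup describes: Fisher's "less than
# 0.05" rule versus the "0.05 or less" rule found in some texts. They
# disagree only when the p-value lands exactly on the threshold.
alpha = 0.05
p = 0.05  # hypothetical result landing exactly on the cutoff

fisher_convention = p < alpha    # strict inequality: not significant
inclusive_convention = p <= alpha  # inclusive: significant

print(fisher_convention, inclusive_convention)  # False True
```

As the quote notes, in practice the choice makes very little difference, since an exact tie with the threshold is rare.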
I think both of the new proposed definitions given are inadequate; people do not typically say things like "the statistical significance is some number" - Cox would rather say that the attained statistical significance level is some number. Perhaps we might say individual results are statistically significant or not, and that statistical significance itself is the topic of whether results are statistically significant or not. And I agree that likely and unlikely are wishy-washy when used in this sense, but it is better to be wishy-washy than to be wrong.
But, if you insisted on some modification of the current lead, the closest thing to correct would roughly be "statistical significance is the a priori probability that an effect would be at least as big as the effect that was actually observed, assuming chance alone generated the data", although this could probably be phrased better and some thought would have to go into explaining how the definition covers two-sided tests (taking an absolute value covers it). But this definition just coincides with one of the definitions of a p-value, which I don't think is what we really want. I don't particularly like that this excludes a Bayesian perspective on significance either, but I think that is likely to be missing regardless of how this is resolved, and it is in line with how the phrase statistical significance is used in classrooms - at least, this is how it is taught at the intro level at this institution - and how it is used in practice. --Thetrickstermage (talk) 19:12, 6 June 2014 (UTC)
You have written quite a bit here and I apologize for not responding to all of it in detail as I have other duties to attend to (hence my supposed retirement from Wikipedia). Plus, I have already made several changes to the lead paragraph. So if you can take a look at them and provide some feedback, that would be helpful as well. Your proposed definition is correct, but because it is so jargon-laden, it is practically impenetrable to a lay reader (see WP:NOTJARGON). I prefer spreading it out a little bit, like the current first lead paragraph. danielkueh (talk) 19:40, 6 June 2014 (UTC)
Surely we should favor a correct definition, even if this requires some jargon. I would propose as an alternative definition "A result is statistically significant if there is enough statistical evidence to reject the hypothesis that an observed effect occurred due to chance alone." Then one could elaborate on how we can use p-values and significance levels to assess the amount of statistical evidence in the data. This is both in line with Agresti's definition and reflective of how the term is used in practice.
In response to your passages from Sirkin and McKillop - neither of these authors is a statistician (the former is a political scientist, the latter apparently a marine biologist?). There is a reason these issues are regarded as misconceptions. In particular, I am totally incapable of reading Sirkin's definition as anything other than the posterior probability of the null, which we both agree is wrong. They are not old misconceptions; they are alive and are being perpetuated by people outside of - and sometimes within - the discipline of statistics. It isn't surprising that there are published examples, particularly from the life or social sciences, of authors not being careful with their definitions, even if these authors use their methods in a manner which is consistent with practice. Sirkin's definition should not be favored over an explicit definition given by a renowned academic statistician. On the other hand, McKillop is not defining statistical significance itself as a number, but describing a criterion by which one determines whether a result is statistically significant - and this is in line with Agresti. If the p-value is low, we reject the null, and declare that the result is significant; but we don't define statistical significance itself as a number, a probability, or anything else. But it also isn't certain from the passage quoted that McKillop is not also making a mistake - he is calling the p-value a "probability", and while this isn't strictly wrong depending on how it is interpreted, it is dangerously close to another misconception that was explicitly discussed in the paper I link by James Berger, namely that p-values are not error probabilities.--Thetrickstermage (talk) 20:48, 6 June 2014 (UTC)
In response to the recent edits - I think it is absolutely critical that it be emphasized that all calculations related to significance are made assuming that the null is true. Talking about probabilities that effects occurred due to chance invites the inference by the untrained or naive reader that we are allowing that the probability of the null being true is some number in (0, 1), when in the frequentist paradigm it is either 0% or 100%.--Thetrickstermage (talk) 20:55, 6 June 2014 (UTC)

I will return to your larger comment later. To address your second comment about conditional probability (assuming that the null is true), it would help me a great deal if you could identify a specific sentence in the lead where we can insert that qualifying statement about the null being true. danielkueh (talk) 20:59, 6 June 2014 (UTC)

Ignoring my other objections to the definition, this might be incorporated by changing the lead to "low probability that the observed effect would have occurred due to chance." This is still too subtle for my taste, but the term "would have" implies that we are conditioning on "chance" (whatever that means) generating the data. --Thetrickstermage (talk) 22:23, 6 June 2014 (UTC)
Done. danielkueh (talk) 22:56, 6 June 2014 (UTC)

## Sources showing that the leading definition "due to chance" of statistical significance is wrong

Because it is difficult to follow the discussion about the leading sentence above, I have made a new section. Some people have requested that I produce exact sentences from the sources I mentioned that reject the definition of statistical significance as "low probability that an observed effect would have occurred due to chance" and embrace the definition "the probability to observe at least as extreme results, given the truth of the null hypothesis, is low". Here they are:

"The p value is the probability of getting our observed results, or more extreme results, if the null hypothesis is true. So p is defined in relation to a stated null hypothesis, and requires as the basis for calculation that we assume the null is true. It's a common error to think p gives the probability that the null is true: That's the inverse probability fallacy." (p. 27)

"Note carefully that .011 was the probability of particular extreme results if the null hypothesis is true. Surprising results may reasonably lead you to doubt the null, but p is not the probability that the null is true. Some statistics textbooks say that p measures the probability that "the results are due to chance" -- in other words, the probability that the null hypothesis is correct. However, that's merely a restatement of the inverse probability fallacy. It is completely wrong to say that the p value is the probability that the results are due to chance." (p. 28)

Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

"Misinterpretations of p Values

Next we consider common misunderstandings about the probabilities generated by statistical tests, p values. Let us first review their correct interpretation. Recall that statistical tests measure the discrepancy between a sample statistic and the value of the population parameter specified in the null hypothesis, H0, taking account of sampling error. The empirical test statistic is converted to a probability within the appropriate central test distribution. This probability is the conditional probability of the statistic assuming H0 is true." (p. 63)

"Fallacy Number 1

A p value is the probability that the result is a result of sampling error; thus, p < 0.05 says that there is less than a 5% likelihood that the results happened by chance. This false belief is the odds-against-chance fantasy. It is wrong because p values are computed under the assumption that sampling error is what causes sample statistics to depart from the null hypothesis. That is, the likelihood of sampling error is already taken to be 1.00 when a statistical test is conducted. It is thus illogical to view p values as measuring the probability of sampling error." (p. 63)

Kline, R. B. (2005). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, DC: American Psychological Association.

"In the process of calculating the P value, we assumed that H0 was true and that x was drawn from H0. Thus, a small P value (for example, P = 0.05) merely tells us that an improbable event has occurred in the context of this assumption."

Krzywinski, M., & Altman, N. (2013). Points of significance: Significance, P values and t-tests. Nat Meth, 10(11), 1041-1042.

"The P value, which was introduced earlier by Fisher in the context of significance testing, is defined as the probability of obtaining — among the values of T generated when H0 is true — a value that is at least as extreme as that of the actual sample (denoted as t). This can be represented as P = P(T ≥ t | H0)."

Sham, P. C., & Purcell, S. M. (2014). Statistical power and significance testing in large-scale genetic studies. Nat Rev Genet, 15(5), 335-346.

"The P value from a classical test is the maximum probability of observing a test statistic as extreme, or more extreme, than the value that was actually observed, given that the null hypothesis is true."

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313-19317.

"The p-value can be interpreted as the probability of obtaining the observed differences, or one more extreme, if the null hypothesis is true." (p. 103)

Machin, D., Campbell, M. J., Walters, S. J. (). Medical Statistics. A textbook for the health sciences. 4th edition. West Sussex, England: Wiley.

"The P value is the probability of having our observed data (or more extreme data) when the null hypothesis is true." (p. 167)

"A common misinterpretation of the P value is that it is the probability of the data having arisen by chance, or equivalently, that P is the probability that the observed effect is not a real one. The distinction between this incorrect definition and the true definition given earlier is the absence of the phrase when the null hypothesis is true." (p. 170)

Altman, D. G. (1999). Practical Statistics for Medical Research. New York: Chapman & Hall/CRC.

Based on these references, the leading definition should be changed to something like:

A result is statistically significant if the probability of observing the results, or more extreme results, given the truth of the null hypothesis, is low.

EmilKarlsson (talk) 13:09, 20 June 2014 (UTC)
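The definition shared by the sources quoted above, P = P(T ≥ t | H0), can be illustrated with a short pure-Python sketch. The function name and the choice of an approximately standard-normal test statistic are my illustrative assumptions, not anything from the sources:

```python
import math

def upper_tail_p_value(t: float) -> float:
    """P(T >= t | H0) for a test statistic T that is
    standard normal under the null hypothesis H0.
    Uses the normal survival function, written via erfc."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# Hypothetical observed test statistic t = 2.5:
p = upper_tail_p_value(2.5)  # about 0.006

# Note: p is the probability of a result at least this extreme
# *computed assuming H0 is true*. It is NOT the probability that
# H0 is true -- that reading is the inverse probability fallacy.
```

The whole calculation conditions on H0, which is exactly the phrase ("given the truth of the null hypothesis") that the "due to chance" wording omits.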

EmilKarlsson, thanks for producing these sources. Point taken and I think your proposed lead definition is acceptable. Right now, I have only two requests. Could you:
• begin your lead definition as "statistical significance is XXXX"? (WP:BEGINNING)
• Take a look at the first lead paragraph to make sure that your proposed lead definition "flows" with the rest of the paragraph. If not, feel free to propose additional suggested edits.
I know I have been a pain in the rear about demanding sources, etc. But believe me, for the contents of this page to endure, sources and consensus are critical. Thanks again for your time. danielkueh (talk) 17:52, 20 June 2014 (UTC)
Here are some candidate suggestions that may fulfill the requirements above:
(1) "Statistical significance obtains when the probability of an observed result (or a more extreme result) is low, given the truth of the null hypothesis."
(2) "Statistical significance obtains when the probability of an observed result (or a more extreme result) is low, had the null hypothesis been true."
(3) "Statistical significance obtains when the probability of an at least as extreme result is low, given the truth of the null hypothesis."
(4) "Statistical significance obtains when the probability of an at least as extreme result is low, had the null hypothesis been true."
Constructing a good sentence that captures both "probability of the results or more extreme results" as well as "given the null hypothesis" is a bit tricky. Do any of the candidates above seem to have more clarity and be less convoluted than the others? I tend to write things too bulkily, so I would be more than grateful for input. EmilKarlsson (talk) 14:47, 21 June 2014 (UTC)
If I were a naive reader reading these sentences for the first time, I would get an impression of how to achieve statistical significance, but I don't think I would understand what statistical significance itself is. It is like knowing the process of achieving a gold medal (getting first place) without actually knowing what a gold medal really is. Much of definitions (1-4) focuses on conditional probability, which I have no problems with, but the emphasis should be on what statistical significance is. It is when the probability is "low," i.e., below a predefined threshold, like p < .05. How can we capture that in the lead sentence without overwhelming a naive reader? Here's a definition from another source that I think does just that, which is reasonably broad and is consistent with the definitions you put forward:
• Statistical significance refers to whether or not the value of a statistical test exceeds some prespecified level. ([10] p. 35 of Redmond and Colton, 2001, Biostatistics in Clinical Trials)
We should strive for something that is just as simple and succinct and consistent with available sources. If we need to add more information, we will have plenty of space to do so in the first lead paragraph. danielkueh (talk) 02:36, 22 June 2014 (UTC)
Here is another suggestion that attempts to incorporate some of the points that you made above:
(5) Statistical significance occurs when the probability of getting a result or a more extreme result (given that the null hypothesis is true) is lower than a certain pre-specified cutoff.
It is still a little bit bulky. How can we improve this? EmilKarlsson (talk) 19:13, 22 June 2014 (UTC)
Better. But I think we can simplify it a little further. We both agree that for there to be statistical significance, the p-value has to be less than a predefined value like p<.05. Without qualifying terms like "low" or "below a predefined value," we're really just defining p-values. So how about something like this:
(6) Statistical significance is the low probability of getting at least as extreme results given that the null hypothesis is true.
The key term here is "low," which could be explained further in subsequent sentences. danielkueh (talk) 19:27, 22 June 2014 (UTC)
I think that (6) is a really good definition and it has my support. Perhaps the leading paragraph needs to be modified, both to explain what "low probability" means in this context and to fix some of the subsequent sentences that assumes the "due to chance" definition. Here is one potential suggestion:
(a) Statistical significance is the low probability of getting at least as extreme results given that the null hypothesis is true. It is an integral part of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected. In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect could have occurred due to sampling error alone. But if the probability of obtaining at least as extreme result (given the null hypothesis) is lower than a pre-determined threshold (such as 5%), then an investigator can conclude that it is unlikely that the observed effect could only be explained by sampling error.
How can we make this better? Is there anything that is obviously missing for a good leading paragraph? Does it feel too repetitive? EmilKarlsson (talk) 10:34, 29 June 2014 (UTC)
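The decision rule described in (a), comparing the conditional tail probability against a pre-determined threshold, might be sketched like this. The two-sided normal test, the variable names, and the 5% cutoff are my illustrative assumptions:

```python
import math

ALPHA = 0.05  # pre-determined significance threshold

def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z| | H0): probability of a result at least as
    extreme as z in either tail, for Z standard normal under H0."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical standardized difference between two sample means:
z_observed = 2.1
p = two_sided_p_value(z_observed)  # about 0.036
significant = p < ALPHA            # reject H0 at the 5% level
```

Taking the absolute value is what makes "at least as extreme" cover both tails in the two-sided case, which matches the earlier comment about how a two-sided test fits the definition.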
I think this looks good. Nice work. My only comment is whether "extreme results" needs explaining. But overall, I think the first lead paragraph is decent, though the second and third may need some tuning. We should now start finalizing the references for the first lead paragraph. I recommend that we start copying and pasting all the reference tags into the proposed lead paragraph and just edit out the ones that are not relevant and insert the ones that are. danielkueh (talk) 22:30, 29 June 2014 (UTC)
Here's the newly proposed lead with the all the original references:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result, given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
I have "edited out" the references that do not directly support the new definition and replaced them with the Redmond and Colton (2001) reference, which provides the closest definition. If you know of another source or two, please add them. Aside from that, I think it is also important to keep the point at the end that an "effect actually reflects the characteristics of the population," which is the whole point of inferential statistics. Plus, it also allows readers to better understand what the alternative conclusion to sampling error is. danielkueh (talk) 14:42, 30 June 2014 (UTC)
I am fine with keeping the "effect actually.." addition. If we can use some of the sources I listed in the first paragraph of this subsection, we get something like this:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result, given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
Some of the references I added might look a bit weird as I am not a frequent contributor to Wikipedia (although I have read the template pages for citing references). For instance, the Johnson reference (currently 8) is missing a page number and other details, but it is an early release at this point so it has not been assigned any. Any suggestions for improving the completeness of the added reference? EmilKarlsson (talk) 18:33, 3 July 2014 (UTC)
I have looked at each of the references that you inserted and have found them to be relevant and of high quality. They are more than sufficient for now. That said, I still think there should be one sentence or less that quickly explains what an "extreme" result is. Maybe such a sentence is better placed in the main body of this article? Either way, I think this newly revised first lead paragraph is ready for prime time. danielkueh (talk) 21:35, 6 July 2014 (UTC)
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result, given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4] In this context, more extreme results usually refers to a larger difference between the experimental and control group than was observed.
This is a simplified explanation of "more extreme" that does not go into the details about distribution tails or other scenarios besides the two-groups design (although I assume this is the most common set-up, at least in biology, medicine and social sciences). Is this sufficient, or can it be phrased in such a way that it encompasses more complexity without it becoming too bulky? EmilKarlsson (talk) 12:37, 7 July 2014 (UTC)
I moved the new sentence to a different location:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result (large difference between two or more sample means), given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
I also substituted "control and experimental" with two or more sample means. danielkueh (talk) 15:20, 7 July 2014 (UTC)

Please remember that the lead of the article needs to give the reader enough context to understand the precise definition, too. In other words, the first sentence must begin with "Statistical significance is a concept in statistics that describes..." or something similar. ElKevbo (talk) 16:28, 7 July 2014 (UTC)

ElKevbo, fair point. We could just restore the second sentence from the present version to the new lead as follows:
Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9] It is an integral part of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected.[4][10] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result (large difference between two or more sample means), given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]
danielkueh (talk) 19:45, 7 July 2014 (UTC)
I am in full agreement. Any final issues before we deploy the new lead paragraph? EmilKarlsson (talk) 10:38, 8 July 2014 (UTC)
Yes, I agree it is ready as well. danielkueh (talk) 16:37, 8 July 2014 (UTC)
On a side note, I just noticed that the article itself uses the "given the null hypothesis" definition with a reference (17). I have not read that reference, so I do not know the specifics of what it is stating, but we could either subsume it in the lead sentence or leave it. EmilKarlsson (talk) 14:25, 8 July 2014 (UTC)
Here's the link [11] and quote from p. 329 of Devore (reference 17):
"The P-value is the probability, calculated assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as contradictory to Ho as the value calculated from the available sample."
I agree it can be included in the lead paragraph danielkueh (talk) 16:37, 8 July 2014 (UTC)

Here's the lead with the Devore reference inserted:

Statistical significance is the low probability of obtaining at least as extreme results given that the null hypothesis is true.[1][5][6][7][8][9][11] It is an integral part of statistical hypothesis testing where it helps investigators to decide if a null hypothesis can be rejected.[4][10] In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[2][3] But if the probability of obtaining at least as extreme result (large difference between two or more sample means), given the null hypothesis is true, is less than a pre-determined threshold (e.g. 5% chance), then an investigator can conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.[4]

I believe we are set to go. Does anyone else have any other comments before we replace the existing lead? danielkueh (talk) 16:19, 9 July 2014 (UTC)

It is always possible to quibble over details when it comes to definitions in statistics. For instance, it is possible to have overpowered studies that can obtain statistical significance although the two groups have essentially identical sample means (so there is technically a third option besides sampling error and reflecting true characteristics of populations). But I am satisfied with the lead paragraph as it is written above. I think we are all pretty much in agreement at this point, so I have gone ahead and been bold and deployed our new leading paragraph. Good work everyone. EmilKarlsson (talk) 22:32, 9 July 2014 (UTC)
I agree. Great work. danielkueh (talk) 14:52, 10 July 2014 (UTC)
• Redmond, Carol; Colton, Theodore (2001). "Clinical significance versus statistical significance". Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics (3rd ed.). West Sussex, United Kingdom: John Wiley & Sons Ltd. pp. 35–36. ISBN 0-471-82211-6.
• Babbie, Earl R. (2013). "The logic of sampling". The Practice of Social Research (13th ed.). Belmont, CA: Cengage Learning. pp. 185–226. ISBN 1-133-04979-6.
• Faherty, Vincent (2008). "Probability and statistical significance". Compassionate Statistics: Applied Quantitative Analysis for Social Services (With exercises and instructions in SPSS) (1st ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 127–138. ISBN 1-412-93982-8.
• Sirkin, R. Mark (2005). "Two-sample t tests". Statistics for the Social Sciences (3rd ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 271–316. ISBN 1-412-90546-X.
• Cumming, Geoff (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, USA: Routledge. pp. 27–28.
• Krzywinski, Martin; Altman, Naomi (30 October 2013). "Points of significance: Significance, P values and t-tests". Nature Methods (Nature Publishing Group) 10 (11): 1041–1042. doi:10.1038/nmeth.2698. Retrieved 3 July 2014.
• Sham, Pak C.; Purcell, Shaun M (17 April 2014). "Statistical power and significance testing in large-scale genetic studies". Nature Reviews Genetics (Nature Publishing Group) 15 (5): 335–346. doi:10.1038/nrg3706. Retrieved 3 July 2014.
• Johnson, Valen E. (9 October 2013). "Revised standards for statistical evidence". Proceedings of the National Academy of Sciences (National Academies of Science). doi:10.1073/pnas.1313476110. Retrieved 3 July 2014.
• Altman, Douglas G. (1999). Practical Statistics for Medical Research. New York, USA: Chapman & Hall/CRC. p. 167. ISBN 978-0412276309.
• Borror, Connie M. (2009). "Statistical decision making". The Certified Quality Engineer Handbook (3rd ed.). Milwaukee, WI: ASQ Quality Press. pp. 418–472. ISBN 0-873-89745-5.
• Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 300–344. ISBN 0-538-73352-7.