# Talk:Statistical significance

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Mathematics (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.

## Comment on Dubious

The citation (USEPA December 1992) contains numerous statistical tests, some presented as p-values and some as confidence intervals. Figures 5-1 through 5-4 show some of the test statistics used in the citation. In two of the figures the statistics are for individual studies and can be assumed to be prospective. In the other two, the statistics are for pooled studies and can be assumed to be retrospective. Table 5-9 includes the results of the test of the hypothesis RR = 1 versus RR > 1 for individual studies and for pooled studies. In the cited report, no distinction between prospective tests and retrospective tests was made. This is a departure from the traditional scientific method, which makes a strict distinction between predictions of the future and explanations of the past. Gjsis (talk)

## Effect size

In the paragraph on effect size, I wonder whether the clause "(in cases where the effect being tested for is defined in terms of an effect size)" could be deleted: I am not aware of cases where an effect cannot be quantified. Strasburger (talk) 13:02, 1 March 2015 (UTC)

I would advocate complete deletion of that paragraph. It seems to depict an experimenter performing a non-objective analysis: Collect some data, measure its "significance", and then, if not significant, collect some more data, measure significance again, etc. until something significant is found. This is not the way to do things, and it should not be discussed in this article as if there is a way around such snooping. Still, I know that it happens. I just don't want to see it advocated, even if unintentionally so. If I have misinterpreted the text, then I apologize, but that is my interpretation of what we have at the moment in the text. Isambard Kingdom (talk) 14:10, 7 June 2015 (UTC)
I would add, however, that it is always good practice to report: the number of data, the "effect size" (be Pearson r, or whatever is being assessed), and p-value. Indeed, I can't understand why these quantities would not be reported. Isambard Kingdom (talk) 14:17, 7 June 2015 (UTC)
While I would certainly agree that the paragraph on effect size could (and should) be improved, I believe you did misinterpret what it tries to say. You are absolutely right about the bad practice you describe above, but reporting effect size is actually counter to that practice. The paragraph is simply an advice to report some measure of effect size in the results section. It need not be whatever is assessed, btw, just some valid measure of the size of an effect. Cohen's d seems the most common from what I see. What a valid measure of effect size is, should be part of the corresponding Wikipedia entry; the p value, in any case, is not one (although it is often misinterpreted that way). That advice is now quite common and is meant to discourage relying on significance and the p value.
That said, the fact that you misread that paragraph implies that something is not clear there. I believe it's the parenthesis that is misleading, so I will go ahead and delete it. It is also misleading to say "the effect size", because that can be misread as referring to the measure for which p was determined. So I will change that to "an effect size". Strasburger (talk) 15:49, 7 June 2015 (UTC)
Small "effects" can be significant, of course. Perhaps this is what you are getting at? Generally speaking, small effects normally require lots of data to be resolved. Do people really not report effect size? In my experience that is sometimes all they report! I think we agree, though, good practice is to report data number N, effect size r, and p-value. Of course, p-value is estimated conditional on both N and r. So, given a null statistical hypothesis and N actual data having effect size r, there is a probability p that a sample of N synthetic data from the null hypothesis would have an effect larger than r. Whether or not this has "research significance" (as phrased in the article) depends on the situation, I think. Isambard Kingdom (talk) 16:39, 7 June 2015 (UTC)
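The dependence of p on both N and r described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the null hypothesis is taken to be two independent standard normal variables, the effect size is Pearson's r, and the function names `pearson_r` and `simulated_p_value` are hypothetical. It draws synthetic samples of size N from the null and counts how often their |r| is at least as large as the observed one.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulated_p_value(r_obs, n, n_sim=2000, seed=0):
    """Monte Carlo estimate of P(|r| >= |r_obs|) under the assumed null.

    Assumption: under the null, x and y are independent standard normals.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(pearson_r(xs, ys)) >= abs(r_obs):
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    # Same effect size, different amounts of data.
    for n in (10, 100):
        print(n, simulated_p_value(0.3, n))
```

With the same r = 0.3, the estimated p is large for N = 10 and small for N = 100, which is why reporting N, the effect size, and p together is more informative than reporting any one of them alone.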

## Wrong definition of p value

After being away from these discussions for quite some time, I noticed that large-scale modifications have been made to the lead paragraph. As a contributing party to the previous versions, I want to maintain a high level of accuracy for statistical concepts that are very misunderstood and abused. The first thing I'd like to correct is the fact that the current leading paragraph defines p value incorrectly.

The p-value is the probability of observing an effect given that the null hypothesis is true whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.

In reality, the p value is the probability of observing an effect, or a more extreme effect, given that the null hypothesis is true (i.e. the shaded areas near the tails in a classic presentation). It isn't simply the probability of a specific effect given the null, because that probability is always infinitesimal (e.g. just a sliver in a normal distribution). In fact, this is explained explicitly in one of the Nature journal references (http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2698.html) for the lead sentence. Pay special attention to Figure 1c. The figure text reads:

The statistical significance of the observation x is the probability of sampling a value from the distribution that is at least as far from the reference, given by the shaded areas under the distribution curve.

This is further reinforced by another already existing reference, namely that to the Nature Reviews Genetics paper (http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html). Although not open access, the relevant section states:

The P value, which was introduced earlier by Fisher in the context of significance testing, is defined as the probability of obtaining — among the values of T generated when H0 is true — a value that is at least as extreme as that of the actual sample (denoted as t).

The PNAS reference about revised standards for statistical evidence also agrees (http://www.pnas.org/content/110/48/19313.full):

The P value from a classical test is the maximum probability of observing a test statistic as extreme, or more extreme, than the value that was actually observed, given that the null hypothesis is true.

Furthermore, I propose changing the word "effect" to "results", because "effect" is equivocal: it can mean both "something that is of interest to measure" (i.e. the "effect" part of the term "effect size") and the measurement itself (i.e. "results").

Based on these considerations, I propose changing the lead paragraph from:

In statistics, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level.[1][2][3][4][5][6][7] The p-value is the probability of observing an effect given that the null hypothesis is true whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.[8] As a matter of good scientific practice, a significance level is chosen before data collection and is usually set to 0.05 (5%).[9] Other significance levels (e.g., 0.01) may be used, depending on the field of study.[10]

to this:

In statistics, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level.[1][2][3][4][5][6][7] The p-value is the probability of obtaining at least as extreme results given that the null hypothesis is true whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.[3][4][5][8] As a matter of good scientific practice, a significance level is chosen before data collection and is usually set to 0.05 (5%).[9] Other significance levels (e.g., 0.01) may be used, depending on the field of study.[10]

I start this discussion rather than being bold when editing to firmly anchor this change among participants. That way we can make long-lasting and durable changes. EmilKarlsson (talk) 22:20, 2 June 2015 (UTC)

I have no objections to the proposed change, i.e. changing effect --> at least as extreme results. It's a bit wordy but oh well, not a big deal. danielkueh (talk) 22:28, 2 June 2015 (UTC)
I do have some questions. Suppose we analyzed the results of a simple between-groups experiment and observed a p-value of 0.5 (or 50%), which is clearly not significant. Would the results need to be extreme to obtain such a large p-value? Is the p-value, in this case, "just a sliver?" danielkueh (talk) 14:52, 3 June 2015 (UTC)
Let the population means of the two groups be m(1) and m(2) (these are unknown parameters) and the observed sample means we get be x(1) and x(2). The observed difference is then |x(1)-x(2)| (i.e. the distance between sample means). Assuming that the null hypothesis of m(1) = m(2) is true, a p value of 0.5 means that there is a 50% probability of obtaining a difference |x(1)-x(2)| that is equal to or larger than (not merely "equal to") what you actually observed. The phrase "more extreme results" here means "a larger deviation from the null hypothesis than observed" i.e. a larger value for |x(1)-x(2)|. The p value (correctly defined) is never a sliver of a distribution (but the wrong definition of p value currently used implies that it is), since it is always "results or more extreme results / at least as extreme results" (i.e. from the observed result all the way towards the tail of the distributions). Consider figure 1C in http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2698.html#f1. The p value is then the entire black + grey area under the curve (for a two tailed test), not just the infinitesimal sliver constituting the area indicated by the dotted line under "x" (i.e. the observed results). Hopefully this goes some way towards answering your question, but please ask follow-up questions if something I wrote sounds weird, strange or otherwise unclear. EmilKarlsson (talk) 20:56, 3 June 2015 (UTC)
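The "entire black + grey area" reading can be sketched numerically. The example below is a simplification, not the test used in any of the cited papers: it assumes the observed difference has already been standardized to a z-score that follows a standard normal distribution under the null hypothesis. The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(z_obs: float) -> float:
    """Area in both tails at least as far from zero as the observed z.

    Assumption: under the null hypothesis the standardized difference
    follows a standard normal distribution.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return 2.0 * (1.0 - std_normal.cdf(abs(z_obs)))


for z in (0.6745, 1.96, 3.0):
    print(f"z = {z}: p = {two_sided_p_value(z):.4f}")
```

Under these assumptions a p value of about 0.5 corresponds to an observed |z| of roughly 0.67, and a p value of about 0.05 to |z| of roughly 1.96; in both cases p is the whole area from the observed value outwards to the tails.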
I disagree that the second sentence on the definition of the p-value implies that the p-value is just a "sliver." I agree that the first sentence does imply that because the p-value has to be a sliver for it to be significant (assuming a very low threshold). Anyway, I actually picked up the word "sliver" from what you wrote earlier: "because that probability is always infinitesimal (e.g. just a sliver in a normal distribution)." Hence, my earlier questions. Thank you for explaining the phrase, "results or more extreme results / at least as extreme results". I still find that statement to be a bit wordy for my taste, but that is a very small matter. danielkueh (talk) 22:08, 3 June 2015 (UTC)
Now I understand our crux! When I say that it is wrong to define p value as just "probability of results given null" (instead of "probability of results or more extreme results given null") because it would falsely entail that such a probability (p value) would always just be a tiny sliver, I am speaking about the area under the graph of a distribution (compare the infinitesimal area precisely under x constituting the flawed definition with the dark + grey area under the graph corresponding to the correct definition in the above mentioned figure 1C). This is because a given result value (assuming that the variable is continuous and can take any value within a reasonable range) is just one possibility among a very large set of realistic possibilities. If our observed difference |x(1)-x(2)| happened to be 2.1, it could have been 1.8, 1.85, 1.8445456, 2.14 and so on. Thus, getting our precise result of 2.1 (or any specific value) given null would almost always be quite unlikely. So, given this flawed definition of a p value, all p values would be exceedingly small, which is then my reductio argument against what I see as the wrong definition of p value. Instead, p value should be thought of as "probability of getting a result of 2.1 or more extreme (i.e. further away from null), given null". Does this clarify what I wrote before, or does it introduce more questions than answers? EmilKarlsson (talk) 18:32, 4 June 2015 (UTC)
@EmilKarlsson, I understand what you're saying about the area under the curve and how it captures a range of lower probabilities or p-values. However, I still don't see any fundamental contradiction between the definition that you proposed and the current definition in the second lead sentence. But suppose for the sake of this discussion that there is a contradiction or a distinction between the two definitions. If we need to be exact, then all we have to add is the word "observed" or "calculated" before p-value, as in "observed p-value (p_obs)" or "calculated p-value (p_calc)," which is different from the critical p-value (p_crit). It seems to me that your proposed definition speaks more to p_crit than to p_obs. At the end of the day, to determine if an experimental result is significant, all we need to know is whether p_obs < p_crit and not whether our observed p-value captures a range of lower values. danielkueh (talk) 20:36, 4 June 2015 (UTC)
@EmilKarlsson, FYI, I am not opposed to the proposed changes to the second sentence. I am just having fun discussing this topic here. So feel free to move on.
I still do not think we are completely on the same wavelength. What you have written makes sense if the x-axis in the histogram is the p-value, but it is rather the observed difference. It boils down to this: the probability of getting a specific result (e.g. a difference of 5.42) given null is very, very tiny (and thus the flawed definition of p value implies that all p-values are essentially 0) because you could have gotten any observed difference, since it is often a continuous variable. 5.42 is just a tiny, tiny subset of all possibilities. This is why the correct p value definition has to include the bit about "more extreme results" (i.e. an observed difference of 5.42 or more extreme away from null) to even make sense or be useful. EmilKarlsson (talk) 11:24, 6 June 2015 (UTC)
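The reductio argument above can be made concrete with a short sketch. It assumes, purely for illustration, that the observed difference has been standardized and is standard normal under the null; the value 2.1 is borrowed from the example earlier in the thread and treated here as a z-score, and the function names are hypothetical. The probability of landing in an ever narrower band around the observed value shrinks towards zero, while the tail area stays fixed.

```python
from statistics import NormalDist

NULL_DIST = NormalDist()  # assumed null distribution of the standardized result


def band_probability(z: float, width: float) -> float:
    """Probability of a result within +/- width/2 of z under the null."""
    return NULL_DIST.cdf(z + width / 2) - NULL_DIST.cdf(z - width / 2)


def tail_probability(z: float) -> float:
    """Two-sided probability of a result at least as extreme as z."""
    return 2.0 * (1.0 - NULL_DIST.cdf(abs(z)))


z_obs = 2.1  # illustrative value only
for width in (0.1, 0.001, 0.000001):
    print(f"band of width {width}: {band_probability(z_obs, width):.3g}")
print(f"at least as extreme: {tail_probability(z_obs):.3g}")
```

However precisely the "specific result" is pinned down, its probability can be made arbitrarily small, so only the "at least as extreme" quantity is usable as a p value.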
@EmilKarlsson, Again, I don't dispute the proposed definition that includes the qualifier "or more extreme results." It is correct and fairly standard. If you feel that it should replace the present definition because it introduces more possibilities (smaller p-values) that correspond with more extreme results, then by all means do so. But to assert that the present definition is wrong (heading of this discussion) because it contradicts the proposed definition is a little over the top. All the present definition is saying is that if we observed an effect (mean difference), we get this specific p-value. And if this p-value is less than alpha, it is significant. That is all. It really does not matter if there are "additional sets of possibilities" that correspond with "more extreme results." I suppose in the days before SPSS, SAS, or R, when people had to rely on t- or F-distribution tables to identify regions of p-values (one-tail or two-tail) that are smaller than alpha, it made sense to remind them of "more extreme results" because that would help them understand these lookup tables. And the best they could hope to do when reporting p-values was to specify a range such as 0.01 < p < 0.05 or just simply p < 0.05. But in this day and age, when p-values can be calculated to the nth decimal, the only thing that we need to know is whether this p-value (singular not plural) is smaller or greater than alpha. It's redundant, and often pointless, to ask if our observed p-value would include p-values that are smaller and correspond with larger mean differences. Of course they would, why wouldn't they? But guess what? Unless we actually do another set of experiments that produces greater effects, the only p-value we have to report is the one that was actually calculated. Again, none of these issues of pragmatics contradicts the proposed statement with the qualifier, "or more extreme results."
But like I said, if you want to change "an effect" to "at least as extreme results" because it would add "other subset of possibilities," then by all means do so. Pragmatics aside, I concede it is conceptually correct. So knock yourself out. :) danielkueh (talk) 14:36, 6 June 2015 (UTC)
I agree with these changes. In the second sentence I would further change
"whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true"
to
"whereas the significance or alpha (α) level is the probability value at which the null hypothesis is rejected given that it is true". Strasburger (talk) 23:01, 3 June 2015 (UTC)
The current description of alpha does have problems, but I think the status of alpha as the type-I error rate does not quite shine through in your suggestion. Perhaps we can rework the entire first paragraph in addition to the p value sentence above? What about something like this:
In statistics, statistical significance (or a statistically significant result) is attained when the p-value (p) is smaller than the significance level or alpha (α). The p value is defined as the probability of getting at least as extreme results given that the null hypothesis is true and this value is determined by the observed data and sample size. [1][2][3][4][5][6][7] In contrast, alpha is defined as the probability of rejecting the null hypothesis given that it is true and is set by the researcher as the type-I error rate of the statistical test being used (and thus limits how often researchers incorrectly reject true null hypotheses). [3][4][5][8] Other significance levels (e.g., 0.01) may be used, depending on the field of study.[10]
Is there anything that is unclear, weird or equivocal with this suggestion? EmilKarlsson (talk) 18:32, 4 June 2015 (UTC)
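The description of alpha as the type-I error rate can be checked with a simulation. The sketch below is a simplification under loudly stated assumptions: both groups are drawn from the same normal population (so the null hypothesis is true by construction), the standard deviation is treated as known and equal to 1 so that a z-test applies, and the function name and default parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def type_i_error_rate(alpha=0.05, n_per_group=20, n_experiments=4000, seed=1):
    """Fraction of simulated experiments that reject a true null hypothesis.

    Assumptions: both groups come from N(0, 1), sigma is known, and a
    two-sided z-test on the difference of sample means is used.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    standard_error = math.sqrt(2.0 / n_per_group)  # sigma = 1 in both groups
    rejections = 0
    for _ in range(n_experiments):
        group_1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_2 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(group_1) - fmean(group_2)) / standard_error
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_experiments


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(alpha, type_i_error_rate(alpha=alpha))
```

Under these assumptions the rejection fraction stays close to whichever alpha was chosen in advance, which is the sense in which alpha "limits how often researchers incorrectly reject true null hypotheses".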
Very clear and concise. Strasburger (talk) 21:17, 4 June 2015 (UTC)
@EmilKarlsson, For starters, delete needless words such as "defined as" and just say what it is. For example, change "The p value is defined as the probability" to "The p value is the probability." Change "smaller than" back to "less than" as that is how the symbol "<" is often read. I recommend either splitting the third sentence on the alpha level into multiple sentences, or omitting some information. Too much info. It's the lead, not the main text. The last sentence does not follow from the previous sentence. Finally, I recommend making these changes *global,* i.e., be sure that the main text says the same thing as the lead. After all, the lead is supposed to be a summary (details in WP:lead). In fact, I recommend starting with the main text and working your way back to the lead. That way, no topic or issue is given undue weight. danielkueh (talk) 21:31, 4 June 2015 (UTC)
Good point. In fact, I think the entire article deserves to be re-written because of the seriousness and importance of this topic. What kind of sections do you think would be worthwhile to include? History, role in statistical significance testing (perhaps including table of important concepts related to statistical significance tests like alpha, beta, sample size etc.), what can be inferred from a statistically significant result, strengths and drawbacks, misunderstandings, alternatives to statistical significance? Is there any other key issue that stands out and deserves a place in this article? EmilKarlsson (talk) 11:24, 6 June 2015 (UTC)
@EmilKarlsson, it is entirely up to you. Right now, you seem to be focused on p-values, alphas, etc. So if you want, you can start with the "Role in statistical hypothesis testing." I agree this article could always be improved. If you think the present article is bad, you should have seen previous versions (e.g., [[1]]). By the way, I appreciate you taking the lead on this. Have fun editing. I am not as active as I used to be on Wikipedia, but if there's anything I can do to help, feel free to post your requests here or on my talk page. I am sure there are other editors who would be interested as well. danielkueh (talk) 14:36, 6 June 2015 (UTC)
@EmilKarlsson, FYI, if you intend on taking this article to FA status (WP:FA), you should check Wikipedia's The Core Contest (WP:TCC). I believe this year's contest is just over, but given the amount of time and effort that goes into building an article to FA status, you could try for next year's. It's just a bonus. danielkueh (talk) 15:03, 6 June 2015 (UTC)

Maybe it is just me, but it sometimes seems to me that there is too much focus on alpha and making an assessment, black and white, as to whether or not something is "significant". I can imagine that an experimenter might choose, beforehand, an alpha threshold (and that is the word I would use), but if a p-value is just slightly larger than the chosen alpha, the results still might be worthy of reporting. In my own work I have the flexibility to report p-values as they are, I put them in papers, and I let the reader judge them for what they are. So, in light of this, I would advocate some accommodation or discussion in this article of the often arbitrary thinking about what the alpha should be and, indeed, whether or not there should even be an alpha. Isambard Kingdom (talk) 14:01, 7 June 2015 (UTC) Oh, and I now see that I've already been commenting on this talk page! I had actually forgotten. Forgive me for possible redundancy. Isambard Kingdom (talk) 14:29, 7 June 2015 (UTC)

@Isambard Kingdom, interesting. I'm assuming your work is in physics or similar? I know in the life sciences and in the social sciences, the alpha level is strictly enforced. In fact, many folks get suspicious if the p-value is too close to the alpha level (e.g., 0.047). danielkueh (talk) 15:53, 7 June 2015 (UTC)
Hmm. I am in the life sciences/social sciences and I do it exactly like Isambard. Strasburger (talk) 16:29, 7 June 2015 (UTC)
@Strasburger, by "do it exactly like Isambard," you mean report p-values that are not significant? Sure, you can do that. Doesn't make it any more statistically significant. danielkueh (talk) 16:39, 7 June 2015 (UTC)
It's good practice to just mention the achieved p-value, and discuss its 'significance'. A significance level is needed if a test is performed by establishing a critical area. Nijdam (talk) 18:37, 10 June 2015 (UTC)
If the p value is slightly above the 5% mark, one way of saying it is that "the result just missed significance (p=xxx%)." The result might be worth reporting, in particular if n is small. Reading Friston's paper cited in the main article is instructive for taking the focus a little away from the alpha level. Strasburger (talk) 19:38, 10 June 2015 (UTC)
@Nijdam, I guess it really depends on the research question. If the non-significant result is interesting (e.g., drug A does not work), then yes, we should discuss the practical or theoretical significance of the statistically non-significant result. danielkueh (talk) 20:01, 10 June 2015 (UTC)

### references mentioned in discussion

1. Redmond, Carol; Colton, Theodore (2001). "Clinical significance versus statistical significance". Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics (3rd ed.). West Sussex, United Kingdom: John Wiley & Sons Ltd. pp. 35–36. ISBN 0-471-82211-6.
2. Cumming, Geoff (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, USA: Routledge. pp. 27–28.
3. Krzywinski, Martin; Altman, Naomi (30 October 2013). "Points of significance: Significance, P values and t-tests". Nature Methods (Nature Publishing Group) 10 (11): 1041–1042. doi:10.1038/nmeth.2698. Retrieved 3 July 2014.
4. Sham, Pak C.; Purcell, Shaun M (17 April 2014). "Statistical power and significance testing in large-scale genetic studies". Nature Reviews Genetics (Nature Publishing Group) 15 (5): 335–346. doi:10.1038/nrg3706. Retrieved 3 July 2014.
5. Johnson, Valen E. (October 9, 2013). "Revised standards for statistical evidence". Proceedings of the National Academy of Sciences (National Academies of Science). doi:10.1073/pnas.1313476110. Retrieved 3 July 2014.
6. Altman, Douglas G. (1999). Practical Statistics for Medical Research. New York, USA: Chapman & Hall/CRC. p. 167. ISBN 978-0412276309.
7. Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 300–344. ISBN 0-538-73352-7.
8. Schlotzhauer, Sandra (2007). Elementary Statistics Using JMP (SAS Press) (PAP/CDR ed.). Cary, NC: SAS Institute. pp. 166–169. ISBN 1-599-94375-1.
9. Craparo, Robert M. (2007). "Significance level". In Salkind, Neil J. Encyclopedia of Measurement and Statistics 3. Thousand Oaks, CA: SAGE Publications. pp. 889–891. ISBN 1-412-91611-9.
10. Sproull, Natalie L. (2002). "Hypothesis testing". Handbook of Research Methods: A Guide for Practitioners and Students in the Social Science (2nd ed.). Lanham, MD: Scarecrow Press, Inc. pp. 49–64. ISBN 0-810-84486-9.

## Introduction

I suggest the following introduction:

In statistics, statistical significance (or a statistically significant result) is attained when, simply put, the result is rather extreme, unexpected, assuming the null hypothesis to be true. As a measure of how extreme the result is, either the p-value of the result should be sufficiently small, i.e. less than a given value, the significance level, or the value of the test statistic is extreme, i.e. in the critical area.

The p-value is the probability of observing an effect as extreme or more extreme than the actual result, given that the null hypothesis is true, whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.

Nijdam (talk) 18:33, 10 June

• Oppose. Aside from the clumsy wording, it's not an improvement over the present version. I suggest giving the above discussion, led by EmilKarlsson, some time to come to a consensus on how best to proceed. In the meantime, we could just change "an effect" to "at least as extreme results," as suggested by EmilKarlsson above. danielkueh (talk) 18:48, 10 June 2015 (UTC)
Then improve the wording, but the text as it stands in the intro is too theoretical. Anyone may understand that a result is significant, if it is very unlikely in the light of the null hypothesis. The main point is not the p-value itself, but it being a measure for the extremeness of the found result. Nijdam (talk) 22:33, 10 June 2015 (UTC)
@Nijdam, I've changed "observing an effect" to "obtaining at least as extreme results," as discussed above. danielkueh (talk) 22:43, 10 June 2015 (UTC)
Explaining 'significance' is best done without direct reference to the p-value (of the observed result). As i said above, it is not difficult to understand that a result is significant - meaning pointing towards the rejection of the (null) hypothesis - when, assuming the null hypothesis to be true, an unlikely event has occurred. Compare this to the Proof by contradiction. What unlikely means may then be explained by the p-value. Nijdam (talk) 09:45, 11 June 2015 (UTC)
We go with the sources, which overwhelmingly define significance in terms of p-values. See cited references in the lead and the extensive discussion in the archives. If you would like to expound further on the process of establishing statistical significance, the best place to do so is in the main body of this article and not the lead, which is supposed to be a summary. You should join the discussion above with EmilKarlsson, who intends to revise the entire article. danielkueh (talk) 11:47, 11 June 2015 (UTC)

I've amended the intro and made explicit ref to the standard fallacies of p-value interpretation, plus a reference. Contributors to this section should note that, as Goodman makes clear in the cited article, many textbooks cannot be relied upon concerning the definition of p-values. Robma (talk) 11:43, 7 August 2015 (UTC)

I removed the newly added paragraph because it appears to try to settle a controversial issue (frequentist vs Bayesian approach) that is far from being resolved. I am not opposed to adding a new section that compares and contrasts the frequentist and Bayesian approaches. However, I think a better place for that sort of comparison is the p-value article, which already covers it. danielkueh (talk) 16:01, 7 August 2015 (UTC)
I think we may be "editing at cross purposes" here. The mods I made were to fix a misconception which transcends the debate between frequentism/Bayesianism. The p-value simply does not mean what the introduction (and, regrettably, Sirkin's text) states, under either paradigm. The frequentist/Bayes debate centres on a different issue: whether metrics like p-values (even correctly defined) are a meaningful inferential tool - which as you rightly say, is best addressed elsewhere. Of course, feel free to revert your edit if you see my point! Cheers Robma (talk) 09:47, 8 August 2015 (UTC)
While there are often misconceptions, and I sometimes get confused myself (!), in what way does the p-value not mean what is stated in the introduction? Isambard Kingdom (talk) 13:55, 8 August 2015 (UTC)
I fail to see how your recent edits "transcend" the frequentist/Bayesian approaches. If they are supposed to transcend that issue, then you're clearly using the wrong source because the whole point of the Goodman paper was to address the frequentist/Bayesian issue. I am interested to see your reply to Isambard's question above. danielkueh (talk) 15:17, 8 August 2015 (UTC)

Gah! Thanks for picking that up; I cited the wrong Goodman paper; here's the right one [2]. This makes clear all the points I'm failing to make, including the one about the correct interpretation of p-values transcending the Freq/Bayes debate. As Goodman explicitly states in the abstract, there's the p-value and its Bayesian counterpart in inferential issues. One is the tail area giving Pr(at least as extreme results, assuming H0), the other is a posterior probability. As such, they're not different interpretations of the same metric, but literally different quantities calculated in different ways. My edit was simply trying to correct the notion that P-values are the probability that H0 has been ruled out...which isn't true, regardless of one's stance on Frequentism/Bayes. Robma (talk) 18:59, 8 August 2015 (UTC)
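The point that the tail area and the posterior probability are different quantities can be illustrated with a short calculation using Bayes' theorem. This is only a sketch: the prior proportion of true null hypotheses and the power are made-up illustrative inputs, not values taken from Goodman or any other source, and the function name is hypothetical.

```python
def prob_null_given_significant(alpha: float, power: float, prior_null: float) -> float:
    """P(H0 is true | result is significant), by Bayes' theorem.

    alpha      -- probability of a significant result when H0 is true
    power      -- probability of a significant result when H0 is false
    prior_null -- assumed proportion of tested hypotheses where H0 is true
    """
    significant_and_null = alpha * prior_null
    significant_and_alternative = power * (1.0 - prior_null)
    return significant_and_null / (significant_and_null + significant_and_alternative)


# Hypothetical inputs chosen only for illustration.
for prior_null in (0.5, 0.9):
    posterior = prob_null_given_significant(alpha=0.05, power=0.8, prior_null=prior_null)
    print(f"prior P(H0) = {prior_null}: P(H0 | significant) = {posterior:.3f}")
```

With the same alpha of 0.05, the probability that the null is true after a significant result changes with the assumed prior and power, so it cannot be read off from the p value or alpha alone.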

Okay, but do you agree that the statement given in the intro, which is *conditional* on the null hypothesis being true, is an accurate statement of the meaning of the p-value? Again, I agree that significance is sometimes not understood, but I think the intro is technically correct, if not exactly poetic. Isambard Kingdom (talk) 19:13, 8 August 2015 (UTC)
Yes, indeed that is correct. What I deleted is the following: "But if the p-value is less than the significance level (e.g., p < 0.05), then an investigator may conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error". This indeed is what is done, but it's an incorrect inference from the p-value (as Fisher himself tried to make clear....). Robma (talk) 08:39, 9 August 2015 (UTC)
Thanks for clarifying the reference. However, I am still not sure why you made the first edit [3], which doesn't seem to be related to your second edit [4]. If you can explain that further, that would be helpful. Thanks. danielkueh (talk) 21:26, 8 August 2015 (UTC)
Because it removes both the incorrect interpretation of the p-value, supported by an unreliable reference, while retaining the frequentist argument in the second part that a p-value < 0.05 can be deemed statistically significant, which of course is correct. Robma (talk) 22:11, 8 August 2015 (UTC)
But this article is not just about p-values. It is about statistical significance. Significant p-values are typically linked to calculated statistics such as t- or F-values, which are calculated based on the ratio of "effect and error (numerator)" and "error alone (denominator)." That's what the interpretation is based on. danielkueh (talk) 22:20, 8 August 2015 (UTC)
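The ratio described above can be sketched for the one-way ANOVA case, where the F statistic divides the between-group mean square (which reflects effect plus error) by the within-group mean square (error alone). The function name `f_statistic` and the example data are hypothetical, and the sketch stops at the statistic itself rather than converting it to a p-value.

```python
from statistics import fmean


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])

    # Numerator: variation of group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ms_between = ss_between / (k - 1)

    # Denominator: variation of observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_within = ss_within / (n_total - k)

    return ms_between / ms_within


# Made-up data for illustration only.
print(f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

When the group means coincide the numerator is zero and so is F; the further apart the group means are relative to the scatter within groups, the larger the ratio becomes.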