Statistical significance

From Wikipedia, the free encyclopedia

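Statistical significance is used to refer to two separate notions: the p-value, in the sense of Fisher, which is calculated from the observed data; and the Type I error rate α, in the sense of Neyman–Pearson, which is fixed by the test design.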

A fixed number, most often 0.05, is referred to as a significance level or level of significance; such a number may be used either in the first sense, as a cutoff mark for p-values (each p-value is calculated from the data), or in the second sense as a desired parameter in the test design (α depends only on the test design, and is not calculated from observed data).

These two notions reflect distinct and incompatible approaches to statistics, and measure different quantities, which cannot be compared. However, they are often conflated, as in the first approach p is often compared to 0.05 (p ≤ 0.05 is checked), and in the second approach α is often set to 0.05 (α = 0.05), so combining these equations yields "p ≤ α", which is not a meaningful comparison. Due to this confusion, the notation α is sometimes used for a cutoff value of p even when the Neyman–Pearson approach is not being used. In this article, "statistical significance" is used in the sense of p-value (Fisher), and to avoid confusion, α will not be used. See statistical hypothesis testing for further discussion.


Overview

In the sense of Fisher (but not of Neyman–Pearson), statistical significance is a statistical assessment of whether observations reflect a pattern rather than just chance. When used in statistics, the word significant does not mean important or meaningful, as it does in everyday speech; with sufficient data, a statistically significant result may be very small in magnitude.

The fundamental challenge is that any partial picture of a given hypothesis, poll or question is subject to random error. In statistical testing, a result is deemed statistically significant if it is so extreme (in the absence of external variables that would influence the test) that such a result would be expected to arise by chance alone only in rare circumstances. Hence the result provides enough evidence to reject the hypothesis of 'no effect'.

For example, tossing 3 coins and obtaining 3 heads would not be considered an extreme result. However, tossing 10 coins and finding that all 10 land the same way up would be considered an extreme result: for fair coins the probability of having the first coin matched by all 9 others is (1/2)^9 ≈ 0.002, which is rare. The result may therefore be considered statistically significant evidence that the coins are not fair.
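The arithmetic in the coin example can be checked with a short sketch. The function name `prob_all_same` is hypothetical and introduced only for illustration; it computes the probability that all coins land the same way up (either all heads or all tails), which is the quantity used for the 10-coin case above.

```python
from fractions import Fraction

def prob_all_same(n_coins: int) -> Fraction:
    """Probability that n fair coins all land the same way up."""
    # The first coin may land either way; each of the other n - 1 coins
    # must match it, and each match has probability 1/2.
    return Fraction(1, 2) ** (n_coins - 1)

for n in (3, 10):
    p = prob_all_same(n)
    print(f"{n} coins: {p} (about {float(p):.4f})")
```

For 3 coins the probability is 1/4, which is not rare; for 10 coins it is 1/512, which rounds to 0.002.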

Researchers focusing solely on whether individual test results are significant or not may miss important response patterns which individually fall under the threshold set for tests of significance. Therefore along with tests of significance, it is preferable to examine effect-size statistics, which describe how large the effect is and the uncertainty around that estimate, so that the practical importance of the effect may be gauged by the reader.

The calculated statistical significance of a result is in principle only valid if the hypothesis was specified before any data were examined. If, instead, the hypothesis was specified after some of the data were examined, and specifically tuned to match the direction in which the early data appeared to point, the calculation would overestimate statistical significance.
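One way to see the overestimate is a small exact calculation. This is only a sketch under assumed numbers: 20 tosses of a fair coin and a rejection threshold of 15 are illustrative choices, and `upper_tail` is a hypothetical helper. If the direction of the test ("too many heads") is fixed in advance, the chance of a spurious rejection is one tail of the binomial distribution; if the direction is picked after seeing which way the data lean, either tail can trigger a rejection, so the true chance is twice the quoted one.

```python
from math import comb

def upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 20, 15  # illustrative: 20 tosses, reject at 15 or more of one face

# Hypothesis fixed before the data: reject only if heads >= 15.
nominal = upper_tail(k, n)

# Hypothesis tuned to the data: reject if heads >= 15 OR tails >= 15.
# The two events are disjoint (15 > n / 2) and equally likely for a fair coin.
actual = 2 * upper_tail(k, n)

print(f"quoted p-value:      {nominal:.4f}")
print(f"true false-positive: {actual:.4f}")
```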

An alternative (but nevertheless related) statistical hypothesis testing framework is the Neyman–Pearson frequentist school, which requires both a null and an alternative hypothesis to be defined and investigates the repeat-sampling properties of the procedure, i.e. the probability that a decision to reject the null hypothesis will be made when it is in fact true and should not have been rejected (a "false positive" or Type I error), and the probability that a decision will be made to accept the null hypothesis when it is in fact false (Type II error). Fisherian p-values are philosophically different from Neyman–Pearson Type I error rates, but the two are often confused, and the confusion is unfortunately propagated by many statistics textbooks.[1]

History

The phrase test of significance was coined by Ronald Fisher.[2] The term significance, used in a statistical sense, dates back to 1885.[3]

Use in practice

Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than or equal to the significance level,[4] the null hypothesis is rejected at that level. Such results are informally referred to as 'statistically significant (at the p = 0.05 level, etc.)'. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence", a 0.001 level of statistical significance is being stated. The lower the significance level chosen, the stronger the evidence required. The choice of significance level is somewhat arbitrary, but for many applications, a level of 5% is chosen by convention.[5][6]
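The rejection rule can be illustrated with a minimal sketch. The name `rejected_levels` is hypothetical, and the p-value used is the one from the 10-coin example, (1/2)^9; the levels are the popular ones listed above.

```python
LEVELS = (0.1, 0.05, 0.01, 0.005, 0.001)

def rejected_levels(p_value: float, levels=LEVELS) -> list:
    """Levels at which the null hypothesis is rejected (p-value <= level)."""
    return [level for level in levels if p_value <= level]

p = (1 / 2) ** 9  # about 0.002, from the 10-coin example
print(rejected_levels(p))
```

With this p-value the null hypothesis is rejected at the 10%, 5%, 1% and 0.5% levels, but not at the 0.1% level.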

In some situations it is convenient to express the complementary statistical significance (so 0.95 instead of 0.05), which corresponds to a quantile of the test statistic. In general, when interpreting a stated significance, one must be careful to note what, precisely, is being tested statistically.

Different levels of cutoff trade off countervailing effects. Lower levels – such as 0.01 instead of 0.05 – are stricter, and increase confidence in the determination of significance, but run an increased risk of failing to reject a false null hypothesis. Evaluation of a given p-value of data requires a degree of judgment, and rather than a strict cutoff, one may instead simply consider lower p-values as more significant.
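The trade-off can be made concrete with an exact binomial calculation. This is a sketch under assumed numbers, not a general recipe: 20 tosses, a one-sided test for "too many heads", and a coin whose true heads probability is 0.7 are all illustrative choices, and the function names are hypothetical.

```python
from math import comb

def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n: int, level: float) -> int:
    """Smallest head count k whose p-value under a fair coin is <= level."""
    # k = n + 1 always qualifies (empty tail, probability 0), so next() succeeds.
    return next(k for k in range(n + 2) if upper_tail(k, n, 0.5) <= level)

def miss_probability(n: int, level: float, true_p: float) -> float:
    """Chance of failing to reject the (false) fair-coin hypothesis."""
    return 1 - upper_tail(critical_count(n, level), n, true_p)

for level in (0.05, 0.01):
    k = critical_count(20, level)
    miss = miss_probability(20, level, 0.7)
    print(f"level {level}: reject at {k}+ heads, miss probability {miss:.3f}")
```

Under these assumptions the stricter level requires 16 heads instead of 15, so the biased coin escapes detection more often.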

Graphically, statistical significance is often indicated by the use of star symbols (*). The number of stars usually indicates the significance level: one star (*) for 0.05, two (**) for 0.01, and three (***) for 0.001 or 0.005. These star symbols may also be used on graphics, such as bar charts, to indicate a significant effect, such as a significant difference in the mean value between two populations.
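The star convention amounts to counting how many thresholds a p-value meets. The sketch below assumes the 0.05 / 0.01 / 0.001 variant; `significance_stars` is a hypothetical name, and the thresholds can be changed for fields that use 0.005 for three stars.

```python
def significance_stars(p_value: float, thresholds=(0.05, 0.01, 0.001)) -> str:
    """Return '', '*', '**' or '***' according to the thresholds met."""
    # Each comparison is True (1) or False (0); the sum is the star count.
    return "*" * sum(p_value <= t for t in thresholds)

for p in (0.2, 0.03, 0.004, 0.0005):
    print(f"p = {p}: {significance_stars(p) or '(not significant)'}")
```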

In terms of σ (sigma)

In some fields, for example nuclear and particle physics, it is common to express statistical significance in units of the standard deviation σ of a normal distribution. A statistical significance of "nσ" can be converted into a p-value by use of the cumulative distribution function Φ of the standard normal distribution, through the relation:

p = 2 (1 - \Phi(n)) (this is the two-tailed form; the formula varies depending on whether a one-tailed or a two-tailed test is appropriate)

or via use of the error function:

p = 1 - \operatorname{erf}\left(n/\sqrt{2}\right) .

Tabulated values of these functions are often found in statistics textbooks: see standard normal table. The use of σ implicitly assumes a normal distribution of measurement values. For example, if a theory predicts a parameter to have a value of, say, 109 ± 3, and one measures the parameter to be 100, then one might report the measurement as a "3σ deviation" from the theoretical prediction. In terms of p-value, this statement is equivalent to saying that "assuming the theory is true, the probability of obtaining a deviation at least this large by coincidence is 0.27%" (since 1 − erf(3/√2) = 0.0027), again depending on whether a one-tailed or a two-tailed test is appropriate.
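The conversion from σ to a p-value can be sketched with the Python standard library. The function name `sigma_to_p` and the `two_tailed` flag are hypothetical; the sketch uses the complementary error function `math.erfc`, which equals 1 − erf but stays accurate when n is large, and assumes normally distributed measurements as stated above.

```python
from math import erfc, sqrt

def sigma_to_p(n_sigma: float, two_tailed: bool = True) -> float:
    """p-value for a deviation of n_sigma standard deviations (normal model)."""
    # Two-tailed: p = 1 - erf(n / sqrt(2)) = erfc(n / sqrt(2)).
    p_two = erfc(n_sigma / sqrt(2))
    # One-tailed: half of the two-tailed value, by symmetry.
    return p_two if two_tailed else p_two / 2

# Worked example from the text: prediction 109 ± 3, measurement 100.
n = abs(100 - 109) / 3
print(f"{n:.0f} sigma, two-tailed p = {sigma_to_p(n):.4f}")
print(f"{n:.0f} sigma, one-tailed p = {sigma_to_p(n, two_tailed=False):.5f}")
```

The two-tailed value for 3σ is about 0.0027, matching the figure quoted above; the one-tailed value is half of that.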

Fixed significance levels such as those mentioned above may be regarded as useful in exploratory data analyses. However, modern practice is to quote the p-value explicitly where the outcome of a test is essentially the final outcome of an experiment or other study, and to state whether the p-value is judged to be significant. This allows the maximum information to be transferred from a summary of the study into meta-analyses.

Pitfalls and criticism

The scientific literature contains extensive discussion of the concept of statistical significance and in particular of its potential misuse and abuse.

Signal–noise ratio conceptualisation of significance

Statistical significance can be considered to be the confidence one has in a given result. In a comparison study, it depends on the relative difference between the groups compared, the number of measurements and the noise associated with the measurements. In other words, the confidence one has in a given result being non-random (i.e. not a consequence of chance) depends on the signal-to-noise ratio (SNR) and the sample size.

Expressed mathematically, the confidence that a result is not due to random chance is given by the following formula by Sackett:[7]

\mathrm{confidence} = \frac{\mathrm{signal}}{\mathrm{noise}} \times \sqrt{\mathrm{sample\ size}}.

For clarity, the above formula is presented in tabular form below.

Dependence of confidence with noise, signal and sample size (tabular form)

Parameter      Parameter increases     Parameter decreases
Noise          Confidence decreases    Confidence increases
Signal         Confidence increases    Confidence decreases
Sample size    Confidence increases    Confidence decreases

In words, confidence is high if the noise is low and/or the sample size is large and/or the effect size (signal) is large. The confidence of a result (and its associated confidence interval) does not depend on effect size alone. If the sample size is large and the noise is low, a small effect size can be measured with great confidence. Whether a small effect size is considered important depends on the context of the events compared.
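Sackett's formula is simple enough to evaluate directly. In the sketch below the function name `sackett_confidence` and the input numbers are illustrative assumptions; the two calls contrast a small effect measured on a large sample with a large effect measured on a small one.

```python
from math import sqrt

def sackett_confidence(signal: float, noise: float, sample_size: int) -> float:
    """confidence = (signal / noise) * sqrt(sample size), after Sackett."""
    return signal / noise * sqrt(sample_size)

# Small effect, large sample (illustrative numbers).
small_effect = sackett_confidence(signal=0.1, noise=1.0, sample_size=10_000)
# Large effect, small sample (illustrative numbers).
large_effect = sackett_confidence(signal=1.0, noise=1.0, sample_size=25)

print(f"small effect, n = 10000: {small_effect:.1f}")
print(f"large effect, n = 25:    {large_effect:.1f}")
```

With these inputs the small effect ends up with the higher confidence, which is the point made above: a large, low-noise sample can establish a small effect firmly.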

In medicine, small effect sizes (reflected by small increases of risk) are often considered clinically relevant and are frequently used to guide treatment decisions if there is great confidence in them. Whether a given treatment is considered a worthy endeavour is dependent on the risks, benefits and costs.[citation needed]

Does order of procedure affect statistical significance?

Order refers to which comes first: the test data or the specification of the hypotheses to be tested. When the hypotheses come first the test is "prospective", and when the data come first the test is "retrospective". Traditionally, prospective tests have been required.[8][9] However, there is a well-known, generally accepted hypothesis test in which the data preceded the hypotheses.[10][dubious] In that study the statistical significance was calculated in the same way as it would have been had the hypotheses preceded the data. A retrospective significance test can be used to separate promising and unpromising treatments, but a prospective test is required to justify scientific conclusions. "The reasoning behind statistical significance works well if you decide what effect you are seeking, design an experiment or sample to search for it, and use a test of significance to weigh the evidence that you get."[11] (p 465) "You cannot legitimately test a hypothesis on the same data that first suggested that hypothesis."[11] (p 466)

A related question in the use of statistics in the physical sciences is whether probability theory applies to the known past in the same way that it applies to the unknown future. Although these questions have been discussed,[12] there are few references in this area of statistics. It hardly seems reasonable to accord the same status to a hypothesis that explains the results of an experiment after the results are known as to a hypothesis that predicts the results before they are known, because predicting an event before it occurs is more difficult than explaining it after it occurs.


References

  1. ^ Hubbard, Raymond; Bayarri, M.J. (November 2003), P Values are not Error Probabilities, a working paper that explains the difference between Fisher's evidential p-value and the Neyman–Pearson Type I error rate α.
  2. ^ "Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first." — R. A. Fisher (1925). Statistical Methods for Research Workers, Edinburgh: Oliver and Boyd, 1925, p.43.
  3. ^ Higgs, M. D. (2013). "Do We Really Need the S-word?". American Scientist 101: 6–1. doi:10.1511/2013.100.6.
  4. ^ Fisher RA (1926). "The arrangement of field experiments". Journal of the Ministry of Agriculture 33: 504. 
  5. ^ Stigler 2008.
  6. ^ Fisher 1925.
  7. ^ Sackett DL (October 2001). "Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!)". CMAJ 165 (9): 1226–37. PMC 81587. PMID 11706914. 
  8. ^ Bacon, Francis (1952) [1620]. In Adler, Mortimer. Novum Organum. Great Books of the Western World 30. Encyclopedia Britannica. 
  9. ^ Boole, George (1958) [1854]. "22". The Laws of Thought. New York: Dover Publications Inc. p. 402. ISBN 0-486-60028-9. 
  10. ^ USEPA (December 1992). Respiratory Health Effects of Passive Smoking: Lung Cancer and other disorders. Washington D. C.: U. S. Environmental Protection Agency. Retrieved Aug. 8, 2012. 
  11. ^ a b Moore, David; McCabe, George P. (2003). Introduction to the practice of statistics. New York: W.H. Freeman and Co. ISBN 9780716796572. 
  12. ^ Root, D.H. (2003). "Bacon, Boole, the EPA and Scientific Standards". Risk Analysis 23 (4): 663–668. doi:10.1111/1539-6924.00345. 
