

From Wikipedia, the free encyclopedia

A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study (not controlled). In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. The phrase "test of significance" was coined by Ronald Fisher: "Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first."[1]

Hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis. In frequency probability, these decisions are almost always made using null-hypothesis tests. These are tests that answer the question: "Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?"[2] More formally, they represent answers to the question, posed before undertaking an experiment, of what outcomes of the experiment would lead to rejection of the null hypothesis for a pre-specified probability of an incorrect rejection. One use of hypothesis testing is deciding whether experimental results contain enough information to cast doubt on conventional wisdom.

Statistical hypothesis testing is a key technique of frequentist statistical inference. The Bayesian approach to hypothesis testing is to base rejection of the hypothesis on the posterior probability.[3][4] Other approaches to reaching a decision based on data are available via decision theory and optimal decisions.

The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. The critical region is usually denoted by the letter C.



Example 1 – Philosopher's beans


The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized.[5]

Few beans of this handful are white.
Most beans in this bag are white.
Therefore: Probably, these beans were taken from another bag.
This is an hypothetical inference.

The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null-hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.

A simple generalization of the example considers a mixed bag of beans and a handful that contain either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged: if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or a one-tailed test while the generalization is termed a two-sided or two-tailed test.

Example 2 – Clairvoyant card game


A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.

As we try to find evidence of his clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is, of course: the person is (more or less) clairvoyant.

If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly p. The hypotheses, then, are:

  • null hypothesis $H_0: p = \tfrac{1}{4}$ (just guessing)

  • alternative hypothesis $H_1: p > \tfrac{1}{4}$ (true clairvoyant).

When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? It is obvious that with the choice c=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c=10. In the first case almost no test subjects will be recognized to be clairvoyant, in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a false positive, or Type I error. With c = 25 the probability of such an error is:

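$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X = 25 \mid p = \tfrac{1}{4}) = \left(\tfrac{1}{4}\right)^{25} \approx 10^{-15},$$

and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.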

Being less critical, with c=10, gives:

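$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge 10 \mid p = \tfrac{1}{4}) = \sum_{k=10}^{25} \binom{25}{k} \left(\tfrac{1}{4}\right)^{k} \left(\tfrac{3}{4}\right)^{25-k} \approx 0.07.$$

Thus, c = 10 yields a much greater probability of false positive.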

Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus:

$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge c \mid p = \tfrac{1}{4}) \le 0.01.$$

From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select: c = 13.
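
The calculation of the critical value can be checked numerically. The following is a minimal Python sketch, assuming that under the null hypothesis the number of hits X follows a binomial distribution with n = 25 trials and success probability p = 1/4; the function names `tail_probability` and `critical_value` are illustrative and not part of the original example.

```python
from math import comb


def tail_probability(c, n=25, p=0.25):
    """P(X >= c) for X ~ Binomial(n, p): the chance of at least c hits by pure guessing."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def critical_value(alpha, n=25, p=0.25):
    """Smallest c such that P(X >= c) <= alpha, or None if no such c exists."""
    for c in range(n + 1):
        if tail_probability(c, n, p) <= alpha:
            return c
    return None


if __name__ == "__main__":
    print("P(X >= 25):", tail_probability(25))
    print("P(X >= 10):", tail_probability(10))
    print("critical value at alpha = 1%:", critical_value(0.01))
```

Under these assumptions, 12 hits still leave a false-positive probability slightly above 1%, while 13 hits bring it below 1%, so 13 is the smallest admissible critical value.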

Example 3 – Radioactive suitcase


As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute and a standard deviation of 1 count per minute, then the reading lies only one standard deviation above the expected mean and we say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don't have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute and a standard deviation of 1 count per minute, then the reading lies seven standard deviations above the expected mean, the suitcase is not compatible with the null hypothesis, and there are likely other factors responsible for the measurements.

The test does not directly assert the presence of radioactive material. A successful test asserts that the claim of no radioactive material present is unlikely given the reading (and therefore ...). The double negative (disproving the null hypothesis) of the method is confusing, but using a counter-example to disprove is standard mathematical practice. The attraction of the method is its practicality. We know (from experience) the expected range of counts with only ambient radioactivity present, so we can say that a measurement is unusually large. Statistics just formalizes the intuitive by using numbers instead of adjectives. We probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings.

To slightly formalize intuition: Radioactivity is suspected if the Geiger-count with the suitcase is among or exceeds the greatest (5% or 1%) of the Geiger-counts made with ambient radiation alone. This makes no assumptions about the distribution of counts. Many ambient radiation observations are required to obtain good probability estimates for rare events.
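
This distribution-free rule can be sketched in a few lines of Python. The ambient readings below are invented purely for illustration, and the names `empirical_p_value` and `is_suspicious` are hypothetical; the sketch simply asks what fraction of ambient-only readings were at least as large as the suitcase reading.

```python
def empirical_p_value(reading, ambient_counts):
    """Fraction of ambient-only readings that are at least as large as `reading`."""
    at_least = sum(1 for count in ambient_counts if count >= reading)
    return at_least / len(ambient_counts)


def is_suspicious(reading, ambient_counts, alpha=0.05):
    """Flag the suitcase if its reading is among or exceeds the greatest alpha fraction."""
    return empirical_p_value(reading, ambient_counts) <= alpha


# Invented ambient data: 40 readings (counts per minute) taken with no suitcase present.
AMBIENT = [3] * 10 + [4] * 12 + [5] * 10 + [6] * 5 + [7] * 2 + [9] * 1

if __name__ == "__main__":
    for reading in (7, 9, 10):
        print(reading, empirical_p_value(reading, AMBIENT), is_suspicious(reading, AMBIENT))
```

With only 40 ambient readings the smallest nonzero empirical probability is 1/40, which illustrates why many observations are needed before a 1% threshold becomes meaningful.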

The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.

Example 4 – Lady tasting tea


The following example is summarized from Fisher, and is known as the Lady tasting tea example.[6] Fisher thoroughly explained his method in a proposed experiment to test a Lady's claimed ability to determine the means of tea preparation by taste. The article is less than 10 pages in length and is notable for its simplicity and completeness regarding terminology, calculations and design of the experiment. The example is loosely based on an event in Fisher's life. The Lady proved him wrong.[7]

  • The experiment provided the Lady with 8 randomly ordered cups of tea - 4 prepared by first adding milk, 4 prepared by first adding the tea. She was to select the 4 cups prepared by one method.
    • This offered the Lady the advantage of judging cups by comparison.
    • The Lady was fully informed of the experimental method.
  • The null hypothesis was that the Lady had no such ability.
  • The test statistic was a simple count of the number of successes in selecting the 4 cups.
  • The null hypothesis distribution was computed by the number of permutations. The number of selected permutations equaled the number of unselected permutations.
Tea-Tasting Distribution

| Success count | Permutations of selection | Number of permutations |
|---|---|---|
| 0 | oooo | 1 × 1 = 1 |
| 1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
| 2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
| 3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
| 4 | xxxx | 1 × 1 = 1 |
| Total | | 70 |
  • The critical region was the single case of 4 successes of 4 possible based on a conventional probability criterion (< 5%; 1 of 70 ≈ 1.4%).
  • Fisher asserted that no alternative hypothesis was (ever) required.

If and only if the Lady properly categorized all 8 cups was Fisher willing to reject the null hypothesis – effectively acknowledging the Lady's ability at a 1.4% significance level (but without quantifying her ability). Fisher later discussed the benefits of more trials and repeated tests.
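
The null distribution in the table can be reproduced by brute-force enumeration. The sketch below is only an illustration: it labels the cups 0 to 7, arbitrarily treats cups 0 to 3 as the ones prepared milk-first, and counts, for every possible choice of 4 cups, how many of the chosen cups are correct. The names used are hypothetical.

```python
from collections import Counter
from itertools import combinations

CUPS = range(8)
MILK_FIRST = {0, 1, 2, 3}  # arbitrary labelling of the 4 cups prepared milk-first


def success_distribution():
    """Count the selections of 4 cups by the number of correctly identified cups."""
    counts = Counter()
    for chosen in combinations(CUPS, 4):
        counts[len(MILK_FIRST & set(chosen))] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    distribution = success_distribution()
    total = sum(distribution.values())
    print(distribution, "total:", total)
    print("P(4 successes under the null hypothesis):", distribution[4] / total)
```

The enumeration yields 70 equally likely selections under the null hypothesis, of which exactly one has 4 successes, matching the 1 of 70 ≈ 1.4% quoted for the critical region.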

The testing process


In the statistical literature, statistical hypothesis testing plays a fundamental role.[8][citation needed] The usual line of reasoning is as follows:

  1. There is an initial research hypothesis of which the truth is unknown.
  2. The first step is to state the relevant null and alternative hypotheses. This is important as mis-stating the hypotheses will muddy the rest of the process. Specifically, the null hypothesis should be chosen in such a way that it allows us to conclude whether the alternative hypothesis can be accepted or stays undecided as it was before the test.[9]
  3. The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
  4. Decide which test is appropriate, and state the relevant test statistic T.
  5. Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example the test statistic may follow a Student's t distribution or a normal distribution.
  6. Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
  7. The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected, the so-called critical region, and those for which it is not. The probability of the critical region is α.
  8. Compute from the observations the observed value t_obs of the test statistic T.
  9. Decide to either fail to reject the null hypothesis or reject it in favor of the alternative. The decision rule is to reject the null hypothesis H0 if the observed value t_obs is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.

An alternative process is commonly used:

  1. Compute from the observations the observed value t_obs of the test statistic T.
  2. From the statistic, calculate the probability, under the null hypothesis, of a result at least as extreme as the one observed (the p-value).
  3. Reject the null hypothesis or not. The decision rule is to reject the null hypothesis if and only if the p-value is less than the significance level (the selected probability threshold).

The two processes are equivalent.[10] The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results.
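
The equivalence can be illustrated with a small sketch for an upper-tailed one-sample z-test. All numbers below (sample mean 10.6, hypothesized mean 10, known standard deviation 2, sample size 36) are invented for illustration, and the function names are hypothetical; the sketch uses `statistics.NormalDist` from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # distribution of the z statistic under the null hypothesis


def z_statistic(sample_mean, mu0, sigma, n):
    """Observed z: distance of the sample mean from mu0 in standard errors."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def reject_by_critical_region(z_obs, alpha):
    """First process: compare the statistic with the critical value for level alpha."""
    critical = STANDARD_NORMAL.inv_cdf(1 - alpha)
    return z_obs > critical


def reject_by_p_value(z_obs, alpha):
    """Second process: compare the upper-tail p-value with alpha."""
    p_value = 1 - STANDARD_NORMAL.cdf(z_obs)
    return p_value < alpha


if __name__ == "__main__":
    z_obs = z_statistic(sample_mean=10.6, mu0=10.0, sigma=2.0, n=36)
    for alpha in (0.05, 0.01):
        print(alpha, reject_by_critical_region(z_obs, alpha), reject_by_p_value(z_obs, alpha))
```

Both functions reach the same decision for any significance level; only the second also reports how far inside or outside the critical region the observation lies.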

The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software.

The difference in the two processes applied to the Radioactive suitcase example:

  • "The Geiger-counter reading is 10. The limit is 9. Check the suitcase."
  • "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."

The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.

It is important to note the philosophical difference between accepting the null hypothesis and simply failing to reject it. The "fail to reject" terminology highlights the fact that the null hypothesis is assumed to be true from the start of the test; if there is a lack of evidence against it, it simply continues to be assumed true. The phrase "accept the null hypothesis" may suggest it has been proved simply because it has not been disproved, a logical fallacy known as the argument from ignorance. Unless a test with particularly high power is used, the idea of "accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where its meaning is well understood.

Alternatively, if the testing procedure forces us to reject the null hypothesis (H0), we can accept the alternative hypothesis (H1) and we conclude that the research hypothesis is supported by the data. This conclusion rests on probabilistic considerations, in the sense that we accept that using another set of data could lead us to a different conclusion.

The processes described here are perfectly adequate for computation. However, they seriously neglect design-of-experiments considerations.[11][12]

It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.
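
As one illustration of such an estimate, the sketch below computes the sample size for an upper-tailed one-sample z-test with known standard deviation, using the standard textbook relation n = ((z_{1−α} + z_{1−β}) σ / δ)². The smallest difference worth detecting (δ), the standard deviation and the target power are assumptions supplied by the experimenter, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_sided_z(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n for which an upper-tailed z-test at level alpha detects a true
    mean shift of `delta` with at least the requested power (sigma assumed known)."""
    z = NormalDist()
    n = ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / delta) ** 2
    return ceil(n)


if __name__ == "__main__":
    print(sample_size_one_sided_z(delta=1.0, sigma=2.0))
    print(sample_size_one_sided_z(delta=0.5, sigma=2.0))
```

Halving the difference to be detected roughly quadruples the required sample size, which is why the estimate is best made before data collection begins.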

Definition of terms


The following definitions are mainly based on the exposition in the book by Lehmann and Romano:[8]

Statistical hypothesis
A statement about the parameters describing a population (not a sample).
Statistic
A value calculated from a sample, often to summarize the sample for comparison purposes.
Simple hypothesis
Any hypothesis which specifies the population distribution completely.
Composite hypothesis
Any hypothesis which does not specify the population distribution completely.
Null hypothesis (H0)
A simple hypothesis associated with a contradiction to a theory one would like to prove.
Alternative hypothesis (H1)
A hypothesis (often composite) associated with a theory one would like to prove.
Statistical test
A procedure whose inputs are samples and whose result is a hypothesis.
Region of acceptance
The set of values of the test statistic for which we fail to reject the null hypothesis.
Region of rejection / Critical region
The set of values of the test statistic for which the null hypothesis is rejected.
Critical value
The threshold value delimiting the regions of acceptance and rejection for the test statistic.
Power of a test (1 − β)
The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Size / Significance level of a test (α)
For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the upper bound of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate, (1 − α), is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
p-value
The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
Statistical significance test
A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version which is now part of statistical hypothesis testing.
Conservative test
A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Exact test
A test in which the significance level or critical value can be computed exactly and without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.

A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:

Most powerful test
For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
Uniformly most powerful test (UMP)
A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.
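
The relationship between size, power and β can be made concrete with the clairvoyance example. The sketch below assumes a test that rejects the null hypothesis at 13 or more hits out of 25 (an illustrative critical value) and an alternative in which the subject is right half of the time; both numbers and the function name are assumptions chosen for illustration.

```python
from math import comb


def rejection_probability(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p): the chance that the test rejects H0."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


N_CARDS, CRITICAL = 25, 13

size = rejection_probability(CRITICAL, N_CARDS, 0.25)   # Type I error rate under H0
power = rejection_probability(CRITICAL, N_CARDS, 0.50)  # power if the true p is 0.5
beta = 1 - power                                        # Type II error rate at p = 0.5

if __name__ == "__main__":
    print("size:", size, "power:", power, "beta:", beta)
```

The power depends on which value of p within the composite alternative is actually true; a subject only slightly better than chance would be detected far less often.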



If the p-value is less than the required significance level (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the given level of significance. Rejection of the null hypothesis is a conclusion. This is like a "guilty" verdict in a criminal trial - the evidence is sufficient to reject innocence, thus proving guilt. We might accept the alternative hypothesis (and the research hypothesis).

If the p-value is not less than the required significance level (equivalently, if the observed test statistic is outside the critical region), then the test has no result. The evidence is insufficient to support a conclusion. (This is like a jury that fails to reach a verdict.) The researcher typically gives extra consideration to those cases where the p-value is close to the significance level.

In the Lady tasting tea example, Fisher required the Lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. He defined the critical region as that case alone. The region was defined by a probability (of such a result arising by chance if the null hypothesis were true) of less than 5%.

Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of "Bigfoot". Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic.

Common test statistics




One-sample tests are appropriate when a sample is being compared to the population from a hypothesis. The population characteristics are known from theory or are calculated from the population.

Two-sample tests are appropriate for comparing two samples, typically experimental and control samples from a scientifically controlled experiment.

Paired tests are appropriate for comparing two samples where it is impossible to control important variables. Rather than comparing two sets, members are paired between samples so the difference between the members becomes the sample. Typically the mean of the differences is then compared to zero.

Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation.

T-tests are appropriate for comparing means under relaxed conditions (less is assumed).

Tests of proportions are analogous to tests of means (the 50% proportion).

Chi-squared tests use the same calculations and the same probability distribution for different applications:

  • Chi-squared tests for variance are used to determine whether a normal population has a specified variance. The null hypothesis is that it does.
  • Chi-squared tests of independence are used for deciding whether two variables are associated or are independent. The variables are categorical rather than numeric. They can be used to decide whether left-handedness is correlated with libertarian politics (or not). The null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables).
  • Chi-squared goodness of fit tests are used to determine the adequacy of curves fit to data. The null hypothesis is that the curve fit is adequate. It is common to determine curve shapes to minimize the mean square error, so it is appropriate that the goodness-of-fit calculation sums the squared errors.
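
As an illustration of the test of independence, the sketch below computes the chi-squared statistic for a contingency table of observed counts, with expected counts derived from the row and column totals. The 2 × 2 table is invented for illustration and the function name is hypothetical; the statistic is compared against 3.841, the 5% critical value of the chi-squared distribution with one degree of freedom.

```python
def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


CRITICAL_5_PERCENT_DF1 = 3.841  # upper 5% point of chi-squared with 1 degree of freedom

if __name__ == "__main__":
    observed = [[20, 30], [30, 20]]  # invented counts
    statistic, df = chi_squared_independence(observed)
    print(statistic, df, statistic > CRITICAL_5_PERCENT_DF1)
```

For larger tables the critical value has to be looked up for the corresponding degrees of freedom; the standard library offers no chi-squared distribution, so a table or a statistics package would be needed for that step.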

F-tests (analysis of variance, ANOVA) are commonly used when deciding whether groupings of data by category are meaningful. If the variance of test scores of the left-handed in a class is much smaller than the variance of the whole class, then it may be useful to study lefties as a group. The null hypothesis is that two variances are the same - so the proposed grouping is not meaningful.



In the table below, the symbols used are defined at the bottom of the table. Many other tests can be found in other articles.

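| Name | Formula | Assumptions or notes |
|---|---|---|
| One-sample z-test | $z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ | (Normal population or n > 30) and σ known. (z is the distance from the mean in relation to the standard deviation of the mean). For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within k standard deviations for any k (see: Chebyshev's inequality). |
| Two-sample z-test | $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$ | Normal population and independent observations and σ1 and σ2 are known |
| One-sample t-test | $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$, $df = n - 1$ | (Normal population or n > 30) and σ unknown |
| Paired t-test | $t = \dfrac{\bar{d} - d_0}{s_d / \sqrt{n}}$, $df = n - 1$ | (Normal population of differences or n > 30) and σ unknown |
| Two-sample pooled t-test, equal variances | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p \sqrt{1/n_1 + 1/n_2}}$, $s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$, $df = n_1 + n_2 - 2$ | (Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 unknown |
| Two-sample unpooled t-test, unequal variances | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$, $df = \dfrac{\left(s_1^2 / n_1 + s_2^2 / n_2\right)^2}{\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}}$ | (Normal populations or n1 + n2 > 40) and independent observations and σ1 ≠ σ2 both unknown |
| One-proportion z-test | $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$ | n p0 > 10 and n (1 − p0) > 10 and it is a SRS (Simple Random Sample), see notes. |
| Two-proportion z-test, pooled for $H_0: p_1 = p_2$ | $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p} (1 - \hat{p}) (1/n_1 + 1/n_2)}}$, $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$ | n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations, see notes. |
| Two-proportion z-test, unpooled for $\lvert d_p \rvert > 0$ | $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - d_p}{\sqrt{\hat{p}_1 (1 - \hat{p}_1) / n_1 + \hat{p}_2 (1 - \hat{p}_2) / n_2}}$ | n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations, see notes. |
| Chi-squared test for variance | $\chi^2 = \dfrac{(n - 1) s^2}{\sigma_0^2}$ | Normal population |
| Chi-squared test for goodness of fit | $\chi^2 = \sum^{k} \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$ | df = k − 1 − # parameters estimated, and one of these must hold: all expected counts are at least 5;[14] or all expected counts are > 1 and no more than 20% of expected counts are less than 5.[15] |
| Two-sample F test for equality of variances | $F = \dfrac{s_1^2}{s_2^2}$ | Normal populations. Arrange so $s_1^2 > s_2^2$ and reject H0 for $F > F(\alpha/2,\ n_1 - 1,\ n_2 - 1)$.[16] |
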
In general, the subscript 0 indicates a value taken from the null hypothesis, H0, which should be used as much as possible in constructing its test statistic. ... Definitions of other symbols:
  • $s^2$ = sample variance
  • $s_1$ = sample 1 standard deviation
  • $s_2$ = sample 2 standard deviation
  • $t$ = t statistic
  • $df$ = degrees of freedom
  • $\bar{d}$ = sample mean of differences
  • $d_0$ = hypothesized population mean difference
  • $s_d$ = standard deviation of differences
  • $\chi^2$ = Chi-squared statistic
  • $\hat{p}$ = x/n = sample proportion, unless specified otherwise
  • $p_0$ = hypothesized population proportion
  • $p_1$ = proportion 1
  • $p_2$ = proportion 2
  • $d_p$ = hypothesized difference in proportion
  • $\min\{n_1, n_2\}$ = minimum of n1 and n2
  • $F$ = F statistic
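
A few of these statistics can be computed directly with the Python standard library. The sketch below uses the standard textbook forms of the one-sample, paired and unpooled two-sample (Welch) t statistics; the function names and the small data sets in the usage section are invented for illustration, and turning a statistic into a p-value would additionally require the Student's t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


def paired_t(before, after, d0=0.0):
    """Paired test: a one-sample t-test applied to the pairwise differences."""
    differences = [a - b for a, b in zip(after, before)]
    return one_sample_t(differences, d0)


def welch_t(sample1, sample2, d0=0.0):
    """Unpooled two-sample t statistic with its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = stdev(sample1) ** 2 / n1  # squared standard error of the first mean
    v2 = stdev(sample2) ** 2 / n2  # squared standard error of the second mean
    t = (mean(sample1) - mean(sample2) - d0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    print(one_sample_t([1, 2, 3, 4, 5], mu0=2))
    print(paired_t(before=[1, 2, 3, 4], after=[2, 4, 5, 8]))
    print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Note that `statistics.stdev` divides by n − 1, which is the sample standard deviation these formulas call for.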



Origins


Hypothesis testing is largely the product of Ronald Fisher, Jerzy Neyman, Karl Pearson and (son) Egon Pearson. Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an (extended) hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century.

Fisher popularized the "significance test". He required a null-hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-hypothesis or not. Significance testing did not utilize an alternative hypothesis so there was no concept of a Type II error.

Neyman & Pearson considered a different problem (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.

Fisher and Neyman/Pearson clashed bitterly. The pair considered their formulation to be an improved generalization of significance testing. Fisher thought that it was without application. (The defining paper was abstract. Mathematicians have generalized and refined the theory for three generations.) All parties moved on to other matters with the conflict unresolved.

The modern version of hypothesis testing is a hybrid of the two approaches. (But signal detection, for example, still uses the Neyman/Pearson formulation.) Great conceptual differences were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs.[17] This history explains the inconsistent terminology (example: the null hypothesis is never accepted, but there is a region of acceptance).

While hypothesis testing was popularized early in the 20th century, evidence of its use can be found much earlier. In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.[18]

Use and Importance


Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing which can justify conclusions even when no scientific theory exists. In the Lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious".

Real world applications of hypothesis testing include:[19]

  • Testing whether more men than women suffer from nightmares
  • Establishing authorship of documents
  • Evaluating the effect of the full moon on behavior
  • Determining the range at which a bat can detect an insect by echo
  • Deciding whether hospital carpeting results in more infections
  • Selecting the best means to stop smoking
  • Checking whether bumper stickers reflect car owner behavior
  • Testing the claims of handwriting analysts

Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future".

Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s).[20] Other fields have favored the estimation of parameters. Editors often consider significance as a criterion for the publication of scientific conclusions based on experiments with statistical results.



Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught.[21][22] Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. An informed public should understand the limitations of statistical conclusions[23][24] [citation needed] and many college fields of study require a course in statistics for the same reason.[23][24] [citation needed] An introductory college statistics class places much emphasis on hypothesis testing - perhaps half of the course. Even such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics[25], but a limited amount of development continues.



The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong.

The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including:

  • The Clever Hans effect. A horse appeared to be capable of doing simple arithmetic.
  • The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse.
  • The Placebo effect. Pills with no medically active ingredients were remarkably effective.

A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy.

The book How to Lie with Statistics[26][27] is the most popular book on statistics ever published.[28] It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful.

Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported.
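
The multiple-testing problem can be illustrated with a short calculation. The sketch below assumes m independent tests, each run at significance level alpha, with every null hypothesis true; under those assumptions the chance of at least one false positive is 1 - (1 - alpha)^m. The function name and the independence assumption are illustrative only and are not taken from the cited sources.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent tests.

    Assumes every null hypothesis is true and each test is run at level alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if m < 0:
        raise ValueError("m must be non-negative")
    # Each test avoids a false positive with probability (1 - alpha);
    # independence lets the m probabilities multiply.
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(m, round(familywise_error_rate(0.05, m), 3))
```

Under these assumptions, 20 tests at the 5% level give roughly a 64% chance of at least one spurious "significant" result, which is why reporting only the significant tests can mislead.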

Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).



Since significance tests were first popularized many objections have been voiced by prominent and respected statisticians. The volume of criticism and rebuttal has filled books with language seldom used in the scholarly debate of a dry subject.[29][30][31][32] Much of the criticism was published more than 40 years ago. The fires of controversy have burned hottest in the field of experimental psychology. Nickerson surveyed the issues in the year 2000.[33] He included 300 references and reported 20 criticisms and almost as many recommendations, alternatives and supplements. The following section greatly condenses Nickerson's discussion, omitting many issues.

Selected criticisms


There are numerous persistent misconceptions regarding the test and its results:[33][citation needed]

  • The test is a flawed application of probability theory. But it clearly is a valid application of probability theory and logic.
    • While the data can be unlikely given the null hypothesis, the alternative hypothesis can be even more unlikely. ("Nobody can be that lucky" vs. "Clairvoyance is impossible.") But a hypothesis test is not, and is not designed to be, a measure of the likelihood of a hypothesis.
  • The test result is a function of sample size. But the test summarizes what can be concluded from the data available, as it should.
  • The test result is uninformative. But it provides information (a suggested conclusion) on a possibly over-simplified question.
  • Statistical significance does not imply practical significance. But significance tests can be structured to test for an effect size of "practical significance", if an investigator can be bothered to specify what that means.
  • Statistical testing harms forecasting success.[34] But it is not clear that the tests applied are appropriate to the particular context of forecasting.
  • Using statistical significance as a criterion for publication leads to the following problems, collectively known as publication bias:
    • Published Type I errors are difficult to correct.
    • Published effect sizes are biased upward.[citation needed]
    • Inadequately reported multiple testing investigations are biased by the invisibility of tests which failed to reach significance.
    • Type II errors (false negatives) are common.
But publication bias clearly affects any application of the scientific method.
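
The claim that published effect sizes are biased upward can be explored with a small simulation. This is a minimal sketch under stated assumptions: every study estimates the same true effect, the noise is normal with known standard deviation 1, and a study is "published" only if a two-sided z-test rejects at level alpha. The function name, parameter values and publication rule are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean


def simulate_publication_filter(true_effect=0.2, n=25, studies=5000,
                                alpha=0.05, seed=0):
    """Return (mean of all estimates, mean of published estimates,
    fraction of studies published) under a significance filter."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = 1.0 / n ** 0.5  # standard error of the mean, sigma = 1
    all_estimates, published = [], []
    for _ in range(studies):
        # The mean of n draws from N(true_effect, 1) is itself normal
        # with standard deviation std_err, so it is drawn directly.
        estimate = rng.gauss(true_effect, std_err)
        all_estimates.append(estimate)
        if abs(estimate / std_err) > z_crit:  # significant, so "published"
            published.append(estimate)
    published_mean = mean(published) if published else float("nan")
    return mean(all_estimates), published_mean, len(published) / studies


if __name__ == "__main__":
    overall, visible, fraction = simulate_publication_filter()
    print(f"mean of all estimates:       {overall:.3f}")
    print(f"mean of published estimates: {visible:.3f}")
    print(f"fraction published:          {fraction:.3f}")
```

With a small true effect and a modest sample size, only the studies that happen to overestimate the effect tend to cross the threshold, so the average published estimate exceeds the average over all studies.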

Each criticism has merit, but is subject to discussion.

Misuses and abuses


The characteristics of significance tests can be abused.[33] When the test statistic is close to the critical value (equivalently, when the p-value is close to the chosen significance level), the temptation to carefully treat outliers, to adjust the chosen significance level, to pick a better statistic or to replace a two-tailed test with a one-tailed test can be powerful. If the goal is to produce a significant experimental result:

  • Conduct a few tests with a large sample size.
  • Rigorously control the experimental design.
  • Publish the successful tests; hide the unsuccessful tests.
  • Emphasize the statistical significance of the results if the practical significance is doubtful.

If the goal is to fail to produce a significant effect:

  • Conduct a large number of tests with inadequate sample size.
  • Minimize experimental design constraints.
  • Publish the number of tests conducted that show "no significant result".

Results of the controversy


The controversy has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review,[35] medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias[36] and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively.[37] Textbooks have added some cautions[38] and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned use of significance tests although they have discussed doing so.[35]
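
As an illustration of the kind of sample-size estimate mentioned above, the following sketch uses the common normal approximation for a two-sided one-sample z-test with known standard deviation: n is about ((z_{1-alpha/2} + z_{power}) * sigma / effect)^2. The function name, the default values for alpha and power, and the choice of a z-test are assumptions made for the example, not a procedure taken from the cited textbooks.

```python
import math
from statistics import NormalDist


def required_sample_size(effect: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n for a two-sided one-sample z-test to detect a mean
    shift of size `effect` with the requested power (sigma known)."""
    if effect == 0:
        raise ValueError("effect must be non-zero")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    n = ((z_alpha + z_power) * sigma / abs(effect)) ** 2
    return math.ceil(n)  # round up to a whole observation


if __name__ == "__main__":
    for effect in (0.5, 0.2):
        print(effect, required_sample_size(effect, sigma=1.0))
```

Because the required n grows with the inverse square of the effect size, detecting an effect of 0.2 standard deviations needs several times as many observations as detecting an effect of 0.5.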

Alternatives to significance testing


The numerous criticisms of significance testing do not lead to a single alternative or even to a unified set of alternatives. As a result, statistical testing impedes communication between the author and the reader.[39] A unifying position of critics is that statistics should not lead to a conclusion or a decision but to a probability or to an estimated value with a confidence interval.

One strong critic of significance testing suggested a list of reporting alternatives:[34] effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals. "The distinction between the ... approaches is largely one of reporting and interpretation."[40]
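
Lehmann's remark that the approaches differ mainly in reporting can be illustrated with a simple case. The sketch below assumes a two-sided one-sample z-test with known standard deviation and reports the same analysis three ways: as a decision, as a p-value, and as a confidence interval. The function name and the returned dictionary layout are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_report(sample_mean: float, mu0: float, sigma: float,
                  n: int, alpha: float = 0.05) -> dict:
    """Report a two-sided z-test as a decision, a p-value and a
    (1 - alpha) confidence interval for the mean."""
    std_normal = NormalDist()
    std_err = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / std_err
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    half_width = std_normal.inv_cdf(1 - alpha / 2) * std_err
    interval = (sample_mean - half_width, sample_mean + half_width)
    return {
        "reject": p_value < alpha,  # the conclusion/decision form
        "p_value": p_value,         # the probability form
        "interval": interval,       # the confidence-interval form
    }


if __name__ == "__main__":
    print(z_test_report(sample_mean=10.5, mu0=10.0, sigma=1.0, n=25))
```

In this setting the three forms agree: the null value mu0 falls outside the interval exactly when the p-value is below alpha, which is when the test rejects.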

On one "alternative" there is no disagreement: Fisher himself said,[6] "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred,[41] "...don't look for a magic alternative to NHST [null hypothesis significance testing] ... It doesn't exist." "...given the problems of statistical induction, we must finally rely, as have the older sciences, on replication." The "alternative" to significance testing is repeated testing. The easiest way to decrease statistical uncertainty is by obtaining more data, whether by increased sample size or by repeated tests. Nickerson claimed to have never seen the publication of a literally replicated experiment in psychology.[33] However, an indirect approach to replication is meta-analysis.

While Bayesian inference is a possible alternative to significance testing, it requires information that is seldom available in the cases where significance testing is most heavily used.

Future of the controversy


It is unlikely that this controversy will be resolved in the near future. The supposed flaws and unpopularity of significance testing do not eliminate the need for an objective and transparent means of reaching conclusions regarding experiments that produce statistical results. Critics have not unified around an alternative. Some of them have, however, suggested reforms for statistical and marketing research education to include a more thorough analysis of the meaning of statistical significance.[42] Other forms of reporting confidence or uncertainty will probably grow in popularity.

Recent work includes reconstruction and defense of Neyman–Pearson testing.[43]


  1. ^ R. A. Fisher (1925). Statistical Methods for Research Workers, Edinburgh: Oliver and Boyd, 1925, p.43.
  2. ^ Cramer, Duncan (2004). The Sage Dictionary of Statistics. p. 76. ISBN 0-7619-4138-X.
  3. ^ Schervish, M (1996) Theory of Statistics, p. 218. Springer ISBN 0-387-94546-6
  4. ^ Kaye, David H.; Freedman, David A. (2011). "Reference Guide on Statistics". Reference manual on scientific evidence (3rd ed.). Eagan, MN Washington, D.C: West National Academies Press. p. 259. ISBN 978-0-309-21421-6. "In short, a Bayesian statistician can compute posterior probabilities for various hypotheses about the coin, given the data. These posterior probabilities quantify the statistician’s confidence in the hypothesis that a coin is fair. Although such posterior probabilities relate directly to hypotheses of legal interest, they are necessarily subjective, for they reflect not just the data but also the subjective prior probabilities—that is, degrees of belief about hypotheses formulated prior to obtaining data."
  5. ^ C. S. Peirce (August 1878). "Illustrations of the Logic of Science VI: Deduction, Induction, and Hypothesis". Popular Science Monthly. 13. Retrieved 30 March 2012.
  6. ^ a b Fisher, Sir Ronald A. (1956) [1935]. "Mathematics of a Lady Tasting Tea". In James Roy Newman (ed.). The World of Mathematics, volume 3 [Design of Experiments]. Courier Dover Publications. ISBN 978-0-486-41151-4. Originally from Fisher's book Design of Experiments.
  7. ^ Box, Joan Fisher (1978). R.A. Fisher, The Life of a Scientist. New York: Wiley. p. 134. ISBN 0-471-09300-9.
  8. ^ a b Lehmann, E.L.; Romano, Joseph P. (2005). Testing Statistical Hypotheses (3E ed.). New York: Springer. ISBN 0-387-98864-5.
  9. ^ Adèr, J.H. (2008). Chapter 12: Modelling. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A consultant's companion (pp. 183–209). Huizen, The Netherlands: Johannes van Kessel Publishing.
  10. ^ Triola, Mario (2001). Elementary statistics (8 ed.). Boston: Addison-Wesley. p. 388. ISBN 0-201-61477-4.
  11. ^ Hinkelmann, Klaus and Kempthorne, Oscar (2008). Design and Analysis of Experiments. Vol. I and II (Second ed.). Wiley. ISBN 978-0-470-38551-7.
  12. ^ Montgomery, Douglas (2009). Design and analysis of experiments. Hoboken, NJ: Wiley. ISBN 978-0-470-12866-4.
  13. ^ a b NIST handbook: Two-Sample t-Test for Equal Means
  14. ^ Steel, R.G.D, and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences., McGraw Hill, 1960, page 350.
  15. ^ Weiss, Neil A. (1999). Introductory Statistics (5th ed.). Addison Wesley. p. 802. ISBN 0-201-59877-9.
  16. ^ NIST handbook: F-Test for Equality of Two Standard Deviations (Testing standard deviations the same as testing variances)
  17. ^ Gigerenzer, Gerd (1990). "Part 3: The inference experts". The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge University Press. pp. 70–122. ISBN 978-0-521-39838-1.
  18. ^ Stigler, Stephen M. (1986). The history of statistics : the measurement of uncertainty before 1900. Cambridge, Mass: Belknap Press of Harvard University Press. p. 134. ISBN 0-674-40340-1.
  19. ^ Richard J. Larsen, Donna Fox Stroup (1976). Statistics in the Real World.
  20. ^ Hubbard, R.; Parsa, A. R.; Luthy, M. R. (1997). "The spread of statistical significance testing in psychology: The case of the Journal of Applied Psychology". Theory and Psychology. 7: 545–554. doi:10.1177/0959354397074006.
  21. ^ Mathematics > High School: Statistics & Probability > Introduction Common Core State Standards Initiative (relates to USA students)
  22. ^ College Board Tests > AP: Subjects > Statistics The College Board (relates to USA students)
  23. ^ a b Huff, Darrell (1993). How to lie with statistics. New York: Norton. p. 8. ISBN 0-393-31072-8. 'Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, "opinion" polls, the census. But without writers who use the words with honesty and readers who know what they mean, the result can only be semantic nonsense.'
  24. ^ a b Snedecor, George W.; Cochran, William G. (1967). Statistical Methods (6 ed.). Ames, Iowa: Iowa State University Press. p. 3. "...the basic ideas in statistics assist us in thinking clearly about the problem, provide some guidance about the conditions that must be satisfied if sound inferences are to be made, and enable us to detect many inferences that have no good logical foundation."
  25. ^ Lehmann, E. L. (1997). "Testing statistical hypotheses: The story of a book". Statistical Science. 12: 48. doi:10.1214/ss/1029963261.
  26. ^ Huff, Darrell (1993). How to lie with statistics. New York: Norton. ISBN 0-393-31072-8.
  27. ^ Huff, Darrell (1991). How to lie with statistics. London: Penguin Books. ISBN 0-14-013629-0.
  28. ^ "Over the last fifty years, How to Lie with Statistics has sold more copies than any other statistical text." J. M. Steele. "Darrell Huff and Fifty Years of How to Lie with Statistics". Statistical Science, 20 (3), 2005, 205–209.
  29. ^ Harlow, Lisa Lavoie; Stanley A. Mulaik; James H. Steiger, ed. (1997). What If There Were No Significance Tests?. Lawrence Erlbaum Associates. ISBN 978-0-8058-2634-0.
  30. ^ Morrison, Denton; Henkel, Ramon, ed. (2006) [1970]. The Significance Test Controversy. AldineTransaction. ISBN 0-202-30879-0.
  31. ^ McCloskey, Deirdre N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press. ISBN 978-0-472-05007-9.
  32. ^ Chow, Siu L. (1997). Statistical Significance: Rationale, Validity and Utility. SAGE Publications. ISBN 0-7619-5205-5.
  33. ^ a b c d Nickerson, Raymond S. (2000). "Null Hypothesis Significance Tests: A Review of an Old and Continuing Controversy". Psychological Methods. 5 (2): 241–301. doi:10.1037/1082-989X.5.2.241. PMID 10937333.
  34. ^ a b Armstrong, J. Scott (2007). "Significance tests harm progress in forecasting" (PDF). International Journal of Forecasting. 23 (2): 321–327. doi:10.1016/j.ijforecast.2007.03.004.
  35. ^ a b Wilkinson, Leland (1999). "Statistical Methods in Psychology Journals; Guidelines and Explanations". American Psychologist. 54 (8): 594–604. doi:10.1037/0003-066X.54.8.594.
  36. ^ ICMJE website: http://www.icmje.org/
  37. ^ Journal of Articles in Support of the Null Hypothesis website: JASNH homepage
  38. ^ Howell, David (2002). Statistical Methods for Psychology (5 ed.). Duxbury. p. 94. ISBN 0-534-37770-X.
  39. ^ Armstrong, J. (2007). "Statistical significance tests are unnecessary even when properly done and properly interpreted: Reply to commentaries". International Journal of Forecasting. 23 (2): 335–377. doi:10.1016/j.ijforecast.2007.01.010.
  40. ^ E. L. Lehmann (1997). "Testing Statistical Hypotheses: The Story of a Book". Statistical Science. 12 (1): 48–52. doi:10.1214/ss/1029963261.
  41. ^ Jacob Cohen (December 1994). "The Earth Is Round (p < .05)". American Psychologist. 49 (12): 997–1003. doi:10.1037/0003-066X.49.12.997. This paper led to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review.
  42. ^ Hubbard, Raymond; Armstrong, J. Scott (2006). "Why We Don't Really Know What Statistical Significance Means: Implications for Educators". Journal of Marketing Education. 28 (2): 114. doi:10.1177/0273475306288399. Preprint
  43. ^ Mayo, D. G.; Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction". The British Journal for the Philosophy of Science. 57 (2): 323. doi:10.1093/bjps/axl003.

Further reading

  • Lehmann E.L. (1992) "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: Breakthroughs in Statistics, Volume 1, (Eds Kotz, S., Johnson, N.L.), Springer-Verlag. ISBN 0-387-94037-5 (followed by reprinting of the paper)
  • Neyman, J.; Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Phil. Trans. R. Soc., Series A. 231 (694–706): 289–337. doi:10.1098/rsta.1933.0009.