# Null hypothesis

In statistical inference on data from a scientific experiment, the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena.[1] Rejecting or disproving the null hypothesis – and thus concluding that there are grounds for believing that there is a relationship between two phenomena or that a potential treatment has a measurable effect – is a central task in the modern practice of science, and gives a precise sense in which a claim is capable of being proven false.

In statistical significance testing, the null hypothesis is often denoted H0 (read “H-nought”) and is generally assumed true until evidence indicates otherwise. The concept of a null hypothesis is used differently in two approaches to statistical inference. In the significance testing approach of Ronald Fisher, a null hypothesis is potentially rejected or disproved on the basis of data that is significant under its assumption, but never accepted or proved. In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and these are distinguished on the basis of data, with certain error rates. Proponents of these two approaches criticize each other, though today a hybrid approach is widely practiced and presented in textbooks. This hybrid is in turn criticized as incorrect and incoherent (see statistical hypothesis testing). Statistical significance plays a pivotal role in statistical hypothesis testing, where it is used to determine whether a null hypothesis can be rejected or retained.

## Basic definitions

The null and alternate hypotheses are technical terms used exclusively in statistical tests, which are formal methods of reaching conclusions or making decisions on the basis of data. The hypotheses are conjectures about reality based on observations (in statistical terminology, conjectures about the population based on the sample). The tests are core elements of inferential statistics, heavily used in the interpretation of scientific experimental data, where they separate scientific claims from statistical noise.

"The statement being tested in a test of [statistical] significance is called the null hypothesis. The test of significance is designed to assess the strength of the evidence against the null hypothesis. Usually the null hypothesis is a statement of 'no effect' or 'no difference'."[2] It is often symbolized as H0.

The statement that is hoped or expected to be true instead of the null hypothesis is the alternative hypothesis.[2] Symbols include H1 and Ha.

Statistical significance test: "Very roughly, the procedure for deciding goes like this: Take a random sample from the population. If the sample data are consistent with the null hypothesis, then do not reject the null hypothesis; if the sample data are inconsistent with the null hypothesis, then reject the null hypothesis and conclude that the alternative hypothesis is true."[3]

The following sections add context and nuance to the basic definitions.

## History of statistical tests

The history of the null and alternative hypotheses is embedded in the history of statistical tests.[4][5]

• Before 1925: There are occasional transient traces of statistical tests for centuries in the past, which provide early examples of null hypotheses. In the late 19th century statistical significance was defined. In the early 20th century important probability distributions were defined. Gosset and Pearson worked on specific cases of significance testing.
• 1925: Fisher published the first edition of Statistical Methods for Research Workers which defined the statistical significance test and made it a mainstream method of analysis for much of experimental science. The text was devoid of proofs and weak on explanations, but it was filled with real examples. It placed statistical practice in the sciences well in advance of published statistical theory.
• 1933: In a series of papers (published over a decade starting in 1928) Neyman & Pearson defined the statistical hypothesis test as a proposed improvement on Fisher's test. The papers provided much of the terminology for statistical tests including alternative hypothesis and H0 as a hypothesis to be tested using observational data (with H1, H2... as alternatives).[6] Neyman did not use the term null hypothesis in later writings about his method.
• 1935: Fisher published the first edition of The Design of Experiments which introduced the null hypothesis[7] (by example rather than by definition) and carefully explained the rationale for significance tests in the context of the interpretation of experimental results.
• Following: Fisher and Neyman quarreled over the relative merits of their competing formulations until Fisher's death in 1962. Career changes and World War II ended the partnership of Neyman and Pearson. The formulations were merged by relatively anonymous textbook writers, experimenters (journal editors) and mathematical statisticians without input from the principals.[4] The subject today combines much of the terminology and explanatory power of Neyman & Pearson with the scientific philosophy and calculations provided by Fisher. Whether statistical testing is properly one subject or two remains a source of disagreement.[8] A sample of two: one text refers to the subject as hypothesis testing (with no mention of significance testing in the index) while another says significance testing (with a section on inference as a decision).

Fisher developed significance testing as a flexible tool for researchers to weigh their evidence. Instead, testing has become institutionalized. Statistical significance has become a rigidly defined and enforced criterion for the publication of experimental results in many scientific journals. In some fields significance testing has become the dominant and nearly exclusive form of statistical analysis. As a consequence the limitations of the tests have been exhaustively studied. Books have been filled with the collected criticism of significance testing.

## Principle

Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null hypothesis is true, when the study is on a random representative sample. The null hypothesis assumes no relationship between variables in the population from which the sample is selected. If the data-set of a random representative sample is very unlikely relative to the null hypothesis, defined as being part of a class of sets of data that only rarely will be observed, the experimenter rejects the null hypothesis, concluding that it (probably) is false. This class of data-sets is usually specified via a test statistic which is designed to measure the extent of apparent departure from the null hypothesis. The procedure works by assessing whether the observed departure measured by the test statistic is larger than a value defined so that the probability of occurrence of a more extreme value is small under the null hypothesis (usually in less than either 5% or 1% of similar data-sets in which the null hypothesis does hold). If the data do not contradict the null hypothesis, then only a weak conclusion can be made: namely, that the observed data set provides no strong evidence against the null hypothesis. Because the null hypothesis could be true or false in this case, in some contexts this is interpreted as meaning that the data give insufficient evidence to make any conclusion; in others, it means that there is no evidence to support changing from a currently useful regime to a different one.

For instance, a certain drug may reduce the chance of having a heart attack. Possible null hypotheses are "this drug does not reduce the chances of having a heart attack" or "this drug has no effect on the chances of having a heart attack". The test of the hypothesis consists of administering the drug to half of the people in a study group as a controlled experiment. If the data show a statistically significant change in the people receiving the drug, the null hypothesis is rejected.
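As a minimal sketch of the drug example, the following Python snippet computes a one-sided p-value for a 2×2 table of outcomes. The counts are invented purely for illustration and do not come from any study, and the use of an exact hypergeometric (Fisher-style) test is an assumption; the text above does not prescribe a particular test.

```python
from math import comb

# Hypothetical counts, invented for illustration only.
treated_total, control_total = 20, 20
treated_attacks, control_attacks = 2, 9


def one_sided_exact_p(treated_attacks, treated_total, control_attacks, control_total):
    """P(treated group shows <= the observed number of attacks) under H0.

    Under the null hypothesis (the drug has no effect) the attacks are
    spread over all participants at random, so the number that lands in
    the treated group follows a hypergeometric distribution.
    """
    population = treated_total + control_total
    attacks = treated_attacks + control_attacks
    all_splits = comb(population, treated_total)
    as_or_more_extreme = sum(
        comb(attacks, k) * comb(population - attacks, treated_total - k)
        for k in range(treated_attacks + 1)
    )
    return as_or_more_extreme / all_splits


p_value = one_sided_exact_p(treated_attacks, treated_total, control_attacks, control_total)
alpha = 0.05  # one of the conventional thresholds mentioned in the text
print(f"p = {p_value:.4f}; reject H0 at {alpha:.0%}: {p_value < alpha}")
```

With these invented counts the p-value falls below the 5% threshold, so the null hypothesis "this drug does not reduce the chances of having a heart attack" would be rejected; with less lopsided counts it would not be.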

## Quotations regarding the null hypothesis from the Design of Experiments

Fisher introduced the null hypothesis by an example, the now famous Lady tasting tea experiment, as a casual wager. She claimed the ability to determine the means of tea preparation by taste. Fisher proposed an experiment and an analysis to test her claim. She was to be offered 8 cups of tea, 4 prepared by each method, for determination. He proposed the null hypothesis that she possessed no such ability, so she was just guessing. With this assumption, the number of correct guesses (the test statistic) followed a hypergeometric distribution. Fisher calculated that her chance of guessing all cups correctly was 1/70. He was provisionally willing to concede her ability (rejecting the null hypothesis) in this case only. Having an example, Fisher commented:[9]

• "...it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis."
• "...the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution,' of which the test of significance is the solution."
• "We may, however, choose any null hypothesis we please, provided it is exact."

Regarding an alternative non-directional significance test of the Lady tasting tea experiment:

• "For this purpose the new test proposed would be entirely inappropriate, and no experimenter would be tempted to employ it. Mathematically, however, it is as valid as any other, in that with proper randomisation it is demonstrable that it would give a significant result with known probability, if the null hypothesis were true."

Regarding which test of significance to apply:

• "The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation, but has been the occasion of much theoretical discussion among statisticians."

On selecting the appropriate experimental measurement and null hypothesis:

• "This question, when the answer to it is not already known, can be fruitfully discussed only when the experimenter has in view, not a single null hypothesis, but a class of such hypotheses, in the significance of deviations from each of which he is equally interested."

## Testing for differences

In scientific and medical research, null hypotheses play a major role in testing the significance of differences in treatment and control groups. This use, while widespread, offers several grounds for criticism, including straw man, Bayesian criticism and publication bias.

The typical null hypothesis at the outset of the experiment is that no difference exists between the control and experimental groups (for the variable being compared). Other possibilities include:

• that values in samples from a given population can be modeled using a certain family of statistical distributions.
• that the variability of data in different groups is the same, although they may be centered around different values.

### Example

Given the test scores of two random samples of men and women, does one group differ from the other? A possible null hypothesis is that the mean male score is the same as the mean female score:

H0: μ1 = μ2

where:

H0 = the null hypothesis
μ1 = the mean of population 1, and
μ2 = the mean of population 2.

A stronger null hypothesis is that the two samples are drawn from the same population, such that the variance and shape of the distributions are also equal.
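One way to test the stronger null hypothesis (both samples drawn from the same population) is an exact permutation test on the difference in means. The sketch below is only an illustration under that assumption: the scores are invented, the function name is hypothetical, and the article itself does not specify which test to use.

```python
from itertools import combinations
from statistics import mean

# Hypothetical test scores, invented for illustration only.
scores_men = [72, 85, 78, 90, 66]
scores_women = [88, 92, 79, 95, 84]


def permutation_p_value(sample_1, sample_2):
    """Two-sided exact permutation p-value for a difference in means.

    If both samples come from the same population, every way of splitting
    the pooled scores into groups of the original sizes is equally likely.
    The p-value is the fraction of splits whose absolute mean difference
    is at least as large as the one actually observed.
    """
    pooled = list(sample_1) + list(sample_2)
    size_1 = len(sample_1)
    observed = abs(mean(sample_1) - mean(sample_2))
    total = 0
    as_extreme = 0
    for chosen in combinations(range(len(pooled)), size_1):
        chosen_set = set(chosen)
        group_1 = [pooled[i] for i in chosen]
        group_2 = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance so splits that tie with the observed value count.
        if abs(mean(group_1) - mean(group_2)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


p_value = permutation_p_value(scores_men, scores_women)
print(f"two-sided permutation p-value: {p_value:.4f}")
```

Full enumeration is only practical for small samples (here 252 splits); for larger samples a random subset of splits or a t-test would normally be used instead.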

## Terminology

• Simple hypothesis: Any hypothesis which specifies the population distribution completely. For such a hypothesis the sampling distribution of any statistic is a function of the sample size alone.
• Composite hypothesis: Any hypothesis which does not specify the population distribution completely. Example: A hypothesis specifying a normal distribution with a specified mean and an unspecified variance.

The simple/composite distinction was made by Neyman and Pearson.[6]

• Exact hypothesis: Any hypothesis that specifies an exact parameter value.[10] Example: μ = 100. Synonym: point hypothesis.
• Inexact hypothesis: Any hypothesis that specifies a parameter range or interval. Examples: μ ≤ 100; 95 ≤ μ ≤ 105.

Fisher required an exact null hypothesis for testing. (See the quotations above.)

A one-tailed hypothesis (AKA one-sided test)[2] is an inexact hypothesis in which the value of a parameter is specified as being either:

• above or equal to a certain value, or
• below or equal to a certain value.

A one-tailed hypothesis is said to have directionality.

Fisher's original (Lady tasting tea) example was a one-tailed test. The null hypothesis was symmetric. The probability of guessing all cups correctly was the same as that of guessing all cups incorrectly, but Fisher noted that only guessing correctly was compatible with the Lady's claim. (See the quotations above about his reasoning.)
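The 1/70 figure and the symmetry of the null distribution in the Lady tasting tea example can be checked by direct counting. The sketch below assumes, as in Fisher's design, that she knows 4 cups were prepared by each method and therefore selects exactly 4 cups as belonging to one method; the variable names are illustrative.

```python
from math import comb

cups = 8
cups_per_method = 4

# Under H0 (pure guessing) every selection of 4 cups out of 8 is equally likely.
possible_selections = comb(cups, cups_per_method)

# Probability of getting exactly k of the 4 selected cups right by guessing.
null_distribution = {
    k: comb(cups_per_method, k)
    * comb(cups - cups_per_method, cups_per_method - k)
    / possible_selections
    for k in range(cups_per_method + 1)
}

print(f"possible selections: {possible_selections}")
for k, probability in null_distribution.items():
    print(f"{k} of 4 correct: {probability:.4f}")
```

The distribution is symmetric, so "all correct" and "all incorrect" are equally improbable under the null hypothesis; the test is one-tailed only because the upper tail alone supports the Lady's claim.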

## Goals of null hypothesis tests

Statistical tests can be significance tests or hypothesis tests. There are many types of significance tests for one, two or more samples, for means, variances and proportions, paired or unpaired data, for different distributions, for large and small samples... All have null hypotheses. There are also at least 4 goals of null hypotheses for significance tests:[11]

• Technical null hypotheses are used to verify statistical assumptions. Example: The residuals between the data and a statistical model cannot be distinguished from random noise. If true, there is no justification for complicating the model.
• Scientific null assumptions are used to directly advance a theory. Example: The angular momentum of the universe is zero. If not true, the theory of the early universe may need revision.
• Null hypotheses of homogeneity are used to verify that multiple experiments are producing consistent results. Example: The effect of a medication on the elderly is consistent with that of the general adult population. If true, this strengthens the general effectiveness conclusion and simplifies recommendations for use.
• Null hypotheses that assert the equality of effect of two or more alternative treatments, for example, a drug and a placebo, are used to reduce scientific claims based on statistical noise. This is the most popular null hypothesis; it is so popular that many statements about significance testing assume such null hypotheses.

Rejection of the null hypothesis is not necessarily the real goal of a significance tester. An adequate statistical model may be associated with a failure to reject the null; the model is adjusted until the null is not rejected. The numerous uses of significance testing were well known to Fisher, who discussed many in his book written a decade before defining the null hypothesis.[12]

A statistical significance test shares much mathematics with a confidence interval. They are mutually illuminating. A result is often significant when there is confidence in the sign of a relationship (the interval does not include 0). Whenever the sign of a relationship is important, statistical significance is a worthy goal. This also reveals weaknesses of significance testing: a result can be significant without a good estimate of the strength of a relationship, so significance can be a modest goal. A weak relationship can also achieve significance with enough data. Reporting both significance and confidence intervals is commonly recommended.
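The link between significance and confidence intervals can be illustrated with a simple two-sided z-test of the null hypothesis "mean = 0". This is a sketch under stated assumptions: the population standard deviation is treated as known, and the sample means and sizes are invented numbers, not data from the article.

```python
from math import sqrt
from statistics import NormalDist


def z_test_and_interval(sample_mean, sigma, n, alpha=0.05):
    """Two-sided p-value for H0: mean = 0, plus a (1 - alpha) confidence interval.

    Assumes a known population standard deviation `sigma` (a simplification).
    """
    standard_error = sigma / sqrt(n)
    z = sample_mean / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    interval = (
        sample_mean - z_critical * standard_error,
        sample_mean + z_critical * standard_error,
    )
    return p_value, interval


# The same weak effect (mean 0.02, sigma 1) observed with two sample sizes.
p_small, ci_small = z_test_and_interval(sample_mean=0.02, sigma=1.0, n=100)
p_large, ci_large = z_test_and_interval(sample_mean=0.02, sigma=1.0, n=100_000)

print(f"n=100:     p={p_small:.3f}, 95% CI=({ci_small[0]:.4f}, {ci_small[1]:.4f})")
print(f"n=100000:  p={p_large:.2e}, 95% CI=({ci_large[0]:.4f}, {ci_large[1]:.4f})")
```

In this setup the result is significant at level alpha exactly when the interval excludes 0. The larger sample makes the weak effect significant, but only the interval shows how small the effect is.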

The varied uses of significance tests reduce the number of generalizations that can be made about all applications.

## Choice of the null hypothesis

The choice of the null hypothesis is associated with sparse and inconsistent advice. Fisher mentioned few constraints on the choice and stated that many null hypotheses should be considered and that many tests are possible for each. The variety of applications and the diversity of goals suggest that the choice can be complicated. In many applications the formulation of the test is traditional. A familiarity with the range of tests available may suggest a particular null hypothesis and test. Formulating the null hypothesis is not automated; the calculations of significance testing usually are.

Caution: A statistical significance test is intended to test a hypothesis. If the hypothesis summarizes a set of data, there is no value in testing the hypothesis on that set of data. Example: If a study of last year's weather reports indicates that rain in a region falls primarily on weekends, it is only valid to test that null hypothesis on weather reports from any other year. Testing hypotheses suggested by the data is circular reasoning that proves nothing; it is a special limitation on the choice of the null hypothesis.

Routine advice: Start from the scientific hypothesis. Translate this to a statistical alternative hypothesis and proceed: "Because Ha expresses the effect that we wish to find evidence for, we often begin with Ha and then set up H0 as the statement that the hoped-for effect is not present."[2] This advice is reversed for modeling applications where we hope not to find evidence against the null.

A complex case example:[13] The gold standard in clinical research is the randomised placebo controlled double blind clinical trial. But testing a new drug against a (medically ineffective) placebo may be unethical for a serious illness. Testing a new drug against an older medically effective drug raises fundamental philosophical issues regarding the goal of the test and the motivation of the experimenters. The standard "no difference" null hypothesis may reward the pharmaceutical company for gathering inadequate data. "Difference" is a better null hypothesis in this case, but statistical significance is not an adequate criterion for reaching a nuanced conclusion which requires a good numeric estimate of the drug's effectiveness. A "minor" or "simple" proposed change in the null hypothesis ((new vs old) rather than (new vs placebo)) can have a dramatic effect on the utility of a test for complex non-statistical reasons.

### Directionality

The choice of null hypothesis (H0) and consideration of directionality (see "one-tailed test") is critical. Consider the question of whether a tossed coin is fair (i.e. that on average it lands heads up 50% of the time). A potential null hypothesis is "this coin is not biased toward heads" (one-tailed test). The experiment is to repeatedly toss the coin. A possible result of 5 tosses is 5 heads. Under this null hypothesis, the data are considered unlikely (with a fair coin, the probability of this is 1/2⁵ = 1/32 ≈ 3.1%, and the result would be even more unlikely if the coin were biased in favour of tails). The data refute the null hypothesis (that the coin is either fair or biased toward tails) and the conclusion is that the coin is biased towards heads.

Alternatively, the null hypothesis, "this coin is fair" could be examined by looking out for either too many tails or too many heads, and thus the types of outcomes that would tend to contradict this null hypothesis are those where a large number of heads or a large number of tails are observed. Thus a possible diagnostic outcome would be that all tosses yield the same outcome, and the probability of 5 of a kind is 2/32 ≈ 6% under the null hypothesis. This is not statistically significant at the 5% level, preserving the null hypothesis in this case.

This example illustrates that the conclusion reached from a statistical test may depend on the precise formulation of the null and alternative hypotheses.
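The two coin calculations can be reproduced with an exact binomial computation. The snippet below is a minimal sketch; the helper name and the 5% threshold are illustrative choices, and the two-tailed p-value is computed by summing all outcomes no more probable than the observed one (one common convention among several).

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


tosses, heads = 5, 5
alpha = 0.05

# One-tailed: H0 "not biased toward heads"; extreme means at least this many heads.
p_one_tailed = sum(binomial_pmf(k, tosses) for k in range(heads, tosses + 1))

# Two-tailed: H0 "the coin is fair"; extreme means as improbable as the observed count.
observed_probability = binomial_pmf(heads, tosses)
p_two_tailed = sum(
    binomial_pmf(k, tosses)
    for k in range(tosses + 1)
    if binomial_pmf(k, tosses) <= observed_probability + 1e-12
)

print(f"one-tailed p = {p_one_tailed:.5f}, reject: {p_one_tailed < alpha}")
print(f"two-tailed p = {p_two_tailed:.5f}, reject: {p_two_tailed < alpha}")
```

The same data give 1/32 under the one-tailed formulation and 2/32 under the two-tailed one, which is why the two formulations lead to different conclusions at the 5% level.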

Fisher said, "the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution,' of which the test of significance is the solution", implying a more restrictive domain for H0.[14] According to this view, the null hypothesis must be numerically exact: it must state that a particular quantity or difference is equal to a particular number. In classical science, it is most typically the statement that there is no effect of a particular treatment; in observations, it is typically that there is no difference between the value of a particular measured variable and that of a prediction. The majority of null hypotheses in practice do not meet this "exactness" criterion. For example, consider the usual test that two means are equal where the true values of the variances are unknown: exact values of the variances are not specified.

Most statisticians believe that it is valid to state direction as a part of null hypothesis, or as part of a null hypothesis/alternative hypothesis pair.[15] However, the result of such a test is not a full description of all the results of an experiment, merely a single result tailored to one particular purpose. For example, consider an H0 that claims the population mean for a new treatment is no improvement on a well-established treatment with population mean = 10 (known from long experience), with the one-tailed alternative being that the new treatment's mean > 10. If the sample evidence obtained through x-bar equals −200 and the corresponding t-test statistic equals −50, the conclusion from the test would be that there is no evidence that the new treatment is better than the existing one: it would not report that it is markedly worse, but that is not what this particular test is looking for. To overcome any possible ambiguity in reporting the result of the test of a null hypothesis, it is best to indicate whether the test was two-sided and, if one-sided, to include the direction of the effect being tested.

The statistical theory required to deal with the simple cases of directionality dealt with here, and more complicated ones, makes use of the concept of an unbiased test.

The directionality of hypotheses is not always obvious. The explicit null hypothesis of Fisher's Lady tasting tea example was that the Lady had no such ability, which led to a symmetric probability distribution. The one-tailed nature of the test resulted from the one-tailed alternate hypothesis (a term not used by Fisher). The null hypothesis became implicitly one-tailed. The logical negation of the Lady's one-tailed claim was also one-tailed. (Claim: Ability > 0; Stated null: Ability = 0; Implicit null: Ability ≤ 0).

Pure arguments over the use of one-tailed tests are complicated by the variety of tests. Some tests (for instance the χ² goodness of fit test) are inherently one-tailed. Some probability distributions are asymmetric. The traditional tests of 3 or more groups are two-tailed.

Advice concerning the use of one-tailed hypotheses has been inconsistent and accepted practice varies among fields.[16] The greatest objection to one-tailed hypotheses is their potential subjectivity. A non-significant result can sometimes be converted to a significant result by the use of a one-tailed hypothesis (as the fair coin test, at the whim of the analyst). The flip side of the argument: One-sided tests are less likely to ignore a real effect. One-tailed tests can suppress the publication of data that differs in sign from predictions. Objectivity was a goal of the developers of statistical tests.

Routine advice: Use two-tailed hypotheses by default: "If you do not have a specific direction firmly in mind in advance, use a two-sided alternative. Moreover, some users of statistics argue that we should always work with the two-sided alternative."[2][17]

One alternative to this advice is to use three-outcome tests. This approach eliminates the issues surrounding directionality of hypotheses by testing twice, once in each direction, and combining the results to produce three possible outcomes.[18] Variations on this approach have a history, being suggested perhaps 10 times since 1950.[19]
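A three-outcome test can be sketched for the coin example by running a one-sided test in each direction and reporting one of three conclusions. This is an illustration only: the function names are hypothetical, and giving each direction a threshold of alpha/2 is an assumption made here, not a detail taken from the text.

```python
from math import comb


def tail_probabilities(successes, trials, p_null=0.5):
    """Return (lower, upper) tail probabilities of the observed count under H0."""
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    lower = sum(pmf[: successes + 1])  # P(X <= successes)
    upper = sum(pmf[successes:])       # P(X >= successes)
    return lower, upper


def three_outcome_test(successes, trials, alpha=0.05):
    """Test once in each direction; assumes alpha / 2 per direction."""
    lower, upper = tail_probabilities(successes, trials)
    if upper <= alpha / 2:
        return "significant positive effect"
    if lower <= alpha / 2:
        return "significant negative effect"
    return "insignificant effect of unknown sign"


for heads, tosses in [(5, 5), (10, 10), (0, 10)]:
    print(f"{heads}/{tosses} heads: {three_outcome_test(heads, tosses)}")
```

Because the direction is part of the reported outcome, the analyst does not need to choose a tail in advance.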

Disagreements over one-tailed tests flow from the philosophy of science. While Fisher was willing to ignore the unlikely case of the Lady guessing all cups of tea incorrectly (which may have been appropriate for the circumstances), medicine believes that a proposed treatment that kills patients is significant in every sense and should be reported and perhaps explained. Poor statistical reporting practices have contributed to disagreements over one-tailed tests. Statistical significance resulting from two-tailed tests is insensitive to the sign of the relationship; reporting significance alone is inadequate. "The treatment has an effect" is the uninformative result of a two-tailed test. "The treatment has a beneficial effect" is the more informative result of a one-tailed test. "The treatment has an effect, reducing the average length of hospitalization by 1.5 days" is the most informative report, combining a two-tailed significance test result with a numeric estimate of the relationship between treatment and effect. Explicitly reporting a numeric result eliminates a philosophical advantage of a one-tailed test. An underlying issue is the appropriate form of an experimental science without numeric predictive theories: a model of numeric results is more informative than a model of effect signs (positive, negative or unknown), which is more informative than a model of simple significance (non-zero or unknown); in the absence of numeric theory, signs may suffice.

## References

1. "null hypothesis definition". Businessdictionary.com. Retrieved 2010-07-29.
2. Moore, David; McCabe, George (2003). Introduction to the Practice of Statistics (4 ed.). New York: W.H. Freeman and Co. p. 438. ISBN 9780716796572.
3. Weiss, Neil A. (1999). Introductory Statistics (5th ed.). p. 494. ISBN 9780201598773.
4. Gigerenzer, Gerd; Zeno Swijtink; Theodore Porter; Lorraine Daston; John Beatty; Lorenz Kruger (1989). "Part 3: The Inference Experts". The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge University Press. pp. 70–122. ISBN 978-0-521-39838-1.
5. Lehmann, E. L. (2011). Fisher, Neyman, and the creation of classical statistics. New York: Springer. ISBN 978-1441994998.
6. Neyman, J; Pearson, E. S. (January 1, 1933). "On the Problem of the most Efficient Tests of Statistical Hypotheses". Philosophical Transactions of the Royal Society A 231 (694–706): 289–337. doi:10.1098/rsta.1933.0009.
7. Aldrich, John. "Earliest Known Uses of Some of the Words of Probability & Statistics". Retrieved 30 June 2014. Last update 12 March 2003. From Jeff Miller.
8. Lehmann, E. L. (December 1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?". Journal of the American Statistical Association 88 (424): 1242–1249. doi:10.1080/01621459.1993.10476404.
9. The Design of Experiments (2 ed.). Edinburgh: Oliver and Boyd. 1937. The book was published in 9 editions from 1935 to 1971. The last two editions were published posthumously. The publisher of the 8th edition of 1966 was Hafner of Edinburgh. The publisher of the 9th edition of 1971 was Macmillan with an ISBN of 0-02-844690-9. A more recent publication was as part of Statistical methods, experimental design, and scientific inference by the Oxford University Press in 1990 with an ISBN of 0198522290. While pagination was inconsistent among editions, Fisher maintained consistent section numbering where feasible. The most relevant sections of the text are (Chapter II: The Principles of Experimentation, Illustrated by a Psycho-physical Experiment, Section 8: The Null Hypothesis) and (Chapter X: The Generalization of the Null Hypothesis).
10. Winkler, Robert L; Hays, William L (1975). Statistics : probability, inference, and decision. New York: Holt, Rinehart and Winston. p. 403. ISBN 0-03-014011-0.
11. "Statistical Significance Tests". Br. J. Clin. Pharmac. 14: 325–331. 1982.
12. Statistical Methods for Research Workers (11th Ed): Chapter IV: Tests of Goodness of Fit, Independence and Homogeneity; With Table of χ². Regarding a significance test supporting goodness of fit: If the calculated probability is high then "there is certainly no reason to suspect the [null] hypothesis tested. If it is [low] it is strongly indicated that the [null] hypothesis fails to account for the whole of the facts."
13. Jones, B; P Jarvis; J A Lewis; A F Ebbutt (6 July 1996). "Trials to assess equivalence: the importance of rigorous methods". BMJ 313: 36–39. It is suggested that the default position (the null hypothesis) should be that the treatments are not equivalent. Conclusions should be made on the basis of confidence intervals rather than significance.
14. Fisher, R.A. (1966). The design of experiments. 8th edition. Hafner:Edinburgh.
15. For example see Null hypothesis
16. Lombardi, Celia M.; Hurlbert, Stuart H. (2009). "Misprescription and misuse of one-tailed tests". Austral Ecology 34: 447–468. doi:10.1111/j.1442-9993.2009.01946.x. Discusses the merits and historical usage of one-tailed tests in biology at length.
17. Bland, J Martin; Altman, Douglas G (23 July 1994). "One and two sided tests of significance". BMJ 309: 248. With respect to medical statistics: "In general a one sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification." "Two sided tests should be used unless there is a very good reason for doing otherwise. If one sided tests are to be used the direction of the test must be specified in advance. One sided tests should never be used simply as a device to make a conventionally non-significant difference significant."
18. Jones, Lyle V.; Tukey, John W. (2000). "A Sensible Formulation of the Significance Test". Psychological Methods 5 (4): 411–414. doi:10.1037//1082-989X.5.4.411. Test results are signed: significant positive effect, significant negative effect or insignificant effect of unknown sign. This is a more nuanced conclusion than that of the two-tailed test. It has the advantages of one-tailed tests without the disadvantages.
19. Hurlbert, S. H.; Lombardi, C. M. (2009). "Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian". Ann. Zool. Fennici 46: 311–349. ISSN 1797-2450.