# Foundations of statistics

The foundations of statistics consist of the mathematical and philosophical basis for arguments and inferences made using statistics. This includes the justification for the methods of statistical inference, estimation and hypothesis testing, the quantification of uncertainty in the conclusions of statistical arguments, and the interpretation of those conclusions in probabilistic terms. A valid foundation can be used to explain statistical paradoxes such as Simpson's paradox, provide a precise description of observed statistical laws,[1] and guide the application of statistical conclusions in social and scientific applications.

Statistical inference addresses issues related to the analysis and interpretation of data. Examples include the use of Bayesian inference versus frequentist inference; the distinction between Fisher's "significance testing" and the Neyman-Pearson "hypothesis testing"; and whether the likelihood principle should be followed. Some of these issues have been subject to unresolved debate for up to two centuries.[2] Others have achieved a pragmatic consensus for specific applications, such as the use of Bayesian methods in fitting complex ecological models.[3]

Bandyopadhyay & Forster[4] describe four statistical paradigms: classical statistics (or error statistics), Bayesian statistics, likelihood-based statistics, and the use of the Akaike Information Criterion as a statistical basis. More recently, Judea Pearl reintroduced a formal mathematics for attributing causality in statistical systems that addresses fundamental limitations of both Bayesian and Neyman-Pearson methods.

## Fisher's "significance testing" vs. Neyman–Pearson "hypothesis testing"

During the second quarter of the 20th century, the development of classical statistics led to the emergence of two competing models for inductive statistical testing.[5][6] The merits of these models were extensively debated[7] for over 25 years until Fisher's passing. Although a hybrid approach combining elements of both methods is commonly taught and utilized, the philosophical questions raised during the debate remain unresolved.

### Significance testing

Fisher played a significant role in popularizing significance testing through his publications, such as "Statistical Methods for Research Workers" in 1925 and "The Design of Experiments" in 1935.[8] His aim was to achieve scientific experimental outcomes without bias from prior opinions. Significance testing is a probabilistic form of deductive inference, akin to modus tollens. A simplified statement of the test can be described as follows: "If the evidence contradicts the hypothesis to a sufficient degree, the hypothesis is rejected." In practice, a statistic is computed based on the experimental data, and the probability of obtaining a value at least as extreme as that statistic under a default or "null" model is compared to a predetermined threshold. This threshold represents the level of discord required (typically established by convention). One common application of this method is to determine whether a treatment has a noticeable effect based on a comparative experiment. In this case, the null hypothesis corresponds to the absence of a treatment effect, implying that the treated group and the control group are drawn from the same population. Statistical significance measures probability and does not address practical significance. It can be viewed as a criterion for the statistical signal-to-noise ratio. It is important to note that the test cannot prove the hypothesis (of no treatment effect), but it can provide evidence against it. The method relies on formulating an imaginary infinite population, representing the null hypothesis, within a specified statistical model.

The Fisherian significance test involves a single hypothesis, but the choice of the test statistic requires an understanding of relevant directions of deviation from the hypothesized model.
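The comparative-experiment case described above can be sketched as an exact permutation test. This is only an illustration under stated assumptions: the measurements are hypothetical, the test statistic (difference in group means) and the one-sided direction are choices made for the example, and the 0.05 threshold is the usual convention rather than anything prescribed by the source.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(treated, control):
    """One-sided exact permutation p-value for a difference in means.

    Under the null hypothesis both groups come from the same population,
    so every relabelling of the pooled values is equally likely.
    """
    pooled = treated + control
    n_treated = len(treated)
    n_control = len(control)
    total = sum(pooled)
    observed = mean(treated) - mean(control)
    at_least_as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_treated):
        group_sum = sum(pooled[i] for i in idx)
        diff = group_sum / n_treated - (total - group_sum) / n_control
        # small tolerance so the observed split itself is always counted
        if diff >= observed - 1e-9:
            at_least_as_extreme += 1
        n_splits += 1
    return at_least_as_extreme / n_splits

# Hypothetical measurements, invented for this illustration only.
treated = [24.1, 25.3, 26.8, 27.0, 28.4]
control = [21.0, 22.5, 23.1, 24.6, 25.0]

alpha = 0.05  # conventional threshold
p_value = permutation_p_value(treated, control)
print(f"p-value = {p_value:.4f}; reject null at {alpha}: {p_value < alpha}")
```

A small p-value counts as evidence against the null model; a large one does not prove the absence of a treatment effect.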

### Hypothesis testing

Neyman and Pearson collaborated on the problem of selecting the most appropriate hypothesis based solely on experimental evidence, which differed from significance testing. Their most renowned joint paper, published in 1933,[9] introduced the Neyman-Pearson lemma, which states that a ratio of probabilities serves as an effective criterion for hypothesis selection (with the choice of the threshold being arbitrary). The paper demonstrated the optimality of the Student's t-test, one of the significance tests. Neyman believed that hypothesis testing represented a generalization and improvement of significance testing. The rationale for their methods can be found in their collaborative papers.[10]

Hypothesis testing involves considering multiple hypotheses and selecting one among them, akin to making a multiple-choice decision. The absence of evidence is not an immediate factor to be taken into account. The method is grounded in the assumption of repeated sampling from the same population (the classical frequentist assumption), although Fisher criticized this assumption (Rubin, 2020).[11]
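As a rough sketch of the Neyman-Pearson idea, the following compares two simple hypotheses about a binomial success probability. The values of `n`, `theta0`, `theta1` and `alpha` are hypothetical choices for the example. Because the likelihood ratio increases with the number of successes here, rejecting for a large ratio is the same as rejecting for a large count; the sketch uses a non-randomized test, so its size falls below `alpha` rather than matching it exactly.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def upper_tail(c, n, theta):
    """P(X >= c) for X ~ Binomial(n, theta)."""
    return sum(binom_pmf(k, n, theta) for k in range(c, n + 1))

# Hypothetical setup: H0: theta = 0.5 versus H1: theta = 0.7.
n, theta0, theta1, alpha = 20, 0.5, 0.7, 0.05

# Likelihood ratio of H1 to H0 for each possible outcome.
ratios = [binom_pmf(k, n, theta1) / binom_pmf(k, n, theta0) for k in range(n + 1)]
ratio_is_increasing = all(a < b for a, b in zip(ratios, ratios[1:]))

# Smallest critical value whose Type I error rate does not exceed alpha.
critical = next(c for c in range(n + 2) if upper_tail(c, n, theta0) <= alpha)

size = upper_tail(critical, n, theta0)   # Type I error rate
power = upper_tail(critical, n, theta1)  # 1 - Type II error rate
print(f"reject H0 when X >= {critical}")
print(f"size = {size:.4f}, power = {power:.4f}, Type II error = {1 - power:.4f}")
```

The same power calculation, run over a range of `n`, is the basis of sample size determination for confirmatory tests.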

### Grounds of disagreement

The duration of the dispute allowed for a comprehensive discussion of various fundamental issues in the field of statistics.

#### Fisher's attack[12]

• Repeated sampling of the same population

• Type II errors, which result from an alternative hypothesis

• Inductive behavior

#### Neyman's rebuttal[13]

Fisher's theory of fiducial inference is flawed.

A purely probabilistic theory of tests requires an alternative hypothesis. Fisher's attacks on Type II errors have faded with time. In the intervening years, statistics has separated the exploratory from the confirmatory. In the current environment, the concept of Type II errors is used in power calculations for sample size determination in confirmatory hypothesis tests.

#### Discussion

Fisher's attack based on frequentist probability failed but was not without result. He identified a specific case (2×2 table) where the two schools of testing reached different results. This case is one of several that are still troubling. Commentators believe that the "right" answer is context-dependent.[14] Fiducial probability has not fared well, being virtually without advocates, while frequentist probability remains a mainstream interpretation.

Fisher's attack on inductive behavior has been largely successful because he selected the field of battle. While operational decisions are routinely made on a variety of criteria (such as cost), scientific conclusions from experimentation are typically made based on probability alone.

During this exchange, Fisher also discussed the requirements for inductive inference, specifically criticizing cost functions that penalize erroneous judgments. Neyman countered by mentioning the use of such functions by Gauss and Laplace. These arguments occurred 15 years after textbooks began teaching a hybrid theory of statistical testing.

Fisher and Neyman held different perspectives on the foundations of statistics (though they both opposed the Bayesian viewpoint):[14]

• The interpretation of probability
  • The disagreement between Fisher's inductive reasoning and Neyman's inductive behavior reflected the Bayesian-frequentist divide. Fisher was willing to revise his opinion (reaching a provisional conclusion) based on calculated probability, while Neyman was more inclined to adjust his observable behavior (making a decision) based on computed costs.
• The appropriate formulation of scientific questions, with a particular focus on modeling[7][15]
• Whether it is justifiable to reject a hypothesis based on a low probability without knowing the probability of an alternative
• Whether a hypothesis could ever be accepted based solely on data
  • In mathematics, deduction proves, while counter-examples disprove.
  • In the Popperian philosophy of science, progress is made when theories are disproven.
• Subjectivity: Although Fisher and Neyman endeavored to minimize subjectivity, they both acknowledged the significance of "good judgment." Each accused the other of subjectivity.
  • Fisher subjectively selected the null hypothesis.
  • Neyman-Pearson subjectively determined the criterion for selection (which was not limited to probability).
  • Both subjectively established numeric thresholds.

Fisher and Neyman diverged in their attitudes and, perhaps, their language. Fisher was a scientist and an intuitive mathematician, and inductive reasoning came naturally to him. Neyman, on the other hand, was a rigorous mathematician who relied on deductive reasoning rather than probability calculations based on experiments.[5] Hence, there was an inherent clash between applied and theoretical approaches (between science and mathematics).

### Related history

In 1938, Neyman relocated to the West Coast of the United States of America, effectively ending his collaboration with Pearson and their work on hypothesis testing.[5] Subsequent developments in the field were carried out by other researchers.

By 1940, textbooks began presenting a hybrid approach that combined elements of significance testing and hypothesis testing.[16] However, none of the main contributors were directly involved in the further development of the hybrid approach currently taught in introductory statistics.[6]

Statistics subsequently branched out into various directions, including decision theory, Bayesian statistics, exploratory data analysis, robust statistics, and non-parametric statistics. Neyman-Pearson hypothesis testing made significant contributions to decision theory, which is widely employed, particularly in statistical quality control. Hypothesis testing also extended its applicability to incorporate prior probabilities, giving it a Bayesian character. While Neyman-Pearson hypothesis testing has evolved into an abstract mathematical subject taught at the post-graduate level,[17] much of what is taught and used in undergraduate education under the umbrella of hypothesis testing can be attributed to Fisher.

### Contemporary opinion

There have been no major conflicts between the two classical schools of testing in recent decades, although occasional criticism and disputes persist. However, it is highly unlikely that one theory of statistical testing will completely supplant the other in the foreseeable future.

The hybrid approach, which combines elements from both competing schools of testing, can be interpreted in different ways. Some view it as an amalgamation of two mathematically complementary ideas,[14] while others see it as a flawed union of philosophically incompatible concepts.[18] Fisher's approach had certain philosophical advantages, while Neyman and Pearson emphasized rigorous mathematics. Hypothesis testing remains a subject of controversy for some users, but the most widely accepted alternative method, confidence intervals, is based on the same mathematical principles.

Due to the historical development of testing, there is no single authoritative source that fully encompasses the hybrid theory as it is commonly practiced in statistics. Additionally, the terminology used in this context may lack consistency. Empirical evidence indicates that individuals, including students and instructors in introductory statistics courses, often have a limited understanding of the meaning of hypothesis testing.[19]

### Summary

• The interpretation of probability remains unresolved, although fiducial probability is not widely embraced.
• Neither of the test methods has been completely abandoned, as they are extensively utilized for different objectives.
• Textbooks have integrated both test methods into the framework of hypothesis testing.
• Some mathematicians argue, with a few exceptions, that significance tests can be considered a specific instance of hypothesis tests.
• On the other hand, some perceive these problems and methods as separate or incompatible.
• The ongoing dispute has harmed statistical education.

## Bayesian inference versus frequentist inference

Two distinct interpretations of probability have existed for a long time, one based on objective evidence and the other on subjective degrees of belief. The debate between Gauss and Laplace could have taken place more than 200 years ago, giving rise to two competing schools of statistics. Classical inferential statistics emerged primarily during the second quarter of the 20th century,[6] largely in response to the controversial principle of indifference used in Bayesian probability at that time. The resurgence of Bayesian inference was a reaction to the limitations of frequentist probability, leading to further developments and reactions.

While the philosophical interpretations have a long history, the specific statistical terminology is relatively recent. The terms "Bayesian" and "frequentist" became standardized in the second half of the 20th century.[20] However, the terminology can be confusing, as the "classical" interpretation of probability aligns with Bayesian principles, while "classical" statistics follows the frequentist approach. Moreover, even within the term "frequentist," there are variations in interpretation, differing between philosophy and physics.

The intricate details of philosophical probability interpretations are explored elsewhere. In the field of statistics, these alternative interpretations allow for the analysis of different datasets using distinct methods based on various models, aiming to achieve slightly different objectives. When comparing the competing schools of thought in statistics, pragmatic criteria beyond philosophical considerations are taken into account.

### Major contributors

Fisher and Neyman were significant figures in the development of frequentist (classical) methods.[5] While Fisher had a unique interpretation of probability that differed from Bayesian principles, Neyman adhered strictly to the frequentist approach. In the realm of Bayesian statistical philosophy, mathematics, and methods, de Finetti,[21] Jeffreys,[22] and Savage[23] emerged as notable contributors during the 20th century. Savage played a crucial role in popularizing de Finetti's ideas in English-speaking regions and establishing rigorous Bayesian mathematics. In 1965, Dennis Lindley's two-volume work titled "Introduction to Probability and Statistics from a Bayesian Viewpoint" played a vital role in introducing Bayesian methods to a wide audience. Over three generations, statistics has progressed significantly, and the views of early contributors are not necessarily considered authoritative in present times.

### Contrasting approaches

#### Frequentist inference

The earlier description briefly highlights frequentist inference, which encompasses Fisher's "significance testing" and Neyman-Pearson's "hypothesis testing." Frequentist inference incorporates various perspectives and allows for scientific conclusions, operational decisions, and parameter estimation with or without confidence intervals.

#### Bayesian inference

A classical frequency distribution provides information about the probability of the observed data. By applying Bayes' theorem, a more abstract concept is introduced, which involves estimating the probability of a hypothesis (associated with a theory) given the data. This concept, formerly referred to as "inverse probability," is realized through Bayesian inference. Bayesian inference involves updating the probability estimate for a hypothesis as new evidence becomes available. It explicitly considers both the evidence and prior beliefs, enabling the incorporation of multiple sets of evidence.
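A minimal sketch of this updating process, assuming a hypothetical coin with three candidate biases and a uniform prior (none of these numbers come from the source). It applies Bayes' theorem in its simplest form, posterior proportional to prior times likelihood, and checks that updating on two batches of evidence in sequence gives the same result as updating once on the combined evidence.

```python
def bernoulli_likelihood(theta, heads, tails):
    """Probability of a particular sequence with the given counts, for bias theta."""
    return theta**heads * (1 - theta) ** tails

def update(prior, heads, tails):
    """Return the posterior over hypotheses after observing the given counts."""
    unnormalized = {
        theta: weight * bernoulli_likelihood(theta, heads, tails)
        for theta, weight in prior.items()
    }
    total = sum(unnormalized.values())
    return {theta: w / total for theta, w in unnormalized.items()}

# Hypothetical hypotheses about a coin's bias, with a uniform prior.
prior = {0.3: 1 / 3, 0.5: 1 / 3, 0.7: 1 / 3}

# Two batches of hypothetical evidence, incorporated one after the other.
after_first = update(prior, heads=4, tails=1)
sequential = update(after_first, heads=3, tails=2)

# The same evidence incorporated in a single step.
combined = update(prior, heads=7, tails=3)

for theta in prior:
    print(f"theta={theta}: sequential={sequential[theta]:.4f}, combined={combined[theta]:.4f}")
```

The posterior from the first batch serves as the prior for the second, which is how the approach accommodates multiple sets of evidence.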

#### Comparisons of characteristics

Frequentists and Bayesians employ distinct probability models. Frequentists typically view parameters as fixed but unknown, whereas Bayesians assign probability distributions to these parameters. As a result, Bayesians discuss probabilities that frequentists do not acknowledge. Bayesians consider the probability of a theory, whereas true frequentists can only assess the evidence's consistency with the theory. For instance, a frequentist does not claim a 95% probability that the true value of a parameter falls within a confidence interval; rather, they state that 95% of confidence intervals constructed by the same procedure over repeated samples encompass the true value.
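The frequentist reading of a confidence interval can be illustrated with a small simulation. The sketch below assumes a normal population with known standard deviation; the true mean, sample size, number of repetitions and random seed are hypothetical values picked for the example. The parameter stays fixed while the intervals vary from sample to sample, and the quantity of interest is the fraction of intervals that cover it.

```python
import random
from statistics import NormalDist

def coverage(true_mean=10.0, sigma=2.0, n=25, reps=2000, level=0.95, seed=0):
    """Fraction of z-intervals, over repeated samples, that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n**0.5
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        centre = sum(sample) / n
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / reps

estimated = coverage()
print(f"fraction of intervals covering the true mean: {estimated:.3f}")
```

The printed fraction should be close to 0.95, which is a statement about the procedure rather than about any single interval.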

| | Bayesian | Frequentist |
|---|---|---|
| Basis | Belief (prior) | Behavior (method) |
| Resulting characteristic | Principled philosophy | Opportunistic methods |
| Distributions | One distribution | Many distributions (bootstrap?) |
| Ideal application | Dynamic (repeated sampling) | Static (one sample) |
| Target audience | Individual (subjective) | Community (objective) |
| Modeling characteristic | Aggressive | Defensive |

Alternative comparison[25][26]

| | Bayesian | Frequentist |
|---|---|---|
| Strengths | Complete; coherent; prescriptive; strong inference from model | Inferences well calibrated; no need to specify prior distributions; flexible range of procedures; strong model formulation & assessment; unbiasedness, sufficiency, ancillarity...; widely applicable and dependable; asymptotic theory; easy to interpret; can be calculated by hand |
| Weaknesses | Too subjective for scientific inference; denies the role of randomization in design; requires and relies on full specification of a model (likelihood and prior); weak model formulation & assessment | Incomplete; ambiguous; incoherent; not prescriptive; no unified theory; potential overemphasis on asymptotic properties; weak inference from model |

### Mathematical results

Both the frequentist and Bayesian schools are subject to mathematical critique, and neither readily embraces such criticism. For instance, Stein's paradox highlights the intricacy of determining a "flat" or "uninformative" prior probability distribution in high-dimensional spaces.[2] While Bayesians perceive this as tangential to their fundamental philosophy, they find frequentism plagued with inconsistencies, paradoxes, and unfavorable mathematical behavior. Frequentists can account for most of these issues. Certain "problematic" scenarios, like estimating the weight variability of a herd of elephants based on a single measurement ("Basu's elephants"), exemplify extreme cases that defy statistical estimation. The likelihood principle has been a contentious arena of debate.
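Stein's paradox is usually demonstrated with the James-Stein estimator, and the sketch below follows that standard demonstration rather than anything specific to the source. It assumes a single observation of a 10-dimensional normal vector with unit variance in each coordinate; the true mean vector, number of repetitions and seed are hypothetical. The unshrunk observation is the maximum likelihood estimate (and the posterior mean under a flat prior), yet shrinking it toward the origin gives a lower average total squared error.

```python
import random

def james_stein(x):
    """Shrink the observation vector toward the origin (unit variance assumed)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1 - (p - 2) / norm_sq
    return [factor * v for v in x]

def squared_error(estimate, truth):
    """Total squared error summed over all coordinates."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

def compare_risks(theta, reps=2000, seed=0):
    """Monte Carlo estimate of the risk of the raw observation and of James-Stein."""
    rng = random.Random(seed)
    mle_total = 0.0
    js_total = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        mle_total += squared_error(x, theta)
        js_total += squared_error(james_stein(x), theta)
    return mle_total / reps, js_total / reps

theta = [1.0] * 10  # hypothetical true means
mle_risk, js_risk = compare_risks(theta)
print(f"risk of raw observation: {mle_risk:.2f}")
print(f"risk of James-Stein:     {js_risk:.2f}")
```

The improvement holds in three or more dimensions and applies to the total error, not to each coordinate separately.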

### Statistical results

Both the frequentist and Bayesian schools have demonstrated notable accomplishments in addressing practical challenges. Classical statistics, with its reliance on mechanical calculators and specialized printed tables, boasts a longer history of obtaining results. Bayesian methods, on the other hand, have shown remarkable efficacy in analyzing sequentially sampled information, such as radar and sonar data. Several Bayesian techniques, as well as certain recent frequentist methods like the bootstrap, necessitate the computational capabilities that have become widely accessible in the past few decades. There is an ongoing discourse regarding the integration of Bayesian and frequentist approaches,[25] although concerns have been raised regarding the interpretation of results and the potential diminishment of methodological diversity.

### Philosophical results

Bayesians share a common stance against the limitations of frequentism, but they are divided into various philosophical camps (empirical, hierarchical, objective, personal, and subjective), each emphasizing different aspects. A philosopher of statistics from the frequentist perspective has observed a shift from the statistical domain to philosophical interpretations of probability over the past two generations.[27] Some perceive that the successes achieved with Bayesian applications do not sufficiently justify the associated philosophical framework.[28] Bayesian methods often develop practical models that deviate from traditional inference and have minimal reliance on philosophy.[29] Neither the frequentist nor the Bayesian philosophical interpretations of probability can be considered entirely robust. The frequentist view is criticized for being overly rigid and restrictive, while the Bayesian view can encompass both objective and subjective elements, among others.

### Illustrative quotations

• "Carefully used, the frequentist approach yields broadly applicable if sometimes clumsy answers"[30]
• "To insist on unbiased [frequentist] techniques may lead to negative (but unbiased) estimates of variance; the use of p-values in multiple tests may lead to blatant contradictions; conventional 0.95 confidence regions may consist of the whole real line. No wonder that mathematicians find it often difficult to believe that conventional statistical methods are a branch of mathematics."[31]
• "Bayesianism is a neat and fully principled philosophy, while frequentism is a grab-bag of opportunistic, individually optimal, methods."[24]
• "Bayes' rule says there is a simple, elegant way to combine current information with prior experience to state how much is known. It implies that sufficiently good data will bring previously disparate observers to an agreement. It makes full use of available information, and it produces decisions having the least possible error rate."[32]
• "Bayesian statistics is about making probability statements, frequentist statistics is about evaluating probability statements."[33]
• "Statisticians are often put in a setting reminiscent of Arrow’s paradox, where we are asked to provide estimates that are informative and unbiased and confidence statements that are correct conditional on the data and also on the underlying true parameter."[33] (These are conflicting requirements.)
• "Formal inferential aspects are often a relatively small part of statistical analysis"[30]
• "The two philosophies, Bayesian and frequentist, are more orthogonal than antithetical."[24]
• "A hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure."[22]

### Summary

• Bayesian theory has a mathematical advantage.
• Frequentist probability has existence and consistency problems.
• But finding good priors to apply Bayesian theory remains (very?) difficult.
• Both theories have impressive records of successful application.
• Neither the philosophical interpretation of probability nor its support is robust.
• There is increasing skepticism about the connection between application and philosophy.
• Some statisticians are recommending active collaboration (beyond a cease-fire).

## The likelihood principle

In common usage, likelihood is often considered synonymous with probability. However, according to statistics, this is not the case. In statistics, probability refers to variable data given a fixed hypothesis, whereas likelihood refers to variable hypotheses given a fixed set of data. For instance, when making repeated measurements with a ruler under fixed conditions, each set of observations corresponds to a probability distribution, and the observations can be seen as a sample from that distribution, following the frequentist interpretation of probability. On the other hand, a set of observations can also arise from sampling various distributions based on different observational conditions. The probabilistic relationship between a fixed sample and a variable distribution stemming from a variable hypothesis is referred to as likelihood, representing the Bayesian view of probability. For instance, a set of length measurements may represent readings taken by observers with specific characteristics and conditions.
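The distinction between the two readings can be sketched with a single binomial formula evaluated in two ways. The numbers below (10 trials, 7 successes, a success probability of 0.5, and the grid resolution) are hypothetical. With the hypothesis fixed and the data varying, the values form a probability distribution that sums to one; with the data fixed and the hypothesis varying, the same formula gives a likelihood function, which need not integrate to one.

```python
from math import comb

def binom(k, n, theta):
    """Binomial formula, read either as a probability of k or a likelihood of theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n = 10

# Probability: fixed hypothesis (theta = 0.5), variable data (k = 0..n).
probabilities = [binom(k, n, 0.5) for k in range(n + 1)]
total_probability = sum(probabilities)

# Likelihood: fixed data (k = 7), variable hypothesis (theta on a grid).
observed = 7
grid = [i / 100 for i in range(1, 100)]
likelihood = [binom(observed, n, theta) for theta in grid]
best_theta = grid[likelihood.index(max(likelihood))]

# Midpoint-rule approximation of the area under the likelihood function.
steps = 10_000
area = sum(binom(observed, n, (i + 0.5) / steps) for i in range(steps)) / steps

print(f"sum of probabilities over data: {total_probability:.6f}")
print(f"theta maximizing the likelihood: {best_theta}")
print(f"area under the likelihood function: {area:.6f}")
```

Only ratios of likelihood values between hypotheses carry meaning here, which is why the likelihood function is typically defined up to a constant factor.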

Likelihood is a concept that was introduced and developed by Fisher over a span of more than 40 years, although earlier references to the concept exist and Fisher's support for it was not wholehearted.[34] The concept was subsequently accepted and substantially revised by Jeffreys.[35] In 1962, Birnbaum "proved" the likelihood principle based on premises that were widely accepted among statisticians,[36] although his proof has been subject to dispute by statisticians and philosophers. Notably, by 1970, Birnbaum had rejected one of these premises (the conditionality principle) and had also abandoned the likelihood principle due to their incompatibility with the frequentist "confidence concept of statistical evidence."[37][38] The likelihood principle asserts that all the information in a sample is contained within the likelihood function, which is considered a valid probability distribution by Bayesians but not by frequentists.

Certain significance tests employed by frequentists are not consistent with the likelihood principle. Bayesians, on the other hand, embrace the principle as it aligns with their philosophical standpoint (perhaps in response to frequentist discomfort). The likelihood approach is compatible with Bayesian statistical inference, where the posterior Bayes distribution for a parameter is derived by multiplying the prior distribution by the likelihood function using Bayes' theorem.[34] Frequentists interpret the likelihood principle unfavorably, as it suggests a lack of concern for the reliability of evidence. The likelihood principle, according to Bayesian statistics, implies that information about the experimental design used to collect evidence does not factor into the statistical analysis of the data.[39] Some Bayesians, including Savage,[citation needed] acknowledge this implication as a vulnerability.
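A textbook illustration of this tension, not taken from the source, is the coin-tossing stopping-rule example sketched below. The same data (9 heads and 3 tails) are analysed under two hypothetical designs: a fixed total of 12 tosses, and tossing until the third tail appears. The two likelihood functions differ only by a constant factor, so the likelihood principle treats the evidence as identical, yet the one-sided p-values for a fair coin differ.

```python
from math import comb

heads, tails = 9, 3
n = heads + tails

def binomial_likelihood(theta):
    """Design A: toss exactly n times and count the heads."""
    return comb(n, heads) * theta**heads * (1 - theta) ** tails

def negative_binomial_likelihood(theta):
    """Design B: toss until the 3rd tail, which therefore lands on the last toss."""
    return comb(n - 1, tails - 1) * theta**heads * (1 - theta) ** tails

# The ratio of the two likelihoods does not depend on theta.
ratios = [
    binomial_likelihood(t) / negative_binomial_likelihood(t)
    for t in (0.2, 0.5, 0.8)
]

# One-sided p-values under a fair coin (theta = 0.5).
# Design A: at least 9 heads in 12 tosses, i.e. 299/4096.
p_binomial = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
# Design B: at least 9 heads before the 3rd tail, which is the same event as
# at most 2 tails in the first 11 tosses, i.e. 67/2048.
p_negative_binomial = sum(comb(n - 1, j) for j in range(tails)) / 2 ** (n - 1)

print(f"likelihood ratios: {ratios}")
print(f"p-value, fixed number of tosses: {p_binomial:.4f}")
print(f"p-value, toss until third tail:  {p_negative_binomial:.4f}")
```

At a conventional 0.05 threshold the two designs lead to different conclusions from identical data, which is the sense in which the experimental design matters to the significance test but not to a likelihood-based analysis.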

The likelihood principle's staunchest proponents argue that it provides a more solid foundation for statistics compared to the alternatives presented by Bayesian and frequentist approaches.[40] These supporters include some statisticians and philosophers of science.[41] While Bayesians recognize the importance of likelihood for calculations, they contend that the posterior probability distribution serves as the appropriate basis for inference.[42]

## Modeling

Inferential statistics relies on statistical models. Classical hypothesis testing, for instance, has often relied on the assumption of data normality. To reduce reliance on this assumption, robust and nonparametric statistics have been developed. Bayesian statistics, on the other hand, interprets new observations based on prior knowledge, assuming continuity between the past and present. The experimental design assumes some knowledge of the factors to be controlled, varied, randomized, and observed. Statisticians are aware of the challenges in establishing causation, often stating that "correlation does not imply causation," which is more of a limitation in modeling than a mathematical constraint.

As statistics and data sets have become more complex,[a][b] questions have arisen regarding the validity of models and the inferences drawn from them. There is a wide range of conflicting opinions on modeling.

Models can be based on scientific theory or ad hoc data analysis, each employing different methods. Advocates exist for each approach.[44] Model complexity is a trade-off and less subjective approaches such as the Akaike information criterion and Bayesian information criterion aim to strike a balance.[45]
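The complexity trade-off can be sketched by scoring two nested models with both criteria. The data below are hypothetical, the models (a constant mean and a straight line, each with Gaussian errors) are chosen only for illustration, and the formulas used are the standard ones: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k counts the fitted parameters including the error variance.

```python
from math import log, pi

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, using the ML variance estimate RSS/n."""
    n = len(residuals)
    sigma_sq = sum(r * r for r in residuals) / n
    return -0.5 * n * (log(2 * pi * sigma_sq) + 1)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * log(n) - 2 * log_lik

# Hypothetical data with a roughly linear trend.
x = list(range(10))
y = [1.1, 1.4, 2.1, 2.4, 3.2, 3.4, 4.1, 4.6, 4.9, 5.6]
n = len(y)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Model 1: constant mean (parameters: mean, variance).
residuals_constant = [yi - mean_y for yi in y]

# Model 2: straight line fitted by least squares (parameters: intercept, slope, variance).
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
residuals_linear = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

ll_constant = gaussian_log_likelihood(residuals_constant)
ll_linear = gaussian_log_likelihood(residuals_linear)

aic_constant, aic_linear = aic(ll_constant, 2), aic(ll_linear, 3)
bic_constant, bic_linear = bic(ll_constant, 2, n), bic(ll_linear, 3, n)

print(f"constant: AIC={aic_constant:.2f}, BIC={bic_constant:.2f}")
print(f"linear:   AIC={aic_linear:.2f}, BIC={bic_linear:.2f}")
```

Lower values are preferred under both criteria; the extra parameter of the linear model is penalized, but here the improvement in fit outweighs the penalty.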

Concerns have been raised even about simple regression models used in the social sciences, as a multitude of assumptions underlying model validity are often neither mentioned nor verified. In some cases, a favorable comparison between observations and the model is considered sufficient.[46]

Bayesian statistics focuses so tightly on the posterior probability that it ignores the fundamental comparison of observations and model.[29]

Traditional observation-based models often fall short in addressing many significant problems, requiring the utilization of a broader range of models, including algorithmic ones. "If the model is a poor emulation of nature, the conclusions may be wrong."[47]

Modeling is frequently carried out inadequately, with improper methods employed, and the reporting of models is often subpar.[48]

Given the lack of a strong consensus on the philosophical review of statistical modeling, many statisticians adhere to the cautionary words of George Box: "All models are wrong, but some are useful."

For a concise introduction to the fundamentals of statistics, refer to Stuart, A.; Ord, J.K. (1994). "Ch. 8 – Probability and statistical inference" in Kendall's Advanced Theory of Statistics, Volume I: Distribution Theory (6th ed.), published by Edward Arnold.

In his book Statistics as Principled Argument, Robert P. Abelson presents the perspective that statistics serves as a standardized method for resolving disagreements among scientists, who could otherwise engage in endless debates about the merits of their respective positions. From this standpoint, statistics can be seen as a form of rhetoric. However, the effectiveness of statistical methods depends on the consensus among all involved parties regarding the chosen approach.[49]

## Footnotes

a. ^ Some large models attempt to predict the behavior of voters in the United States of America. The population is around 300 million. Each voter may be influenced by many factors. For some of the complications of voter behavior (most easily understood by the natives) see: Gelman[43]
b. ^ Efron (2013) mentions millions of data points and thousands of parameters from scientific studies.[24]

## Citations

1. ^ Kitcher & Salmon (2009) p.51
2. ^ a b
3. ^ van de Schoot, Rens; Depaoli, Sarah; King, Ruth; Kramer, Bianca; Märtens, Kaspar; Tadesse, Mahlet G.; Vannucci, Marina; Gelman, Andrew; Veen, Duco; Willemsen, Joukje; Yau, Christopher (2021-01-14). "Bayesian statistics and modelling". Nature Reviews Methods Primers. 1 (1). doi:10.1038/s43586-020-00001-2. hdl:20.500.11820/9fc72a0b-33e4-4a9c-bdb7-d88dab16f621. ISSN 2662-8449.
4. ^
5. ^ a b c d
6. ^ a b c
7. ^ a b
8. ^
9. ^
10. ^
11. ^ Rubin, M. (2020). "'Repeated sampling from the same population?' A critique of Neyman and Pearson's responses to Fisher". European Journal for Philosophy of Science. 10 (42): 1–15. doi:10.1007/s13194-020-00309-6. S2CID 221939887.
12. ^
13. ^
14. ^ a b c
15. ^
16. ^
17. ^
18. ^
19. ^
20. ^
21. ^
22. ^ a b
23. ^
24. ^ a b c d
25. ^ a b
26. ^
27. ^
28. ^
29. ^ a b
30. ^ a b c
31. ^
32. ^
33. ^ a b
34. ^ a b
35. ^
36. ^
37. ^ Birnbaum, A. (1970) Statistical Methods in Scientific Inference. Nature, 225, 14 March 1970, p. 1033.
38. ^ Giere, R. (1977) Allan Birnbaum's Conception of Statistical Evidence. Synthese, 36, pp.5-13.
39. ^
40. ^
41. ^
42. ^
43. ^ Gelman. "Red-blue talk UBC" (PDF). Statistics. Columbia U. Archived (PDF) from the original on 2013-10-06. Retrieved 2013-09-16.
44. ^
45. ^
46. ^
47. ^
48. ^
49. ^ Abelson, Robert P. (1995). Statistics as Principled Argument. Lawrence Erlbaum Associates. ISBN 978-0-8058-0528-4. ... the purpose of statistics is to organize a useful argument from quantitative evidence, using a form of principled rhetoric.