# Causal inference

Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed.[1][2] The science of why things occur is called etiology. Causal inference is an example of causal reasoning.

## Definition

Inferring the cause of something has been described as:

• "...reason[ing] to the conclusion that something is, or is likely to be, the cause of something else".[3]
• "Identification of the cause or causes of a phenomenon, by establishing covariation of cause and effect, a time-order relationship with the cause preceding the effect, and the elimination of plausible alternative causes."[4]

## Methods

Epidemiological studies employ different methods of collecting and measuring evidence of risk factors and effects, and different ways of measuring the association between the two. A hypothesis is formulated and then tested with statistical methods. Statistical inference helps decide whether a pattern in the data is due to chance (random variation) or reflects a genuine correlation and, if so, how strong that correlation is. However, correlation does not imply causation, so further methods must be used to infer causation.
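The gap between correlation and causation can be illustrated with a small simulation. The sketch below is not drawn from any cited source: the confounder `z`, the variable names, and the unit-variance Gaussian noise are all assumptions chosen for illustration. A hidden variable drives both `x` and `y`, so the two are correlated (the theoretical value in this setup is 0.5) even though `x` has no effect on `y`. When `x` is instead set independently of `z`, the correlation should fall to roughly zero.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical confounder z drives both x and y; x has no effect on y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]
r_observed = pearson(x, y)

# Intervention: x is now set by the experimenter, independently of z.
x_set = [rng.gauss(0, 1) for _ in range(n)]
y_after = [zi + rng.gauss(0, 1) for zi in z]
r_intervened = pearson(x_set, y_after)

print(f"observational correlation:      {r_observed:.2f}")
print(f"correlation under intervention: {r_intervened:.2f}")
```

The second correlation is the one that reflects what happens to the effect variable when the putative cause is changed.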

Common frameworks for causal inference are structural equation modeling and the Rubin causal model.
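The Rubin causal model is usually described in terms of potential outcomes: each unit has one outcome it would show if treated and another if untreated, and only one of the two is ever observed. The following is a minimal sketch of that idea on simulated data. The constant effect `TRUE_EFFECT`, the variable names, and the Gaussian outcomes are assumptions made for illustration, not part of the framework itself.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 10_000
TRUE_EFFECT = 2.0  # assumed constant treatment effect, for illustration only

# Each unit has two potential outcomes: y0 (untreated) and y1 (treated).
y0 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [v + TRUE_EFFECT for v in y0]

# Random assignment; only one potential outcome per unit is ever observed.
treated = [rng.random() < 0.5 for _ in range(n)]
observed = [b if t else a for a, b, t in zip(y0, y1, treated)]

treated_outcomes = [o for o, t in zip(observed, treated) if t]
control_outcomes = [o for o, t in zip(observed, treated) if not t]

# Under randomization the difference in group means estimates the
# average treatment effect, even though no unit reveals both outcomes.
ate_estimate = (sum(treated_outcomes) / len(treated_outcomes)
                - sum(control_outcomes) / len(control_outcomes))
print(f"estimated average treatment effect: {ate_estimate:.2f}")
```

Because assignment here is random, the two groups are comparable on average. With observational data that comparability has to be argued for or constructed.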

## In epidemiology

Epidemiology studies patterns of health and disease in defined populations of living beings in order to infer causes and effects. An association between exposure to a putative risk factor and a disease may be suggestive of causality but is not equivalent to it, because correlation does not imply causation. Historically, Koch's postulates have been used since the 19th century to decide whether a microorganism was the cause of a disease. In the 20th century, the Bradford Hill criteria, described in 1965,[5] have been used to assess causality of variables outside microbiology, although even these criteria are not exclusive ways to determine causality.

In molecular epidemiology, the phenomena studied are at the level of molecular biology, including genetics, and biomarkers serve as evidence of causes or effects.

A recent trend is to identify evidence for the influence of the exposure on molecular pathology within diseased tissue or cells, in the emerging interdisciplinary field of molecular pathological epidemiology (MPE). Linking the exposure to molecular pathologic signatures of the disease can help to assess causality. Given the inherent heterogeneity of any given disease (the unique disease principle), disease phenotyping and subtyping are trends in biomedical and public health sciences, exemplified by personalized medicine and precision medicine.

## In computer science

Determining cause and effect from joint observational data for two time-independent variables, say X and Y, has been tackled by exploiting an asymmetry between the evidence for some model in the two directions, X → Y and Y → X. The primary approaches are based on algorithmic information theory models and noise models.

### Algorithmic information models

Compare two programs, both of which output both X and Y.

• Store Y and a compressed form of X in terms of uncompressed Y.
• Store X and a compressed form of Y in terms of uncompressed X.

The shorter of the two programs suggests that its uncompressed stored variable more likely causes the computed variable.[6][7]
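The comparison of the two programs can be sketched with a crude, computable stand-in for program length. The example below is an illustration under stated assumptions and is not the method of the cited papers. It stores one variable at a fixed precision (`RESOLUTION`, an assumed value) and stores the other as residuals from a bin-wise mean prediction, then measures each part by its empirical entropy in bits per sample. The helper names, the quadratic toy data, and the omission of the model cost (which is the same in both directions here) are all choices made for this sketch.

```python
import math
import random
from collections import Counter

RESOLUTION = 0.05  # assumed precision at which every value is stored

def entropy_bits(values):
    """Empirical entropy (bits per value) after rounding to RESOLUTION."""
    counts = Counter(math.floor(v / RESOLUTION) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def binned_residuals(stored, computed, bins=20):
    """Residuals of `computed` after subtracting its mean per bin of `stored`."""
    lo, hi = min(stored), max(stored)
    width = (hi - lo) / bins or 1.0
    index = [min(int((s - lo) / width), bins - 1) for s in stored]
    sums, counts = [0.0] * bins, [0] * bins
    for i, c in zip(index, computed):
        sums[i] += c
        counts[i] += 1
    return [c - sums[i] / counts[i] for i, c in zip(index, computed)]

def description_length(stored, computed):
    """Bits per sample: the stored variable plus the other one given it."""
    return entropy_bits(stored) + entropy_bits(binned_residuals(stored, computed))

rng = random.Random(5)  # fixed seed so the run is repeatable
n = 4000
# Toy data in which X is the cause: Y is a noisy quadratic function of X.
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [xi ** 2 + rng.gauss(0, 0.05) for xi in x]

cost_xy = description_length(x, y)  # store X, encode Y in terms of X
cost_yx = description_length(y, x)  # store Y, encode X in terms of Y
print(f"store X, compress Y: {cost_xy:.2f} bits per sample")
print(f"store Y, compress X: {cost_yx:.2f} bits per sample")
```

The quadratic relationship makes this an easy case, because Y says little about the sign of X. On real data the two costs can be close, and the result depends on the chosen precision and model class.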

### Noise models

Incorporate an independent noise term in the model to compare the evidence for the two directions.

Here are some of the noise models for the hypothesis X → Y with the noise E:

• Additive noise:[8] $Y = F(X) + E$
• Linear noise:[9] $Y = pX + qE$
• Post-non-linear:[10] $Y = G(F(X) + E)$
• Heteroskedastic noise: $Y = F(X) + E \cdot G(X)$
• Functional noise:[11] $Y = F(X, E)$

The common assumptions in these models are:

• There are no other causes of Y.
• X and E have no common causes.
• The distribution of the cause is independent of the causal mechanism.
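As an illustration of how such a model can separate the two directions, the sketch below simulates the linear noise model with p = q = 1 and uniform (non-Gaussian) noise, with X as the cause. It fits a least-squares line in each direction and checks whether the residuals look independent of the input. The dependence score used here, the absolute correlation between squared residuals and the squared centred input, is an assumption of this sketch and a crude substitute for a proper independence test. All function and variable names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def ols_residuals(inp, out):
    """Residuals of a least-squares line (with intercept) of out on inp."""
    mi, mo = mean(inp), mean(out)
    slope = (sum((a - mi) * (b - mo) for a, b in zip(inp, out))
             / sum((a - mi) ** 2 for a in inp))
    return [(b - mo) - slope * (a - mi) for a, b in zip(inp, out)]

def dependence_score(inp, out):
    """Crude dependence measure between the fit residuals and the input."""
    res = ols_residuals(inp, out)
    mi = mean(inp)
    return abs(pearson([r * r for r in res], [(a - mi) ** 2 for a in inp]))

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 5000
x = [rng.uniform(-1, 1) for _ in range(n)]   # cause
e = [rng.uniform(-1, 1) for _ in range(n)]   # independent non-Gaussian noise
y = [xi + ei for xi, ei in zip(x, e)]        # Y = pX + qE with p = q = 1

score_x_to_y = dependence_score(x, y)  # residuals of Y given X
score_y_to_x = dependence_score(y, x)  # residuals of X given Y
inferred = "X -> Y" if score_x_to_y < score_y_to_x else "Y -> X"
print(f"X -> Y score: {score_x_to_y:.3f}")
print(f"Y -> X score: {score_y_to_x:.3f}")
print("inferred direction:", inferred)
```

In the causal direction the residuals are just the noise term, so the score should be near zero. In the reverse direction the residuals are uncorrelated with the input but not independent of it. If both X and E were Gaussian, the two directions would be indistinguishable by this kind of check.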

On an intuitive level, the idea is that factorizing the joint distribution P(Cause, Effect) into P(Cause)*P(Effect | Cause) typically yields models of lower total complexity than factorizing it into P(Effect)*P(Cause | Effect). Although the notion of “complexity” is intuitively appealing, it is not obvious how it should be precisely defined.[11] A different family of methods attempts to discover causal "footprints" from large amounts of labeled data, allowing the prediction of more flexible causal relations.[12]

## In statistics and economics

In statistics and economics, causality is often tested for using regression. Several methods can be used to distinguish actual causality from spurious indications of causality. First, the explanatory variable could be one that conceptually could not be caused by the dependent variable, thereby avoiding the possibility of being misled by reverse causation: for example, when the independent variable is rainfall and the dependent variable is the futures price of some agricultural commodity. Second, the instrumental variables technique may be employed to remove any reverse causation by introducing a role for other variables (instruments) that are known to be unaffected by the dependent variable. Third, the principle that effects cannot precede causes can be invoked by including on the right side of the regression only variables that precede the dependent variable in time. Fourth, other regressors are included to ensure that confounding variables are not causing a regressor to spuriously appear to be significant. Correlation by coincidence, as opposed to correlation reflecting actual dependence in the underlying process, can be ruled out by using large samples and by performing cross-validation to check that correlations are maintained on data that were not used in the regression.
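The instrumental variables idea can be illustrated with a simulated system in which x and y influence each other. In the sketch below, the coefficients `BETA` and `GAMMA`, the instrument `z`, and the unit-variance Gaussian errors are all assumptions chosen for the example. The instrument shifts x but is unaffected by y. With a single instrument, the estimator reduces to a ratio of covariances.

```python
import random

def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 20_000
BETA = 0.5   # assumed true effect of x on y
GAMMA = 0.5  # assumed reverse effect of y on x

z = [rng.gauss(0, 1) for _ in range(n)]    # instrument: shifts x only
e_x = [rng.gauss(0, 1) for _ in range(n)]
e_y = [rng.gauss(0, 1) for _ in range(n)]

# Solve the simultaneous system  y = BETA*x + e_y,  x = GAMMA*y + z + e_x.
x = [(GAMMA * ey + zi + ex) / (1 - GAMMA * BETA)
     for ey, zi, ex in zip(e_y, z, e_x)]
y = [BETA * xi + ey for xi, ey in zip(x, e_y)]

# Ordinary least squares is biased because x depends on e_y through y.
ols_slope = covariance(x, y) / covariance(x, x)
# The instrument is independent of e_y, so this ratio targets BETA.
iv_slope = covariance(z, y) / covariance(z, x)

print(f"OLS slope: {ols_slope:.3f}")
print(f"IV slope:  {iv_slope:.3f}")
```

In this setup the ordinary regression slope settles near 0.67 rather than 0.5, because part of the association runs from y back to x. The instrument-based estimate should land close to the assumed value of `BETA`.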

## In social science

The social sciences have moved increasingly toward a quantitative framework for assessing causality. Much of this has been described as a means of providing greater rigor to social science methodology. Political science was significantly influenced by the publication of Designing Social Inquiry, by Gary King, Robert Keohane, and Sidney Verba, in 1994. King, Keohane, and Verba (often abbreviated as KKV) recommended that researchers applying both quantitative and qualitative methods adopt the language of statistical inference to be clearer about their subjects of interest and units of analysis.[13][14] Proponents of quantitative methods have also increasingly adopted the potential outcomes framework, developed by Donald Rubin, as a standard for inferring causality.

Debates over the appropriate application of quantitative methods to infer causality resulted in increased attention to the reproducibility of studies. Critics of widely practiced methodologies argued that researchers have engaged in p-hacking to publish articles on the basis of spurious correlations.[15] To prevent this, some have advocated that researchers preregister their research designs before conducting their studies, so that they do not inadvertently overemphasize a non-reproducible finding that was not the initial subject of inquiry but was found to be statistically significant during data analysis.[16] Internal debates about methodology and reproducibility within the social sciences have at times been acrimonious.
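Why unplanned comparisons produce spurious findings can be shown with a short simulation. The sketch below is an illustration only: the study sizes, the 5% threshold, and the use of a Fisher z-test for correlation are assumptions. Each simulated study tests 20 pure-noise predictors against a pure-noise outcome. If the tests are independent, the chance that at least one is "significant" is 1 − 0.95^20, or about 64%.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def is_significant(xs, ys, z_crit=1.96):
    """Approximate two-sided 5% test of zero correlation (Fisher z)."""
    z_stat = math.atanh(pearson(xs, ys)) * math.sqrt(len(xs) - 3)
    return abs(z_stat) > z_crit

rng = random.Random(4)  # fixed seed so the run is repeatable
n_studies, n_predictors, n_obs = 300, 20, 50

studies_with_hit = 0
for _ in range(n_studies):
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]  # pure noise
        hits += is_significant(predictor, outcome)
    studies_with_hit += hits > 0

share = studies_with_hit / n_studies
expected = 1 - 0.95 ** n_predictors
print(f"studies with at least one 'significant' predictor: {share:.2f}")
print(f"expected under independence: {expected:.2f}")
```

A preregistered design commits to one of those comparisons in advance, so the chance of a false positive stays at the nominal 5%.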

While much of the emphasis remains on statistical inference in the potential outcomes framework, social science methodologists have developed new tools to conduct causal inference with both qualitative and quantitative methods, sometimes called a “mixed methods” approach.[17][18] Advocates of diverse methodological approaches argue that different methodologies are better suited to different subjects of study. Sociologist Herbert Smith and political scientists James Mahoney and Gary Goertz have cited the observation of Paul Holland, a statistician and author of the 1986 article “Statistics and Causal Inference,” that statistical inference is most appropriate for assessing the “effects of causes” rather than the “causes of effects.”[19][20] Qualitative methodologists have argued that formalized models of causation, including process tracing and fuzzy set theory, provide opportunities to infer causation by identifying critical factors within case studies or by comparing several case studies.[14] These methodologies are also valuable for subjects in which a limited number of potential observations or the presence of confounding variables would limit the applicability of statistical inference.

## References

1. Pearl, Judea (1 January 2009). "Causal inference in statistics: An overview" (PDF). Statistics Surveys. 3: 96–146. doi:10.1214/09-SS057.
2. Morgan, Stephen; Winship, Chris (2007). Counterfactuals and Causal inference. Cambridge University Press. ISBN 978-0-521-67193-4.
3. "causal inference". Encyclopædia Britannica, Inc. Retrieved 24 August 2014.
4. Shaughnessy, John; Zechmeister, Eugene; Zechmeister, Jeanne (2000). Research Methods in Psychology. McGraw-Hill Humanities/Social Sciences/Languages. Chapter 1: Introduction. ISBN 978-0077825362. Archived from the original on 15 October 2014. Retrieved 24 August 2014.
5. Hill, Austin Bradford (1965). "The Environment and Disease: Association or Causation?". Proceedings of the Royal Society of Medicine. 58 (5): 295–300. doi:10.1177/003591576505800503. PMC 1898525. PMID 14283879.
6. Budhathoki, Kailash; Vreeken, Jilles (2016). "Causal Inference by Compression". 2016 IEEE 16th International Conference on Data Mining (ICDM).
7. Marx, Alexander; Vreeken, Jilles (2018). "Telling cause from effect by local and global regression". Knowledge and Information Systems. 60 (3): 1277–1305. doi:10.1007/s10115-018-1286-7.
8. Hoyer, Patrik O.; et al. (2008). "Nonlinear causal discovery with additive noise models". NIPS. Vol. 21.
9. Shimizu, Shohei; et al. (2011). "DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model" (PDF). The Journal of Machine Learning Research. 12: 1225–1248.
10. Zhang, Kun; Hyvärinen, Aapo (2009). "On the identifiability of the post-nonlinear causal model". Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press.
11. Mooij, Joris M.; et al. (2010). "Probabilistic latent variable models for distinguishing between cause and effect". NIPS.
12. Lopez-Paz, David; et al. (2015). "Towards a learning theory of cause-effect inference". ICML.
13. King, Gary (2012). Designing social inquiry: scientific inference in qualitative research. Princeton Univ. Press. ISBN 978-0691034713. OCLC 754613241.
14. Mahoney, James (January 2010). "After KKV". World Politics. 62 (1): 120–147. doi:10.1017/S0043887109990220. JSTOR 40646193.
15. Dominus, Susan (18 October 2017). "When the Revolution Came for Amy Cuddy". The New York Times. ISSN 0362-4331. Retrieved 2 March 2019.
16. "The Statistical Crisis in Science". American Scientist. 6 February 2017. Retrieved 18 April 2019.
17. Creswell, John W.; Clark, Vicki L. Plano (2011). Designing and Conducting Mixed Methods Research. SAGE Publications. ISBN 9781412975179.
18. Seawright, Jason (September 2016). Multi-Method Social Science. Cambridge Core. doi:10.1017/CBO9781316160831. ISBN 9781316160831. Retrieved 18 April 2019.
19. Smith, Herbert L. (10 February 2014). "Effects of Causes and Causes of Effects: Some Remarks from the Sociological Side". Sociological Methods and Research. 43 (3): 406–415. doi:10.1177/0049124114521149. PMC 4251584. PMID 25477697.
20. Goertz, Gary; Mahoney, James (2006). "A Tale of Two Cultures: Contrasting Quantitative and Qualitative Research". Political Analysis. 14 (3): 227–249. doi:10.1093/pan/mpj017. ISSN 1047-1987.