Data dredging (also data fishing, data snooping, equation fitting and p-hacking) is the inappropriate use of data mining to uncover relationships in data that appear statistically significant but arise only by chance.
The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of a mistaken result, called the significance level. When large numbers of tests are performed, some produce false positives: by chance alone, 5% of randomly chosen hypotheses with no real underlying effect turn out to be significant at the 5% level, 1% turn out to be significant at the 1% level, and so on. When enough hypotheses are tested, it is virtually certain that some falsely appear statistically significant, since almost every data set with any degree of randomness is likely to contain some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these apparently significant results.
The multiple comparisons hazard is common in data dredging. Moreover, subgroups are sometimes explored without alerting the reader to the number of questions at issue, which can lead to misinformed conclusions.
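The claim that false positives become virtually certain can be made concrete. Under the simplifying assumptions that the tests are independent and that every null hypothesis is true, the probability of at least one false positive among m tests at level alpha is 1 − (1 − alpha)^m. The following is a minimal sketch of that arithmetic; the function name is illustrative and not taken from any library.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests at level `alpha`, assuming every null is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 10, 100, 1000):
    rate = familywise_error_rate(0.05, m)
    print(f"{m:>5} tests at the 5% level -> "
          f"P(at least one false positive) = {rate:.4f}")
```

With a 5% level, ten tests already give roughly a 40% chance of at least one spurious "finding", and a hundred tests push it above 99%. Real tests on one data set are rarely independent, so these figures are a guide rather than an exact prediction.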
Drawing conclusions from data
The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then to collect relevant data, and finally to carry out a statistical significance test to see whether the results could be due to chance. (The last step is called testing against the null hypothesis.)
A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same population, it is impossible to determine if the patterns found are chance patterns. See testing hypotheses suggested by the data.
Here is a simple example. Throwing a coin five times, with a result of 2 heads and 3 tails, might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then to throw the coin a number of times to see whether the hypothesis is rejected. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious: significance tests do not protect against data dredging.
Hypothesis suggested by non-representative data
In a list of 367 people, at least two must share the same day and month of birth, since there are only 366 possible birthdays. Such a coincidence becomes more likely than not with as few as 23 people. Suppose Mary and John both celebrate birthdays on August 7.
Data snooping would, by design, try to find additional similarities between Mary and John, such as:
- Are they the youngest and the oldest persons in the list?
- Have they met in person once? Twice? Three times?
- Do their fathers have the same first name, or mothers have the same maiden name?
By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we can almost certainly find some similarity between them. Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their life histories. Our hypothesis, biased by data snooping, can then become "People born on August 7 have a much higher chance of switching minors more than twice in college."
The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college.
However, when we turn to a larger sample of the general population and attempt to reproduce the result, we find no statistical correlation between August 7 birthdays and changing college minors more than twice. The "fact" exists only for a very small, specific sample, not for the public as a whole. See also Reproducible research.
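The birthday coincidence that opens this example can be checked directly. The sketch below assumes birthdays are independent and uniformly distributed over the days of the year, which is only an approximation of real birth patterns; the function name is illustrative.

```python
def prob_shared_birthday(num_people: int, days: int = 365) -> float:
    """Probability that at least two of `num_people` share a birthday,
    assuming independent birthdays uniform over `days` days."""
    if num_people > days:
        return 1.0  # pigeonhole principle: a shared birthday is certain
    p_all_distinct = 1.0
    for k in range(num_people):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for n in (10, 22, 23, 50):
    print(f"{n:>3} people -> P(shared birthday) = {prob_shared_birthday(n):.3f}")

# Counting February 29 as a possible birthday, 367 people guarantee a match.
print(prob_shared_birthday(367, days=366))
```

Under these assumptions the probability first exceeds one half at 23 people. A match like Mary and John's is therefore unremarkable in itself, which is why any further "pattern" found by comparing the two of them deserves suspicion.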
Bias
Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalised abacavir, since its patients were at higher risk and so more of them had heart attacks. This problem can be very severe, for example, in observational studies.
Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias. Selecting papers with a significant p-value selects against negative studies, which is publication bias.
Another aspect of the conditioning of statistical tests by knowledge of the data can be seen in linear regression, which is used frequently in data analysis. A crucial step in the process is to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see Stepwise regression) and substantive considerations that lead authors to favor some of their models over others, and there is a liberal use of statistical tests. However, discarding one or more variables from an explanatory relation on the basis of the data means that one cannot validly apply standard statistical procedures to the retained variables as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net, in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, it may also introduce bias and alter the mean squared error in estimation.
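The "fish in the net" effect can be illustrated with a small simulation. This is a simplified sketch rather than a regression analysis: it assumes every candidate variable has the same small true effect, that each estimate is the true effect plus independent Gaussian noise, and that a variable is retained only if its z-score exceeds a threshold. All parameter values and names are hypothetical.

```python
import random
import statistics


def simulate_selection(true_effect=0.2, std_error=1.0, num_variables=2000,
                       threshold=1.96, seed=0):
    """Screen noisy effect estimates and summarise the retained ones.

    Returns (mean of all estimates, mean of retained estimates,
    number of retained estimates).
    """
    rng = random.Random(seed)
    # Each estimate is the (identical) true effect plus sampling noise.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(num_variables)]
    # Preliminary test: keep only estimates that look "significant".
    retained = [e for e in estimates if e / std_error > threshold]
    mean_all = statistics.mean(estimates)
    mean_retained = statistics.mean(retained) if retained else float("nan")
    return mean_all, mean_retained, len(retained)


mean_all, mean_retained, num_retained = simulate_selection()
print("true effect of every variable: 0.2")
print(f"mean estimate, all variables:  {mean_all:.2f}")
print(f"mean estimate, retained only:  {mean_retained:.2f} "
      f"({num_retained} variables retained)")
```

The average over all variables stays close to the true effect, while the average over the retained variables is necessarily larger than the screening threshold. Treating the retained estimates as if no screening had occurred therefore overstates the effects.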
Examples in meteorology and epidemiology
In meteorology, hypotheses are often formulated using weather data up to the present (data set A) and tested against future weather data (data set B), which ensures that, even subconsciously, data set B could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.
As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm it. Note that a p-value of 0.01 means that, if the null hypothesis is true, a result at least that extreme would be obtained by chance 1% of the time; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is more likely than not to obtain at least one p-value below 0.01 even when every null hypothesis is true.
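The cancer cluster scenario can be imitated with pure noise. The sketch below generates a hypothetical cancer rate for a number of areas and a large set of demographic variables that are, by construction, unrelated to it, then counts how many correlations nevertheless reach p < 0.01. The p-value uses the Fisher z-transformation, a normal approximation chosen here so that only the standard library is needed; the sample sizes and names are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r = 0 via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_spurious_hits(num_areas=50, num_variables=1000, alpha=0.01, seed=1):
    """Count noise variables that correlate 'significantly' with the outcome."""
    rng = random.Random(seed)
    cancer_rate = [rng.gauss(0, 1) for _ in range(num_areas)]
    hits = 0
    for _ in range(num_variables):
        # Each variable is independent noise, so any hit is a false positive.
        variable = [rng.gauss(0, 1) for _ in range(num_areas)]
        if approx_p_value(pearson_r(variable, cancer_rate), num_areas) < alpha:
            hits += 1
    return hits


print("noise variables significant at 0.01:", count_spurious_hits())
```

With 1000 unrelated variables, about 1% of them, on the order of ten, are expected to pass the 0.01 threshold. Each would look like a promising lead if the number of comparisons were not reported.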
Remedies
Looking for patterns in data is legitimate. Applying a statistical test of significance (hypothesis testing) to the same data the pattern was learned from is wrong. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset, say subset A, is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid.
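The random partition into subsets A and B might be sketched as follows. The function name, the 50/50 default split, and the optional seed are illustrative choices, not part of any prescribed procedure.

```python
import random


def split_for_out_of_sample_test(records, fraction_a=0.5, seed=None):
    """Randomly partition `records` into subset A (for forming hypotheses)
    and subset B (held back for testing them)."""
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction_a)
    return shuffled[:cut], shuffled[cut:]


subset_a, subset_b = split_for_out_of_sample_test(range(20), seed=42)
print("explore on A:", sorted(subset_a))
print("confirm on B:", sorted(subset_b))
```

The split has to be made before any exploration, and subset B should be examined only once a specific hypothesis has been fixed. If B is consulted repeatedly while hypotheses are revised, it effectively becomes part of the exploratory data.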
Another remedy for data dredging is to record the number of all significance tests conducted during the experiment and divide the significance level required of each test by this number (the Bonferroni correction), which is equivalent to multiplying each p-value by the number of tests; however, this is a very conservative correction. Methods particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions, are Scheffé's method and, if the researcher has in mind only pairwise comparisons, the Tukey method. Controlling the false discovery rate is a more sophisticated approach that has become a popular way to handle multiple hypothesis tests.
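Two of these corrections can be sketched in a few lines. The first function applies the Bonferroni correction by comparing each p-value with alpha divided by the number of tests. The second applies the Benjamini–Hochberg step-up procedure, one common way to control the false discovery rate (the text above does not name a specific procedure, so this choice is an assumption). The p-values in the example are made up, and both functions assume a non-empty list.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject the hypotheses whose p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most k * alpha / m.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight tests on the same data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print("uncorrected:       ", sum(p <= 0.05 for p in p_values), "rejections")
print("Bonferroni:        ", sum(bonferroni_reject(p_values)), "rejections")
print("Benjamini-Hochberg:", sum(benjamini_hochberg_reject(p_values)), "rejections")
```

For these eight p-values, five fall below 0.05 without any correction, Bonferroni retains only the smallest, and Benjamini–Hochberg retains the two smallest. This illustrates why Bonferroni is described as conservative.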
When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.
Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.
See also
- Base rate fallacy
- Bonferroni inequalities
- Predictive analytics
- Misuse of statistics
- Lincoln–Kennedy coincidences urban legend
References
- Young, S. S.; Karr, A. (2011). "Deming, data and observational studies". Significance 8 (3).
- Davey Smith, G.; Ebrahim, S. (2002). "Data dredging, bias, or confounding". BMJ 325 (7378): 1437–1438. doi:10.1136/bmj.325.7378.1437. PMC 1124898. PMID 12493654.
- Selvin, H. C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician 20 (3): 20–23. doi:10.1080/00031305.1966.10480401. JSTOR 2681493.
- Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". Journal of Quantitative Criminology. doi:10.1007/s10940-009-9077-7.
- Ioannidis, John P.A. (August 30, 2005). "Why Most Published Research Findings Are False". PLoS Medicine (San Francisco: Public Library of Science) 2 (8): e124. doi:10.1371/journal.pmed.0020124. ISSN 1549-1277. PMC 1182327. PMID 16060722. Retrieved 2009-11-29.