# Missing data

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Missing data can occur because of nonresponse: no information is provided for one or more items (item nonresponse), or no information is provided for a whole unit (unit nonresponse). Some items are more prone to nonresponse than others, for example items about private subjects such as income.

Dropout is a type of missingness that occurs mostly when studying development over time. In this type of study, the measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the study ends, so that one or more measurements are missing.

Sometimes missing values are caused by the researcher—for example, when data collection is done improperly or mistakes are made in data entry.[1] Data often are missing in research in economics, sociology, and political science because governments choose not to, or fail to, report critical statistics.[2]

## Types of missing data

Understanding the reasons why data are missing can help with analyzing the remaining data. If values are missing at random, the data sample may still be representative of the population. But if the values are missing systematically, analysis may be harder. For example, in a study of the relation between IQ and income, participants with an above-average IQ might tend to skip the question ‘What is your salary?’ Analysis may falsely show no association between IQ and salary, while in fact there may be a relationship. Because of these problems, methodologists routinely advise researchers to design studies to minimize the incidence of missing values.[1]

### Missing completely at random

Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random.[3] When data are MCAR, the analyses performed on the data are unbiased; however, data are rarely MCAR.

### Missing at random

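Missing at random (MAR) is a weaker assumption than MCAR: the missingness may be related to other, observed variables, but it is not related to the value of the variable that has missing data.[3] For example, if older respondents are less likely to report their income, but among respondents of the same age the probability of nonresponse does not depend on income, then income is missing at random given age.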

### Missing not at random

Missing not at random (MNAR) describes data that are missing for a specific reason (i.e. the value of the variable that is missing is related to the reason it is missing).[3] An example is when certain questions on a questionnaire tend to be skipped deliberately by participants with certain characteristics.
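
The difference between the three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the variables `age` and `income`, the linear relation between them, and the missingness probabilities are all hypothetical and are not taken from any cited study. It compares the mean of all simulated incomes with the mean of the incomes that remain observed.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (mean of all incomes, mean of observed incomes).

    All numbers here are made up for illustration.
    """
    rng = random.Random(seed)
    incomes, observed = [], []
    for _ in range(n):
        age = rng.uniform(20, 60)                 # always observed
        income = 1000 * age + rng.gauss(0, 5000)  # may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                       # same for everyone
        elif mechanism == "MAR":
            # depends only on the observed variable (age)
            p_missing = 0.5 if age > 40 else 0.1
        elif mechanism == "MNAR":
            # depends on the value that goes missing (income)
            p_missing = 0.5 if income > 40000 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        incomes.append(income)
        if rng.random() >= p_missing:
            observed.append(income)
    return statistics.fmean(incomes), statistics.fmean(observed)


if __name__ == "__main__":
    for mechanism in ("MCAR", "MAR", "MNAR"):
        full, obs = simulate(mechanism)
        print(f"{mechanism}: full mean {full:.0f}, observed mean {obs:.0f}")
```

In this setup the observed mean stays close to the full mean under MCAR, while under MAR and MNAR the naive observed mean is pulled downward. Under MAR the distortion can in principle be corrected by conditioning on the observed variable (here, age); under MNAR it cannot be corrected from the observed data alone without further assumptions.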

## Techniques for dealing with missing data

Missing data reduce the representativeness of the sample and can therefore distort inferences about the population. Where possible, researchers should consider how to prevent missing data before data gathering begins. For example, computer questionnaires often do not allow a question to be skipped: a question has to be answered before the respondent can continue to the next. This type of questionnaire eliminates missing values due to the participant, though the method may not be permitted by an ethics board overseeing the research. In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to persuade those who have decided not to participate to change their minds (Stoop et al. 2010: 161-187). However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the people who are willing to be persuaded to participate after initially refusing or not being home are likely to differ significantly from the people who still refuse or remain unreachable after additional effort (Stoop et al. 2010: 188-198).

In situations where missing data are likely to occur, the researcher is often advised to plan to use methods of data analysis that are robust to missingness. An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias or distortion in the conclusions drawn about the population.

### Imputation

If the data analysis technique to be used is known not to be robust to missingness, it is worth considering imputing the missing data. This can be done in several ways; multiple imputation is recommended. Rubin (1987) argued that even a small number (5 or fewer) of repeated imputations enormously improves the quality of estimation.[1]

For many practical purposes, 2 or 3 imputations capture most of the relative efficiency that could be captured with a larger number of imputations. However, too few imputations can lead to a substantial loss of statistical power, and some scholars now recommend 20 to 100 or more.[4] Any multiply-imputed data analysis must be repeated for each of the imputed data sets and, in some cases, the relevant statistics must be combined in a relatively complicated way.[1]
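
As a rough illustration of how results from several imputed data sets are combined, the following sketch applies the pooling formulas commonly known as Rubin's rules to a scalar estimate: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function names and the example numbers are hypothetical, and degrees-of-freedom corrections are omitted. The second function evaluates Rubin's approximate relative efficiency, 1 / (1 + γ/m), where γ (the fraction of missing information) is an assumed input.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)      # within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, math.sqrt(total)


def relative_efficiency(m, gamma):
    """Approximate efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + gamma / m)


if __name__ == "__main__":
    # Hypothetical estimates from m = 3 imputed data sets.
    print(pool_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0]))
    for m in (3, 5, 20):
        print(m, round(relative_efficiency(m, gamma=0.3), 3))
```

With an assumed γ of 0.3, three imputations already give a relative efficiency of about 0.91, which is consistent with the point that a few imputations capture most of the efficiency, even though power considerations may call for more.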

Examples of imputations are listed below.

### Partial imputation

The expectation-maximization algorithm is an approach in which values of the statistics which would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data-items are not usually imputed.
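
A minimal sketch of this idea, assuming a bivariate normal model in which `x` is always observed and `y` is sometimes missing, is given below. The E-step replaces the sufficient statistics involving `y` (the sums of y, y² and xy) by their conditional expectations given `x`, and the M-step recomputes the mean and covariance estimates from them. The function names, the fixed iteration count, and the simulated data are illustrative assumptions only.

```python
import random


def em_bivariate_normal(xs, ys, iterations=200):
    """Estimate means and covariances when some entries of ys are None."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    observed = [y for y in ys if y is not None]
    # Starting values from the observed part of y only.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                # E-step: expected y and y^2 given x and current parameters.
                e_y = mu_y + slope * (x - mu_x)
                e_yy = e_y ** 2 + resid_var
            else:
                e_y, e_yy = y, y * y
            sum_y += e_y
            sum_yy += e_yy
            sum_xy += x * e_y
        # M-step: update parameters from the expected sufficient statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


def make_example(n=5000, seed=0):
    """Simulated data: y = 2 + 3x + noise, with y missing whenever x > 0.5."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 2 + 3 * x + rng.gauss(0, 1)
        xs.append(x)
        ys.append(None if x > 0.5 else y)
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_example()
    print(em_bivariate_normal(xs, ys))
```

Note that the missing `y` values are never filled in individually; only the expected contributions to the summary statistics are accumulated. In the simulated example the mean of the observed `y` values alone is far below the true mean of 2, because `y` is missing for large `x`.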

### Partial deletion

Methods which involve reducing the data available to a dataset having no missing values include:
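
Two widely used deletion approaches are listwise deletion (dropping every record that has any missing value) and pairwise deletion (dropping, for each pair of variables, only the records missing one of those two). The sketch below illustrates both on a small, entirely hypothetical data set in which `None` marks a missing value; the field names are made up for the example.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "iq": 110},
    {"age": 41, "income": None, "iq": 125},
    {"age": None, "income": 38000, "iq": 98},
    {"age": 29, "income": 45000, "iq": None},
    {"age": 50, "income": 61000, "iq": 105},
]


def listwise_delete(records):
    """Keep only records in which every variable is observed."""
    return [r for r in records if all(v is not None for v in r.values())]


def pairwise_pairs(records, a, b):
    """Keep the (a, b) value pairs for records where both a and b are observed."""
    return [(r[a], r[b]) for r in records
            if r[a] is not None and r[b] is not None]


if __name__ == "__main__":
    print(len(listwise_delete(rows)))                # complete cases
    print(len(pairwise_pairs(rows, "age", "income")))  # usable for this pair
```

Pairwise deletion retains more records for each individual statistic, at the cost that different statistics are computed on different subsets of the data.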

### Full analysis

Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:

### Interpolation

In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points.
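
As a simple illustration, the sketch below fills interior gaps in an equally spaced series by linear interpolation between the nearest known neighbours. The function name is hypothetical, the equal spacing is an assumption, and leading or trailing gaps are deliberately left unfilled because filling them would be extrapolation rather than interpolation.

```python
def interpolate_linear(values):
    """Fill interior None gaps linearly; leave leading/trailing gaps as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


if __name__ == "__main__":
    print(interpolate_linear([1.0, None, None, 4.0, None]))
    # → [1.0, 2.0, 3.0, 4.0, None]
```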

## Model-based techniques

Model-based techniques, often using graphs, offer additional tools for testing missing data types (MCAR, MAR, MNAR) and for estimating parameters under missing data conditions. For example, a test for refuting MAR/MCAR reads as follows:

For any three variables X, Y, and Z where Z is fully observed and X and Y are partially observed, the data should satisfy: $X \perp\!\!\!\perp R_y |(R_x,Z)$.

In words, the observed portion of X should be independent of the missingness status of Y, conditional on every value of Z. Failure to satisfy this condition indicates that the problem belongs to the MNAR category.[5]

(Remark: These tests are necessary for variable-based MAR which is a slight variation of event-based MAR.[6][7][8])

When data fall into the MNAR category, techniques are available for consistently estimating parameters when certain conditions hold in the model.[9] For example, if Y explains the reason for missingness in X and Y itself has missing values, the joint probability distribution of X and Y can still be estimated if the missingness of Y is random. The estimand in this case will be:

$$
\begin{aligned}
P(X,Y) &= P(X|Y)\,P(Y) \\
&= P(X|Y,R_x=0,R_y=0)\,P(Y|R_y=0)
\end{aligned}
$$

where $R_x=0$ and $R_y=0$ denote the observed portions of their respective variables.

Different model structures may yield different estimands and different procedures of estimation whenever consistent estimation is possible. The preceding estimand calls for first estimating $P(X|Y)$ from complete data and multiplying it by $P(Y)$ estimated from cases in which Y is observed regardless of the status of X. Moreover, in order to obtain a consistent estimate it is crucial that the first term be $P(X|Y)$ as opposed to $P(Y|X)$.
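
The estimation procedure just described can be sketched for binary variables as follows. This is only an illustration under an assumed model: Y causes the missingness of X, Y itself is missing completely at random, and all probabilities in the simulation are made up. As in the text, R = 0 marks an observed value, and a missing value is stored as `None`. For comparison, the sketch also computes the naive estimate that uses only records in which both variables are observed.

```python
import random
from collections import Counter


def simulate(n=50000, seed=1):
    """Hypothetical data: the missingness of X depends on Y; Y is MCAR."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < 0.4 else 0
        x = 1 if rng.random() < (0.8 if y else 0.3) else 0
        r_y = 1 if rng.random() < 0.3 else 0                  # independent of everything
        r_x = 1 if rng.random() < (0.6 if y else 0.1) else 0  # caused by Y
        rows.append((x if r_x == 0 else None, y if r_y == 0 else None))
    return rows


def estimate_joint(rows):
    """P(X,Y) = P(X | Y, R_x=0, R_y=0) * P(Y | R_y=0)."""
    y_counts = Counter(y for _, y in rows if y is not None)
    n_y = sum(y_counts.values())
    both = Counter((x, y) for x, y in rows if x is not None and y is not None)
    joint = {}
    for y in (0, 1):
        n_both_y = both[(0, y)] + both[(1, y)]
        for x in (0, 1):
            p_x_given_y = both[(x, y)] / n_both_y   # from complete records
            p_y = y_counts[y] / n_y                 # from all records with Y observed
            joint[(x, y)] = p_x_given_y * p_y
    return joint


def complete_case_joint(rows):
    """Naive estimate using only records with both X and Y observed."""
    both = Counter((x, y) for x, y in rows if x is not None and y is not None)
    total = sum(both.values())
    return {key: count / total for key, count in both.items()}


if __name__ == "__main__":
    data = simulate()
    print(estimate_joint(data))
    print(complete_case_joint(data))
```

In this simulation the true value of P(X=1, Y=1) is 0.4 × 0.8 = 0.32. The factorized estimate should land close to that value, whereas the complete-case estimate is noticeably too low because records with Y = 1 lose their X value more often.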

In many cases, model-based techniques permit the model structure to undergo refutation tests.[8] Any model which implies the independence between a partially observed variable X and the missingness indicator of another variable Y (i.e. $R_y$), conditional on $R_x$, can be submitted to the following refutation test: $X \perp\!\!\!\perp R_y | R_x =0$.

Finally, the estimands that emerge from these techniques are derived in closed form and do not require iterative procedures such as expectation-maximization that are susceptible to local optima.[10]

## References

1. ^ a b c d
2. Messner, S.F. (1992). "Exploring the Consequences of Erratic Data Reporting for Cross-National Research on Homicide". Journal of Quantitative Criminology 8 (2): 155–173. doi:10.1007/bf01066742.
3. Polit, D.F.; Beck, C.T. (2012). Nursing Research: Generating and Assessing Evidence for Nursing Practice, 9th ed. Philadelphia, USA: Wolters Kluwer Health, Lippincott Williams & Wilkins.
4. Graham, J.W.; Olchowski, A.E.; Gilreath, T.D. (2007). "How Many Imputations Are Really Needed? Some Practical Clarifications of Multiple Imputation Theory". Prevention Science 8 (3): 208–213. doi:10.1007/s11121-007-0070-9.
5. Mohan, Karthika; Pearl, Judea (2014). "On the testability of models with missing data". Proceedings of AISTAT-2014, Forthcoming.
6. Darwiche, Adnan (2009). Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
7. Potthoff, R.F.; Tudor, G.E.; Pieper, K.S.; Hasselblad, V. (2006). "Can one assess whether missing data are missing at random in medical studies?". Statistical Methods in Medical Research 15 (3): 213–234.
8. Pearl, Judea; Mohan, Karthika (2013). "Recoverability and Testability of Missing data: Introduction and Summary of Results" (Technical report). UCLA Computer Science Department, R-417. Available at http://ftp.cs.ucla.edu/pub/stat_ser/r417.pdf
9. Mohan, Karthika; Pearl, Judea; Tian, Jin (2013). "Graphical Models for Inference with Missing Data". Advances in Neural Information Processing Systems 26. pp. 1277–1285.
10. Mohan, K.; Van den Broeck, G.; Choi, A.; Pearl, J. (2014). "An Efficient Method for Bayesian Network Parameter Learning from Incomplete Data". Presented at Causal Modeling and Machine Learning Workshop, ICML-2014.