In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with a probable value based on other available information. Once all missing values have been imputed, the data set can then be analysed using standard techniques for complete data.
Imputation techniques 
Imputation theory is continually developing and thus requires ongoing attention to new research on the subject. Many approaches have been proposed to account for missing data, but most of them introduce substantial bias. A few of the well-known attempts to deal with missing data include: hot deck and cold deck imputation; listwise and pairwise deletion; mean imputation; regression imputation; last observation carried forward; stochastic imputation; and multiple imputation.
Case deletion 
By far the most common means of dealing with missing data is listwise deletion, in which all cases with a missing value are deleted. If the data are missing completely at random, then listwise deletion does not add any bias, but it does decrease the power of the analysis by reducing the effective sample size. For example, if 1000 cases are collected but 80 have missing values, the effective sample size for a complete-case analysis is only 920. If the cases are not missing completely at random, then listwise deletion will introduce bias, because the sub-sample of cases represented by the missing data is not representative of the original sample (and if the original sample was itself a representative sample of a population, the complete cases are not representative of that population either).
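The worked numbers above can be reproduced with a short simulation (the dataset and missingness pattern are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
})
# Knock out 80 values of x completely at random (MCAR).
missing_idx = rng.choice(n, size=80, replace=False)
df.loc[missing_idx, "x"] = np.nan

complete = df.dropna()          # listwise deletion of incomplete cases
print(len(df), len(complete))   # 1000 920
```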
Pairwise deletion (or "available case analysis") deletes a case only from those analyses that involve a variable for which the case has a missing value; the case is retained in all other analyses. As a result, the total N is not consistent across parameter estimates. Because different parameters are estimated on different subsets of cases, pairwise deletion can produce mathematically impossible results, such as correlation matrices that are not positive definite, implying correlations greater than 100%.
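A small sketch, with an invented dataset, showing how pairwise deletion gives each pair of variables a different effective N (pandas' `DataFrame.corr` performs pairwise deletion by default):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.loc[:19, "a"] = np.nan    # 20 missing values in a
df.loc[40:49, "b"] = np.nan  # 10 missing values in b

# Each entry of df.corr() is computed on a different effective N:
n_ab = df[["a", "b"]].dropna().shape[0]  # 70 complete pairs
n_ac = df[["a", "c"]].dropna().shape[0]  # 80 complete pairs
n_bc = df[["b", "c"]].dropna().shape[0]  # 90 complete pairs
print(n_ab, n_ac, n_bc)  # 70 80 90
```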
Single imputation 
A once-common method of imputation was hot-deck imputation where a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.
One form of hot-deck imputation is called "last observation carried forward" (LOCF), which involves sorting a dataset according to one or more variables (typically time within subject), thus creating an ordered dataset. The technique then finds the first missing value and imputes it with the value in the cell immediately prior. The process is repeated for the next cell with a missing value until all missing values have been imputed.
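In pandas this is a one-line operation; the sketch below uses invented repeated-measures data for one subject, already sorted by visit:

```python
import numpy as np
import pandas as pd

# Repeated measures for one subject, sorted by visit time.
s = pd.Series([4.0, np.nan, np.nan, 7.0, np.nan],
              index=["visit1", "visit2", "visit3", "visit4", "visit5"])
filled = s.ffill()           # last observation carried forward
print(filled.tolist())       # [4.0, 4.0, 4.0, 7.0, 7.0]
```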
Cold-deck imputation, by contrast, selects donors from another dataset. Since computer power has advanced rapidly and punched cards are no longer used, more sophisticated methods of imputation have generally superseded the original random and sorted hot deck imputation techniques.
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates the correlations among variables because it changes the way two variables covary: it also shrinks the variance of the imputed variable, biasing standard errors downward. That is, if two variables move in tandem (give or take a certain amount of error), then using mean imputation to fill in the missing values destroys the natural relationship between the predictor and outcome variables, since the imputed values are all fixed at the predictor's mean while the outcome still varies. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.
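Both effects, the preserved mean and the attenuated correlation, can be demonstrated with a short simulation on an invented bivariate dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)   # y strongly correlated with x
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.choice(500, size=200, replace=False), "x"] = np.nan

imputed = df["x"].fillna(df["x"].mean())      # mean imputation

corr_cc = df[["x", "y"]].dropna().corr().iloc[0, 1]  # complete cases
corr_mi = np.corrcoef(imputed, df["y"])[0, 1]        # after mean imputation
# The mean is unchanged, but the variance shrinks and the
# correlation with y is pulled toward zero.
print(round(corr_cc, 2), round(corr_mi, 2))
```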
Regression imputation has the opposite problem of mean imputation. A regression model is estimated from the complete cases to predict the variable with missing values from the other available variables, and the fitted values from this model are then used to impute the missing values. The problem is that the imputed data do not have an error term included in their estimation, so the imputed values fit perfectly along the regression line without any residual variance. This causes relationships to appear stronger than they are and suggests greater precision in the imputed values than is warranted. The regression model predicts the most likely value of the missing data but does not convey any uncertainty about that value.
Stochastic regression was a fairly successful attempt to correct the lack of an error term in regression imputation by adding a random residual draw, based on the regression's residual variance, to each imputed value. Stochastic regression shows much less bias than the above-mentioned techniques, but it still misses one source of uncertainty: if data are imputed, intuitively more noise should be introduced than the simple residual variance, because the imputation model's parameters are themselves estimated with error.
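A minimal sketch of both approaches on an invented bivariate dataset: the deterministic version places every imputed point exactly on the fitted line, while the stochastic version adds a residual draw to restore the scatter.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=400)
x = 1.5 * y + rng.normal(scale=1.0, size=400)
miss = rng.random(400) < 0.3        # ~30% of x missing at random
x_obs = x.copy()
x_obs[miss] = np.nan

# Fit the regression of x on y using complete cases only.
b1, b0 = np.polyfit(y[~miss], x_obs[~miss], 1)
fitted = b0 + b1 * y
resid_sd = np.std(x_obs[~miss] - fitted[~miss])

# Deterministic regression imputation: points lie exactly on the line.
x_reg = np.where(miss, fitted, x_obs)
# Stochastic regression imputation: add a residual draw to each fit.
x_stoch = np.where(miss, fitted + rng.normal(scale=resid_sd, size=400), x_obs)
```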
Multiple imputation 
To deal with the problem of the extra noise due to imputation, Rubin (1987) developed a method for pooling results across multiple imputed data sets. Imputation processes similar to stochastic regression are run on the same data set several times, and each imputed data set is saved and analyzed separately. The point estimates are averaged across the data sets, but the standard error (SE) is constructed from both the average within-imputation variance of the estimates and the between-imputation variance of the estimates across data sets; with m imputations, the between-imputation component is inflated by a factor of (1 + 1/m), and the square root of the resulting total variance gives the SE. In this way both the residual variance and the noise due to imputation are reflected in the final estimates.
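Rubin's pooling rules can be sketched as follows; the `pool` helper and the numbers fed to it are illustrative:

```python
import numpy as np

def pool(estimates, variances):
    """Pool m point estimates and their squared standard errors
    (within-imputation variances) using Rubin's (1987) rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    w = variances.mean()               # average within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = w + (1 + 1 / m) * b            # total variance
    return qbar, np.sqrt(t)            # estimate and its pooled SE

# Five analyses of five imputed data sets (made-up numbers).
est, se = pool([2.1, 1.9, 2.0, 2.2, 1.8], [0.04, 0.05, 0.04, 0.05, 0.04])
print(round(est, 2), round(se, 3))     # 2.0 0.272
```

Note that the pooled SE (0.272) is larger than any single imputation's SE (about 0.2-0.22), reflecting the added uncertainty from imputation.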
Maximum likelihood methods estimate the parameters directly from the observed data, typically via the expectation–maximization (EM) algorithm, which alternates between computing the expected contribution of the missing data given the current parameter estimates (the E step) and re-estimating the parameters (the M step) until the likelihood converges. Maximum likelihood estimation gives results equivalent to multiple imputation in most cases, and is often easier to apply because it requires no separate imputation step; multiple imputation, however, is applicable in a somewhat wider range of situations.
In machine learning, it is sometimes possible to train a classifier directly on data with missing values, without imputing first. This has been shown to yield better performance in cases where the missing data is structurally absent, rather than missing due to measurement noise.
References 
- Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006. Ch.25
- Enders, C.K. (2010). Applied missing data analysis. New York: Guilford Press.
- Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley & Sons.
- Rahman, M.M.; Davis, D.N. (July 2012). "Fuzzy Unordered Rules Induction Algorithm Used as Missing Value Imputation Methods for K-Mean Clustering on Real Cardiovascular Data". Proceedings of The World Congress on Engineering 2012 1 (1): 391–394
- Little, R. J. A. (1988). Missing-data adjustments in large surveys. Journal of Business and Economic Statistics, 6(3), 287-296. Retrieved from EbscoHost.
- Little, R.J.A. & Rubin, D.B. (2002). Statistical analysis with missing data, 2nd edition. New York: Wiley & Sons.
- Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581-592.
External links 
- Missing Data: Instrument-Level Heffalumps and Item-Level Woozles
- Multiple imputation FAQs, Penn State U
- A description of hot deck imputation from Statistics Finland.
- Paper extending Rao-Shao approach and discussing problems with multiple imputation.