Imputation (genetics)


In genetics, imputation is the statistical inference of unobserved genotypes.[1] It is achieved by using known haplotypes in a population, for instance from the HapMap or the 1000 Genomes Project in humans. This makes it possible to test for association between a trait of interest (e.g. a disease) and genetic variants that were not typed experimentally but whose genotypes have been statistically inferred ("imputed").[2] Genotype imputation is usually performed on SNPs, the most common kind of genetic variation.

Genotype imputation hence helps to narrow down the location of probable causal variants in genome-wide association studies, because it increases the SNP density (the genome size remains constant, but the number of genetic variants available for analysis increases) and thus reduces the distance between adjacent SNPs.


In genetic epidemiology and quantitative genetics, researchers aim to identify genomic locations where variation between individuals is associated with variation in traits of interest. Such studies hence require access to the genetic makeup of a set of individuals. Sequencing the whole genome of each individual in the study is often too costly, so only a subset of the genome is measured. This often means, first, considering only single-nucleotide polymorphisms (SNPs) and neglecting copy number variants, and second, measuring only SNPs known to be variable enough in the population that they are likely to be variable in the set of individuals under consideration as well. The most informative subset of SNPs is chosen based on the distribution of common genetic variation along the genome, for instance as produced by the HapMap or the 1000 Genomes Project in humans. These SNPs are then used to build a micro-array, which allows each individual in the study to be genotyped at all these SNPs simultaneously.

Genotyping arrays used for genome-wide association studies (GWAS) are based on tagging SNPs and therefore do not directly genotype all variation in the genome. Imputing the array genotypes against a reference panel that has been genotyped for a greater number of variants extends the coverage of genomic variation beyond the original genotypes. As a consequence, one can assess the effect of more SNPs than those on the original micro-array. Importantly, imputation has facilitated meta-analysis of datasets that have been genotyped on different arrays, by increasing the overlap of variants available for analysis between arrays.
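The idea of filling in untyped markers from a reference panel can be illustrated with a deliberately simplified sketch. It is not the method used by any of the tools named in this article: it copies missing alleles from the single reference haplotype that best matches the typed markers. The function name `impute_nearest`, the six-marker panel and the target haplotype are hypothetical and chosen only for illustration.

```python
def impute_nearest(target, reference):
    """Fill untyped markers in `target` from the closest reference haplotype.

    target:    list of 0/1 alleles, with None where the marker is untyped.
    reference: list of fully typed haplotypes (lists of 0/1) of the same length.
    """
    # Positions that were actually genotyped on the (hypothetical) array.
    typed = [i for i, allele in enumerate(target) if allele is not None]

    def mismatches(haplotype):
        # Number of typed markers at which this reference haplotype disagrees.
        return sum(haplotype[i] != target[i] for i in typed)

    # Toy assumption: the best-matching haplotype is copied along the whole region.
    best = min(reference, key=mismatches)
    return [best[i] if allele is None else allele
            for i, allele in enumerate(target)]


# Hypothetical reference panel: three haplotypes typed at six markers.
reference_panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
]
# Hypothetical array data: only markers 0, 2 and 4 were genotyped.
array_haplotype = [1, None, 0, None, 0, None]

imputed = impute_nearest(array_haplotype, reference_panel)
print(imputed)
```

Here the second reference haplotype agrees with all three typed markers, so its alleles are copied into the untyped positions. Real imputation software instead models the target as a mosaic of several reference haplotypes and reports genotype probabilities rather than hard calls.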

As whole-genome sequencing (WGS) becomes cheaper, imputation has found another use case: it can improve low-coverage WGS data by filling gaps and low-confidence regions. In this use case, imputation from low-coverage WGS data achieves higher accuracy than imputation from SNP array data.[3] Imputation on low-coverage WGS is reasonably accurate for non-African ancient human genomes down to 0.5× coverage.[4]


Several software packages are available to impute genotypes from a genotyping array against reference panels, such as the 1000 Genomes Project haplotypes. These tools include MaCH,[5] Minimac, IMPUTE2[6] and Beagle.[7] Each tool has its own trade-offs between speed and accuracy.[8] Additional phasing tools such as SHAPEIT2[9] allow prephasing of input haplotypes for improved imputation accuracy and computational performance.

In early imputation usage, haplotypes from HapMap populations were used as a reference panel. These have since been superseded by haplotypes from the 1000 Genomes Project,[10] which provide reference panels with more samples, across more diverse populations, and with greater genetic marker density. As of mid-2014, whole-genome sequence data is publicly available from the 1000 Genomes Project website[11] for 2535 individuals from 26 different populations around the world.

Statistical models

Designing accurate statistical models for genotype imputation is closely related to the problem of haplotype estimation ("phasing") and is an active area of research.[12] Imputation is almost always preceded by a phasing step.[1][3] As of 2022, all modern phasing and imputation software is based on the Li & Stephens hidden Markov model construct.[13]
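The following is a minimal sketch of a hidden Markov model in the Li & Stephens style, under strong simplifying assumptions: a single phased (haploid) target, biallelic markers, one constant switch probability `rho` between adjacent markers, and one constant mismatch probability `eps`. The function name, parameter names and default values are hypothetical; production tools use genetic maps, diploid data and many optimizations not shown here. The hidden state at each marker is the reference haplotype being copied, and the forward-backward algorithm yields, per marker, the posterior probability of carrying allele 1.

```python
def li_stephens_impute(ref, target, rho=0.05, eps=0.01):
    """Posterior probability of allele 1 at every marker of a haploid target.

    ref:    list of reference haplotypes (lists of 0/1), all the same length.
    target: list of 0/1 alleles, with None where the marker is untyped.
    rho:    assumed probability of switching the copied haplotype per step.
    eps:    assumed probability that an observed allele differs from the copy.
    """
    n_hap, n_mark = len(ref), len(target)

    def emission(m):
        # Untyped markers carry no information about the copied haplotype.
        if target[m] is None:
            return [1.0] * n_hap
        return [1.0 - eps if ref[k][m] == target[m] else eps
                for k in range(n_hap)]

    def transition(v):
        # Stay on the same haplotype with probability 1 - rho,
        # otherwise jump to a haplotype chosen uniformly at random.
        jump = rho * sum(v) / n_hap
        return [(1.0 - rho) * x + jump for x in v]

    def normalize(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every marker to avoid numerical underflow.
    fwd = [normalize([e / n_hap for e in emission(0)])]
    for m in range(1, n_mark):
        prev = transition(fwd[-1])
        fwd.append(normalize([e * p for e, p in zip(emission(m), prev)]))

    # Backward pass, rescaled in the same way.
    bwd = [[1.0] * n_hap for _ in range(n_mark)]
    for m in range(n_mark - 2, -1, -1):
        weighted = [e * b for e, b in zip(emission(m + 1), bwd[m + 1])]
        bwd[m] = normalize(transition(weighted))

    # Combine both passes into state posteriors, then average the
    # reference alleles under those posteriors.
    probs = []
    for m in range(n_mark):
        post = normalize([f * b for f, b in zip(fwd[m], bwd[m])])
        probs.append(sum(p * ref[k][m] for k, p in enumerate(post)))
    return probs


# Hypothetical panel and target, for illustration only.
panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
]
dosage = li_stephens_impute(panel, [1, None, 0, None, 0, None])
print([round(p, 3) for p in dosage])
```

Because the output is a probability rather than a hard call, downstream association tests can weight imputed genotypes by their uncertainty. Note that in this sketch the values at typed markers are also posterior estimates under the assumed error rate, not the observed alleles themselves.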



  1. ^ a b Scheet, Paul; Stephens, Matthew (2006). "A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase". The American Journal of Human Genetics. 78 (4): 629–644. doi:10.1086/502802. PMC 1424677. PMID 16532393.
  2. ^ Marchini, J.; Howie, B. (2010). "Genotype imputation for genome-wide association studies". Nature Reviews Genetics. 11 (7): 499–511. doi:10.1038/nrg2796. PMID 20517342. S2CID 1465707.
  3. ^ a b Deng, T; Zhang, P; Garrick, D; Gao, H; Wang, L; Zhao, F (2021). "Comparison of Genotype Imputation for SNP Array and Low-Coverage Whole-Genome Sequencing Data". Frontiers in Genetics. 12: 704118. doi:10.3389/fgene.2021.704118. PMC 8762119. PMID 35046990.
  4. ^ Sousa da Mota, Bárbara; Rubinacci, Simone; Cruz Dávalos, Diana Ivette; G. Amorim, Carlos Eduardo; Sikora, Martin; Johannsen, Niels N.; Szmyt, Marzena H.; Włodarczak, Piotr; Szczepanek, Anita; Przybyła, Marcin M.; Schroeder, Hannes; Allentoft, Morten E.; Willerslev, Eske; Malaspinas, Anna-Sapfo; Delaneau, Olivier (20 June 2023). "Imputation of ancient human genomes". Nature Communications. 14 (1): 3660. Bibcode:2023NatCo..14.3660S. doi:10.1038/s41467-023-39202-0. PMC 10282092. PMID 37339987.
  5. ^ Li, Y; Willer, CJ; Ding, J; Scheet, P; Abecasis, GR (Dec 2010). "MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes". Genetic Epidemiology. 34 (8): 816–34. doi:10.1002/gepi.20533. PMC 3175618. PMID 21058334.
  6. ^ Howie, B; Fuchsberger, C; Stephens, M; Marchini, J; Abecasis, GR (Jul 22, 2012). "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing". Nature Genetics. 44 (8): 955–9. doi:10.1038/ng.2354. PMC 3696580. PMID 22820512.
  7. ^ Browning, Brian L.; Browning, Sharon R. (2009). "A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals". The American Journal of Human Genetics. 84 (2): 210–223. doi:10.1016/j.ajhg.2009.01.005. PMC 2668004. PMID 19200528.
  8. ^ Howie, Bryan; Fuchsberger, Christian; Stephens, Matthew; Marchini, Jonathan; Abecasis, Gonçalo R (22 July 2012). "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing". Nature Genetics. 44 (8): 955–959. doi:10.1038/ng.2354. PMC 3696580. PMID 22820512.
  9. ^ Delaneau, Olivier; Marchini, Jonathan; Zagury, Jean-François (4 December 2011). "A linear complexity phasing method for thousands of genomes". Nature Methods. 9 (2): 179–181. doi:10.1038/nmeth.1785. PMID 22138821. S2CID 13765612.
  10. ^ Durbin, Richard M.; Altshuler, David L.; Abecasis, Gonçalo R.; Bentley, David R.; Chakravarti, Aravinda; Clark, Andrew G.; Collins, Francis S. (28 October 2010). "A map of human genome variation from population-scale sequencing". Nature. 467 (7319): 1061–1073. Bibcode:2010Natur.467.1061T. doi:10.1038/nature09534. PMC 3042601. PMID 20981092.
  11. ^ "1000 Genomes - A Deep Catalog of Human Genetic Variation". Retrieved 17 July 2014.
  12. ^ Howie, Bryan; Donnelly, Peter; Marchini, Jonathan (2009). "A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies". PLOS Genetics. 5 (6): e1000529. doi:10.1371/journal.pgen.1000529. PMC 2689936. PMID 19543373.
  13. ^ De Marino, A; Mahmoud, AA; Bose, M; Bircan, KO; Terpolovsky, A; Bamunusinghe, V; Bohn, S; Khan, U; Novković, B; Yazdi, PG (2022). "A comparative analysis of current phasing and imputation software". PLOS ONE. 17 (10): e0260177. Bibcode:2022PLoSO..1760177D. doi:10.1371/journal.pone.0260177. PMC 9581364. PMID 36260643.