Polygenic score

From Wikipedia, the free encyclopedia

A polygenic score, also called a polygenic risk score, genetic risk score, or genome-wide score, is a number based on variation in multiple genetic loci and their associated weights (see regression analysis).[1][2] It serves as the best prediction for the trait that can be made when taking into account variation in multiple genetic variants.
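
As an illustration of the definition above, the following minimal sketch computes a score as the weighted sum of allele counts across loci. It assumes genotypes are coded as 0, 1, or 2 copies of the effect allele and that the weights are already available (for example, from a published GWAS). The function name and the toy numbers are invented for the example and do not come from any study.

```python
import numpy as np


def polygenic_score(genotypes, weights):
    """Return one score per individual: the weighted sum of allele counts.

    genotypes: (n_individuals, n_variants) array of 0/1/2 allele counts
               (this coding is an assumption of the sketch).
    weights:   (n_variants,) array of per-variant effect estimates.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if genotypes.shape[1] != weights.shape[0]:
        raise ValueError("one weight is required per variant")
    return genotypes @ weights


# Toy data (made up): 3 individuals, 4 variants.
genotypes = np.array([
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
])
weights = np.array([0.10, -0.05, 0.20, 0.02])

scores = polygenic_score(genotypes, weights)
print(scores)
```

In practice the resulting scores are often standardized within the study sample before being used as a predictor.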

Polygenic scores are widely employed in animal, plant, and behavioral genetics for predicting and understanding genetic architectures. In a genome-wide association study (GWAS), a polygenic score with substantially higher predictive performance than the genome-wide statistically significant hits alone indicates that the trait is affected by more variants than just those hits, and that larger sample sizes will yield more hits. A combination of low variance explained and high heritability, as measured by GCTA, twin studies, or other methods, indicates that a trait may be massively polygenic and affected by thousands of variants.

Once a polygenic score has been created that explains at least a few percent of a phenotype's variance, and can therefore be assumed to incorporate a significant fraction of the genetic variants affecting that phenotype, it can be used in several different ways:

  • as a lower bound to test whether heritability estimates may be biased;
  • as a measure of the genetic overlap of traits (genetic correlation), which might indicate, for example, shared genetic bases for groups of mental disorders;
  • as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to natural selection indicative of a soft selective sweep (for example, for intelligence, where the changes in frequency would be too small to detect on each individual hit but not on the overall polygenic score);
  • in Mendelian randomization (assuming no pleiotropy with relevant traits);
  • to detect and control for the presence of genetic confounds in outcomes (for example, the correlation of schizophrenia with poverty);
  • to investigate gene–environment interactions.

Polygenic scores are widely used in animal and plant breeding (where the approach is usually termed genomic prediction) because of their efficacy in improving livestock and crops.[3] Their use in human studies is increasing.[4][5]

Estimating weights

Weights are usually estimated using some form of regression analysis. Because the number of genomic variants (typically SNPs) usually exceeds the sample size, ordinary least squares (OLS) multiple regression cannot be used (the p > n problem[6][7]). Instead, researchers use other methods, including regressing the trait on each variant one at a time (the usual approach in studies with human data). To avoid weakening predictive power, polygenic scores can be constructed from several different sets of SNPs selected at various p-value thresholds, such as all genome-wide statistically significant hits, all SNPs with p < 0.05, or all SNPs with p < 0.50; the set with the greatest predictive performance is then used for further analysis. Especially for highly polygenic traits, the best polygenic score will tend to use most or all SNPs.[8]
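
The one-variant-at-a-time approach combined with threshold selection can be sketched roughly as follows. This is an illustrative outline on simulated data, not the procedure of any particular study: the function names, the simulated genotypes, the chosen thresholds, and the use of a normal approximation for the p-values are all assumptions of the example.

```python
import math
import numpy as np


def marginal_gwas(X, y):
    """Regress y on each variant separately (simple linear regression).

    Returns per-variant slopes and approximate two-sided p-values.
    Assumes no variant is constant in the sample.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)
    beta = Xc.T @ yc / sxx
    # Residual sum of squares of each single-variant regression.
    resid_ss = (yc ** 2).sum() - beta ** 2 * sxx
    se = np.sqrt(resid_ss / (n - 2) / sxx)
    z = beta / se
    # Normal approximation to the t distribution (reasonable for large n).
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, p


def best_threshold_score(X_train, y_train, X_val, y_val,
                         thresholds=(5e-8, 0.05, 0.5, 1.0)):
    """Build one score per p-value threshold and keep the best performer."""
    beta, p = marginal_gwas(X_train, y_train)
    results = {}
    for t in thresholds:
        keep = p <= t
        if not keep.any():
            continue  # no variant passes this threshold
        score = X_val[:, keep] @ beta[keep]
        if score.std() == 0:
            continue  # correlation is undefined for a constant score
        results[t] = np.corrcoef(score, y_val)[0, 1]
    best = max(results, key=results.get)
    return best, results


# Simulated data (purely illustrative): 600 individuals, 200 SNPs.
rng = np.random.default_rng(0)
n, m = 600, 200
maf = rng.uniform(0.1, 0.5, size=m)
X = rng.binomial(2, maf, size=(n, m)).astype(float)
true_beta = rng.normal(0.0, 0.1, size=m)
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

best, results = best_threshold_score(X[:400], y[:400], X[400:], y[400:])
print("correlation by threshold:", results)
print("best threshold:", best)
```

Because the threshold is chosen by comparing performance on the validation sample, the winning correlation is somewhat optimistic; an independent test sample would be needed for an unbiased estimate.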

The standard GWAS regression can be improved on using penalized regression methods such as the LASSO or ridge regression.[1] Penalized regression can be interpreted as placing informative priors on how many genetic variants are expected to affect a trait and on the distribution of their effect sizes. Bayesian counterparts exist for the LASSO and ridge regression, and other priors have been suggested and used; these can perform better in some circumstances.[9] A multi-dataset, multi-method study[7] found that, of 15 different methods compared across four datasets, minimum redundancy maximum relevance was the best-performing method, and that variable selection methods tended to outperform other methods. Variable selection methods do not use all the genomic variants available in a dataset, but attempt to select an optimal subset of variants to use. This leads to less overfitting but more bias (see bias-variance tradeoff).
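
To illustrate why penalization helps when variants outnumber individuals, the sketch below fits ridge regression weights using the closed-form solution in its dual form, which only requires solving an n × n system. The simulated data, the penalty value `lam`, and the function name are assumptions made for the example; real analyses typically choose the penalty by cross-validation.

```python
import numpy as np


def ridge_weights(X, y, lam):
    """Ridge regression weights for centered data, computed in dual form.

    Uses the identity (X'X + lam*I)^-1 X'y = X'(XX' + lam*I)^-1 y,
    so only an n x n system is solved even when p > n.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n = Xc.shape[0]
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), yc)
    return Xc.T @ alpha


# Simulated p > n setting (illustrative): 50 individuals, 500 variants.
rng = np.random.default_rng(1)
n, p = 50, 500
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = X[:, :10] @ rng.normal(0.0, 0.5, size=10) + rng.normal(0.0, 1.0, size=n)

# OLS has no unique solution here: the design matrix has rank below p.
rank = np.linalg.matrix_rank(X - X.mean(axis=0))
print("rank of centered X:", rank, "number of variants:", p)

lam = 10.0  # penalty strength, chosen arbitrarily for the sketch
w = ridge_weights(X, y, lam)
print("number of weights:", w.shape[0])
```

Ridge regression shrinks all weights toward zero but keeps every variant, whereas the LASSO sets some weights exactly to zero and so also performs variable selection.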

Predictive validity

The benefit of polygenic scores is that they can be used to predict future outcomes. This has large practical benefits for animal breeding because it increases the precision of selection and allows for shorter generations, both of which speed up evolution.[10][3] In humans, polygenic scores can be used to predict future disease susceptibility and for embryo selection.[4][11]

Some accuracy values are given below for comparison. They are expressed as correlations; values reported as explained variance in the source have been converted.
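
The conversion between the two formats is a square root: a score that explains a proportion R² of the variance corresponds to a correlation of r = √R². A small sketch follows; the function names are invented for the example, and it assumes the correlation is non-negative.

```python
import math


def correlation_from_variance_explained(r2):
    """Convert a proportion of variance explained (0 to 1) to a correlation."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("variance explained must be between 0 and 1")
    return math.sqrt(r2)


def variance_explained_from_correlation(r):
    """Convert a correlation to the proportion of variance explained."""
    return r * r


# A score explaining 9% of the variance corresponds to r of about 0.30.
r = correlation_from_variance_explained(0.09)
print(round(r, 2))
```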

In humans

  • In 2016, r ≈ 0.30 for educational attainment variation at age 16.[5] This polygenic score was based on a GWAS using data from 293,000 individuals.[12]
  • In 2016, r ≈ 0.31 for case/control status for first-episode psychosis.[13]

In non-human animals

  • In 2016, r ≈ 0.30 for variation in milk fat%.[14]
  • In 2014, r ≈ 0.18 to 0.46 for various measures of meat yield, carcass value etc.[15]

In plants

  • In 2015, r ≈ 0.55 for total root length in maize (Zea mays L.).[16]
  • In 2014, r ≈ 0.03 to 0.99 across four traits in barley.[17]


  1. ^ a b de Vlaming, Ronald; Groenen, Patrick J. F. (2015). "The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics". BioMed Research International. 2015: 143712. doi:10.1155/2015/143712. PMC 4529984. PMID 26273586.
  2. ^ Dudbridge, Frank (2013-03-21). "Power and Predictive Accuracy of Polygenic Risk Scores". PLOS Genet. 9 (3): e1003348. doi:10.1371/journal.pgen.1003348. ISSN 1553-7404. PMC 3605113. PMID 23555274.
  3. ^ a b Spindel, Jennifer E.; McCouch, Susan R. (2016-09-01). "When more is better: how data sharing would accelerate genomic selection of crop plants". New Phytologist. 212 (4): 814–826. doi:10.1111/nph.14174. ISSN 1469-8137. PMID 27716975.
  4. ^ a b Spiliopoulou, Athina; Nagy, Reka; Bermingham, Mairead L.; Huffman, Jennifer E.; Hayward, Caroline; Vitart, Veronique; Rudan, Igor; Campbell, Harry; Wright, Alan F. (2015-07-15). "Genomic prediction of complex human traits: relatedness, trait architecture and predictive meta-models". Human Molecular Genetics. 24 (14): 4167–4182. doi:10.1093/hmg/ddv145. ISSN 0964-6906. PMC 4476450. PMID 25918167.
  5. ^ a b Selzam, S.; Krapohl, E.; von Stumm, S.; O'Reilly, P. F.; Rimfeld, K.; Kovas, Y.; Dale, P. S.; Lee, J. J.; Plomin, R. (2016-07-19). "Predicting educational achievement from DNA". Molecular Psychiatry. 22 (2): 267–272. doi:10.1038/mp.2016.107. ISSN 1476-5578. PMC 5285461. PMID 27431296.
  6. ^ James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. ISBN 978-1461471370.
  7. ^ a b Haws, David C.; Rish, Irina; Teyssedre, Simon; He, Dan; Lozano, Aurelie C.; Kambadur, Prabhanjan; Karaman, Zivan; Parida, Laxmi (2015-10-06). "Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods". PLOS ONE. 10 (10): e0138903. doi:10.1371/journal.pone.0138903. ISSN 1932-6203. PMC 4595020. PMID 26439851.
  8. ^ Ware et al 2017, "Heterogeneity in polygenic scores for common human traits"
  9. ^ Gianola & Rosa 2015, "One Hundred Years of Statistical Developments in Animal Breeding"
  10. ^ Heslot, Nicolas; Jannink, Jean-Luc; Sorrells, Mark E. (2015-01-02). "Perspectives for Genomic Selection Applications and Research in Plants". Crop Science. 55 (1): 1. doi:10.2135/cropsci2014.03.0249. ISSN 0011-183X.
  11. ^ Shulman, Carl; Bostrom, Nick (2014-02-01). "Embryo Selection for Cognitive Enhancement: Curiosity or Game-changer?". Global Policy. 5 (1): 85–92. CiteSeerX doi:10.1111/1758-5899.12123. ISSN 1758-5899.
  12. ^ Okbay, Aysu; Beauchamp, Jonathan P.; Fontana, Mark Alan; Lee, James J.; Pers, Tune H.; Rietveld, Cornelius A.; Turley, Patrick; Chen, Guo-Bo; Emilsson, Valur (2016). "Genome-wide association study identifies 74 loci associated with educational attainment". Nature. 533 (7604): 539–542. doi:10.1038/nature17671. PMC 4883595. PMID 27225129.
  13. ^ Vassos, Evangelos; Forti, Marta Di; Coleman, Jonathan; Iyegbe, Conrad; Prata, Diana; Euesden, Jack; O’Reilly, Paul; Curtis, Charles; Kolliakou, Anna (2017-03-15). "An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis". Biological Psychiatry. 81 (6): 470–477. doi:10.1016/j.biopsych.2016.06.028. PMID 27765268.
  14. ^ Hayr, M. K.; Druet, T.; Garrick, D. J. (2016-04-01). "027 Performance of genomic prediction using haplotypes in New Zealand dairy cattle". Journal of Animal Science. 94 (supplement2): 13. doi:10.2527/msasas2016-027. ISSN 1525-3163.
  15. ^ Chen, L.; Vinsky, M.; Li, C. (2015-02-01). "Accuracy of predicting genomic breeding values for carcass merit traits in Angus and Charolais beef cattle". Animal Genetics. 46 (1): 55–59. doi:10.1111/age.12238. ISSN 1365-2052. PMID 25393962.
  16. ^ Pace, Jordon; Yu, Xiaoqing; Lübberstedt, Thomas (2015-09-01). "Genomic prediction of seedling root length in maize (Zea mays L.)". The Plant Journal. 83 (5): 903–912. doi:10.1111/tpj.12937. ISSN 1365-313X. PMID 26189993.
  17. ^ Sallam, A. H.; Endelman, J. B.; Jannink, J.-L.; Smith, K. P. (2015-03-01). "Assessing Genomic Selection Prediction Accuracy in a Dynamic Barley Breeding Population". The Plant Genome. 8 (1): 0. doi:10.3835/plantgenome2014.05.0020. ISSN 1940-3372.

Further reading