
Polygenic score

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Gwern (talk | contribs) at 00:16, 17 November 2016 (expand). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

A polygenic score, also called a polygenic risk score, genetic risk score, or genome-wide score, is a number based on variation in multiple genetic loci and their associated weights (see regression analysis).[1][2] It serves as the best prediction for the trait that can be made when taking into account variation in multiple genetic variants.
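As a minimal sketch of the definition above, assume the common additive form in which a score is the weighted sum of an individual's allele counts. The function name `polygenic_score`, the 0/1/2 genotype coding, and the toy weights are illustrative assumptions, not taken from the cited sources.

```python
def polygenic_score(allele_counts, weights):
    """Return the weighted sum of allele counts for one individual.

    allele_counts: copies of the effect allele at each locus (assumed 0, 1 or 2).
    weights: per-locus effect sizes, e.g. regression coefficients.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per locus")
    return sum(g * w for g, w in zip(allele_counts, weights))


# Hypothetical toy data: three loci, two individuals.
weights = [0.5, -0.2, 0.1]
individuals = [[0, 1, 2], [2, 0, 1]]
scores = [polygenic_score(g, weights) for g in individuals]
```

Each individual receives a single number, so individuals can be ranked or compared on the score even though thousands of loci may contribute to it.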

Polygenic scores are widely employed in animal, plant, and behavioral genetics for prediction and for understanding genetic architectures. In a GWAS, a polygenic score that predicts substantially better than the genome-wide statistically-significant hits alone indicates that the trait is affected by more variants than just the hits, and that larger sample sizes will yield more hits; a conjunction of low variance explained and high heritability, as measured by GCTA, twin studies or other methods, indicates that a trait may be massively polygenic and affected by thousands of variants. Once a polygenic score has been created that explains at least a few percent of variance and effectively identifies most of the genetic variants affecting a trait, it can be used to: serve as a lower bound for testing whether heritability estimates may be biased; measure the genetic overlap of traits (genetic correlation), which might indicate, for example, shared genetic bases for groups of mental disorders; measure group differences in a trait such as height; examine changes over time in a trait such as intelligence due to natural selection, indicative of a soft selective sweep (where the changes in frequency would be too small to detect on each individual hit but the polygenic score declines); perform Mendelian randomization (assuming no pleiotropy with relevant traits); detect and control for genetic confounds in outcomes (for example, the correlation of schizophrenia with poverty); and investigate gene–environment interactions.

Polygenic scores are widely used in animal and plant breeding (where the approach is usually termed genomic prediction) because of their practical value in breeding improved livestock and crops.[3] Their use in human studies is increasing.[4][5]

Estimating weights

Weights are usually estimated using some form of regression analysis. Because the number of genomic variants (typically SNPs) is usually larger than the sample size, OLS multiple regression cannot be used (the p > n problem[6][7]). Instead, researchers have used other methods, including regressing the trait on variants one at a time (the usual approach in studies with human data) and penalized regression methods such as LASSO or ridge regression.[1] (Penalized regression can be interpreted as placing priors on how many genetic variants are expected to affect a trait and on the distribution of their effect sizes; Bayesian counterparts exist for LASSO and ridge, and other priors have been suggested and used. They can perform better in some circumstances.[8]) A multi-dataset, multi-method study[7] found that, of 15 methods compared across four datasets, minimum redundancy maximum relevance performed best. Furthermore, variable selection methods tended to outperform other methods. Variable selection methods do not use all the genomic variants available in a dataset, but attempt to select an optimal subset of variants to use. This leads to less overfitting but more bias (see bias-variance tradeoff).
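Two of the approaches named above can be sketched as follows, under stated assumptions. `marginal_weights` regresses the trait on one variant at a time, and `ridge_weights` fits ridge regression by coordinate descent, which remains well defined when variants outnumber individuals. The function names, the penalty `lam`, the iteration count and the toy data are illustrative assumptions, not the procedures of the cited studies.

```python
def marginal_weights(X, y):
    """One simple regression slope per variant (variants fitted separately)."""
    n = len(y)
    y_mean = sum(y) / n
    weights = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        x_mean = sum(col) / n
        sxx = sum((x - x_mean) ** 2 for x in col)
        sxy = sum((x - x_mean) * (yi - y_mean) for x, yi in zip(col, y))
        # Assumption: a variant with no variation gets weight 0.
        weights.append(sxy / sxx if sxx > 0 else 0.0)
    return weights


def ridge_weights(X, y, lam=1.0, n_iter=500):
    """Ridge regression on centered data, fitted by coordinate descent.

    Minimizes sum((y - X b)^2) + lam * sum(b^2); works even when p > n.
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[X[i][j] - col_means[j] for i in range(n)] for j in range(p)]
    resid = [yi - y_mean for yi in y]  # residual for the current beta
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            sxx = sum(x * x for x in col)
            # Inner product of variant j with the residual that excludes it.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(col, resid))
            new_beta = rho / (sxx + lam)
            delta = new_beta - beta[j]
            resid = [r - x * delta for x, r in zip(col, resid)]
            beta[j] = new_beta
    return beta


# Hypothetical toy data: 3 individuals, 4 variants (p > n).
X = [[0, 1, 2, 1], [1, 1, 0, 2], [2, 0, 1, 0]]
y = [1.0, 2.0, 3.0]
marginal = marginal_weights(X, y)
ridge = ridge_weights(X, y, lam=1.0)
```

The penalty `lam` controls how strongly the weights are shrunk toward zero; in practice it would be chosen by cross-validation rather than fixed as it is here.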

Predictive validity

The benefit of polygenic scores is that they can be used to predict a trait before it is observed. This has large practical benefits for animal breeding because it increases the precision of selection and allows for shorter generation intervals, both of which speed up evolution.[9][3] In humans, polygenic scores can be used to predict future disease susceptibility and for embryo selection.[4][10]

Some accuracy values are given below for comparison purposes. These are given in terms of correlations and have been converted from explained variance if given in that format in the source.
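The conversion mentioned above can be sketched as follows, assuming that "explained variance" in a source means the squared correlation between score and trait, so that the correlation is its square root. The function name `variance_explained_to_r` and the example value are illustrative assumptions.

```python
import math


def variance_explained_to_r(r_squared):
    """Convert a proportion of variance explained (0 to 1) to a correlation."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("variance explained must lie between 0 and 1")
    return math.sqrt(r_squared)


# Hypothetical example: a score explaining 9% of variance
# corresponds to a correlation of about 0.30.
r = variance_explained_to_r(0.09)
```

Because of the square root, modest-looking variance figures correspond to noticeably larger correlations, which is worth keeping in mind when comparing values reported in the two formats.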

In humans

  • In 2016, r ≈ 0.30 for educational attainment variation at age 16.[5] This polygenic score was based on a GWAS using data from 293k persons.[11]
  • In 2016, r ≈ 0.31 for case/control status for first-episode psychosis.[12]

In non-human animals

  • In 2016, r ≈ 0.30 for variation in milk fat%.[13]
  • In 2014, r ≈ 0.18 to 0.46 for various measures of meat yield, carcass value etc.[14]

In plants

  • In 2015, r ≈ 0.55 for total root length in maize (Zea mays L.).[15]
  • In 2014, r ≈ 0.03 to 0.99 across four traits in barley.[16]

References

  1. ^ a b de Vlaming, Ronald; Groenen, Patrick J. F. (2015). "The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics". BioMed Research International. 2015: 1–18. doi:10.1155/2015/143712.
  2. ^ Dudbridge, Frank (2013-03-21). "Power and Predictive Accuracy of Polygenic Risk Scores". PLOS Genet. 9 (3): e1003348. doi:10.1371/journal.pgen.1003348. ISSN 1553-7404. PMC 3605113. PMID 23555274.
  3. ^ a b Spindel, Jennifer E.; McCouch, Susan R. (2016-09-01). "When more is better: how data sharing would accelerate genomic selection of crop plants". New Phytologist: n/a–n/a. doi:10.1111/nph.14174. ISSN 1469-8137.
  4. ^ a b Spiliopoulou, Athina; Nagy, Reka; Bermingham, Mairead L.; Huffman, Jennifer E.; Hayward, Caroline; Vitart, Veronique; Rudan, Igor; Campbell, Harry; Wright, Alan F. (2015-07-15). "Genomic prediction of complex human traits: relatedness, trait architecture and predictive meta-models". Human Molecular Genetics. 24 (14): 4167–4182. doi:10.1093/hmg/ddv145. ISSN 0964-6906. PMC 4476450. PMID 25918167.
  5. ^ a b Selzam, S.; Krapohl, E.; von Stumm, S.; O'Reilly, P. F.; Rimfeld, K.; Kovas, Y.; Dale, P. S.; Lee, J. J.; Plomin, R. (2016-07-19). "Predicting educational achievement from DNA". Molecular Psychiatry. doi:10.1038/mp.2016.107. ISSN 1476-5578.
  6. ^ James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. ISBN 978-1461471370.
  7. ^ a b Haws, David C.; Rish, Irina; Teyssedre, Simon; He, Dan; Lozano, Aurelie C.; Kambadur, Prabhanjan; Karaman, Zivan; Parida, Laxmi (2015-10-06). "Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods". PLOS ONE. 10 (10): e0138903. doi:10.1371/journal.pone.0138903. ISSN 1932-6203. PMC 4595020. PMID 26439851.
  8. ^ Gianola & Rosa 2015, "One Hundred Years of Statistical Developments in Animal Breeding"
  9. ^ Heslot, Nicolas; Jannink, Jean-Luc; Sorrells, Mark E. (2015-01-02). "Perspectives for Genomic Selection Applications and Research in Plants". Crop Science. 55 (1). doi:10.2135/cropsci2014.03.0249. ISSN 0011-183X.
  10. ^ Shulman, Carl; Bostrom, Nick (2014-02-01). "Embryo Selection for Cognitive Enhancement: Curiosity or Game-changer?". Global Policy. 5 (1): 85–92. doi:10.1111/1758-5899.12123. ISSN 1758-5899.
  11. ^ Okbay, Aysu; Beauchamp, Jonathan P.; Fontana, Mark Alan; Lee, James J.; Pers, Tune H.; Rietveld, Cornelius A.; Turley, Patrick; Chen, Guo-Bo; Emilsson, Valur. "Genome-wide association study identifies 74 loci associated with educational attainment". Nature. 533 (7604): 539–542. doi:10.1038/nature17671. PMC 4883595. PMID 27225129.
  12. ^ Vassos, Evangelos; Forti, Marta Di; Coleman, Jonathan; Iyegbe, Conrad; Prata, Diana; Euesden, Jack; O’Reilly, Paul; Curtis, Charles; Kolliakou, Anna. "An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis". Biological Psychiatry. doi:10.1016/j.biopsych.2016.06.028.
  13. ^ Hayr, M. K.; Druet, T.; Garrick, D. J. (2016-04-01). "027 Performance of genomic prediction using haplotypes in New Zealand dairy cattle". Journal of Animal Science. 94 (supplement2). doi:10.2527/msasas2016-027. ISSN 1525-3163.
  14. ^ Chen, L.; Vinsky, M.; Li, C. (2015-02-01). "Accuracy of predicting genomic breeding values for carcass merit traits in Angus and Charolais beef cattle". Animal Genetics. 46 (1): 55–59. doi:10.1111/age.12238. ISSN 1365-2052.
  15. ^ Pace, Jordon; Yu, Xiaoqing; Lübberstedt, Thomas (2015-09-01). "Genomic prediction of seedling root length in maize (Zea mays L.)". The Plant Journal. 83 (5): 903–912. doi:10.1111/tpj.12937. ISSN 1365-313X.
  16. ^ Sallam, A. H.; Endelman, J. B.; Jannink, J.-L.; Smith, K. P. (2015-03-01). "Assessing Genomic Selection Prediction Accuracy in a Dynamic Barley Breeding Population". The Plant Genome. 8 (1). doi:10.3835/plantgenome2014.05.0020. ISSN 1940-3372.

Further reading