Polygenic score

From Wikipedia, the free encyclopedia

Revision as of 18:47, 13 June 2019

A polygenic score, also called a polygenic risk score, genetic risk score, or genome-wide score, is a number based on variation in multiple genetic loci and their associated weights (see regression analysis).[1][2] It serves as the best prediction for the trait that can be made when taking into account variation in multiple genetic variants.

Polygenic scores are widely employed in animal, plant, and behavioral genetics for predicting and understanding genetic architectures. In a genome-wide association study (GWAS), a polygenic score with substantially higher predictive performance than the genome-wide statistically significant hits indicates that the trait in question is affected by more variants than just the hits, and that larger sample sizes will yield more hits. A conjunction of low variance explained and high heritability, as measured by GCTA, twin studies or other methods, indicates that a trait may be massively polygenic and affected by thousands of variants. Once a polygenic score has been created that explains at least a few percent of a phenotype's variance, and can therefore be assumed to incorporate a significant fraction of the genetic variants affecting that phenotype, it can be used in several ways: as a lower bound to test whether heritability estimates may be biased; as a measure of genetic overlap between traits (genetic correlation), which might indicate, for example, shared genetic bases for groups of mental disorders; as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to natural selection indicative of a soft selective sweep (for example, for intelligence, where the changes in frequency would be too small to detect on each individual hit but not on the overall polygenic score); in Mendelian randomization (assuming no pleiotropy with relevant traits); to detect and control for the presence of genetic confounds in outcomes (for example, the correlation of schizophrenia with poverty); or to investigate gene–environment interactions.

Polygenic scores are widely used in animal breeding and plant breeding (usually termed genomic prediction or genomic selection) due to their efficacy in improving livestock breeding and crops.[3] Their use in human studies is increasing.[4][5]

History

One of the first precursors to the modern polygenic score was proposed under the term marker-assisted selection (MAS) in 1990.[6] According to MAS, breeders are able to increase the efficiency of artificial selection by estimating the regression coefficients of genetic markers that are correlated with differences in the trait of interest and assigning individual animals a "score" from this information. A major development of these fundamentals was proposed in 2001 by researchers who discovered that the use of a Bayesian prior could help to mitigate the problem of the number of markers being greater than the sample of animals.[7]

These methods were first applied to humans in the late 2000s, starting with a proposal in 2007 that these scores could be used in human genetics to identify individuals at high risk for disease.[8] This was successfully applied in empirical research for the first time in 2009 by researchers who organized a genome-wide association study (GWAS) of schizophrenia to construct scores of risk propensity. This study was also the first to use the term polygenic score for a prediction drawn from a linear combination of single-nucleotide polymorphism (SNP) genotypes, which was able to explain 3% of the variance in schizophrenia.[9]

Years of education was the first human cognitive phenotype to be successfully studied in a GWAS. The most recent study of this phenotype was the largest GWAS yet conducted as of 2018, with polygenic scores constructed from 1.1 million participants able to predict upwards of 10% of the variance in various cognitive traits.[10]

Methods of construction

A polygenic score (PGS) is constructed from the "weights" derived from a genome-wide association study (GWAS). In a GWAS, a set of genetic markers (usually SNPs) is genotyped on a training sample, and effect sizes are estimated for each marker's association with the trait of interest. These weights are then used to assign individualized polygenic scores in an independent replication sample.[2] The estimated score, $\hat{S}$, generally follows the form

$$\hat{S} = \sum_{j=1}^{m} X_j \hat{\beta}_j,$$

where the $\hat{S}$ of an individual is equal to the weighted sum of the individual's marker genotypes, $X_j$, at $m$ SNPs.[2] Weights are estimated using some form of regression analysis. Because the number of genomic variants is usually larger than the sample size, one cannot use OLS multiple regression (p > n problem[11][12]). Researchers have proposed various methodologies that deal with this problem as well as how to generate the weights of the SNPs, $\hat{\beta}_j$, and how to determine which SNPs should be included.
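The weighted sum described above can be illustrated with a minimal Python sketch. The function name, the 0/1/2 allele-count coding, and the toy numbers are assumptions made for illustration; they are not taken from the cited sources.

```python
def polygenic_score(genotypes, weights):
    """Return the weighted sum of one individual's marker genotypes.

    genotypes: allele counts at each SNP (assumed here to be coded 0, 1 or 2).
    weights:   one estimated effect size per SNP, in the same order.
    """
    if len(genotypes) != len(weights):
        raise ValueError("need exactly one weight per SNP")
    return sum(x * b for x, b in zip(genotypes, weights))


# Hypothetical toy data: three individuals genotyped at four SNPs.
toy_weights = [0.10, -0.05, 0.20, 0.00]
toy_genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]
toy_scores = [polygenic_score(row, toy_weights) for row in toy_genotypes]
```

In practice the weights would come from a GWAS on a training sample and the genotypes from an independent sample, as described above.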

Naïve methods

The simplest so-called "naïve" method of construction sets weights equal to the coefficient estimates from a regression of the trait on each genetic variant. The included SNPs may be selected using an algorithm that attempts to ensure that each marker is approximately independent. This matters because genetic variants are often correlated with other nearby variants, a common phenomenon called linkage disequilibrium that arises from the shared evolutionary history of neighboring genetic variants. Failing to account for this non-random association will typically reduce the score's predictive accuracy; for example, the weight of a causal variant will be attenuated if it is more strongly correlated with its neighbors than a null variant is. Further restriction can be achieved by testing different sets of SNPs selected at various thresholds, such as all SNPs that are genome-wide statistically significant hits, all SNPs with p < 0.05, or all SNPs with p < 0.50, and using the set with the greatest performance for further analysis; especially for highly polygenic traits, the best polygenic score will tend to use most or all SNPs.[13]
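The naïve weighting and the threshold comparison described above can be sketched as follows. This is a simplified illustration, not a published pipeline: the function names and toy data are hypothetical, the value 5e-8 is used as the conventional genome-wide significance level, and the step that selects approximately independent markers is omitted.

```python
import math


def marginal_slope(allele_counts, trait):
    """Naive weight: OLS slope from regressing the trait on a single SNP."""
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(allele_counts, trait))
    return sxy / sxx  # fails for a SNP with no variation in the sample


def pearson_r(a, b):
    """Pearson correlation, returning 0.0 when either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / math.sqrt(saa * sbb)


def thresholded_scores(genotype_rows, weights, p_values, threshold):
    """Score every individual using only SNPs with p-value below the threshold."""
    keep = [j for j, p in enumerate(p_values) if p < threshold]
    return [sum(row[j] * weights[j] for j in keep) for row in genotype_rows]


def best_threshold(genotype_rows, trait, weights, p_values, thresholds):
    """Return (threshold, r) for the threshold whose score correlates best with the trait."""
    results = [
        (t, pearson_r(thresholded_scores(genotype_rows, weights, p_values, t), trait))
        for t in thresholds
    ]
    return max(results, key=lambda pair: pair[1])


# Hypothetical example of a single naive weight from one SNP.
example_weight = marginal_slope([0, 1, 2, 1], [1.0, 3.0, 5.0, 3.0])

# Hypothetical validation sample: four individuals, three SNPs.
validation_genotypes = [[0, 1, 2], [1, 0, 1], [2, 2, 0], [1, 1, 1]]
validation_trait = [0.0, 1.0, 2.0, 1.0]
gwas_weights = [0.5, 0.1, -0.3]    # assumed to come from a separate training GWAS
gwas_p_values = [1e-9, 0.03, 0.4]  # assumed p-values from the same GWAS
chosen = best_threshold(
    validation_genotypes, validation_trait, gwas_weights, gwas_p_values,
    thresholds=[5e-8, 0.05, 0.50],
)
```

Because the threshold is chosen by its performance in the sample at hand, the resulting accuracy is best confirmed in a further independent sample.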

Bayesian methods

Bayesian approaches, first proposed in 2001,[7] attempt to explicitly model preexisting genetic architecture, thereby accounting for the distribution of effect sizes with a prior that should improve the accuracy of a polygenic score. One of the most popular modern Bayesian methods uses "linkage disequilibrium prediction" (LDpred for short) to set the weight for each SNP equal to the mean of its posterior distribution after linkage disequilibrium has been accounted for. LDpred tends to outperform simpler methods of pruning and thresholding, especially at large sample sizes; for example, it improved the variance explained by a polygenic score for schizophrenia in a large data set from 20.1% to 25.3%.[14]
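The idea of using a posterior mean as a SNP weight can be illustrated with the simplest conjugate case. The sketch below is not LDpred: it ignores linkage disequilibrium entirely and assumes a single normal prior on the true effect, with a hypothetical function name and parameter names chosen for illustration.

```python
def posterior_mean_weight(beta_hat, standard_error, prior_variance):
    """Posterior mean of a SNP effect under a normal prior.

    Assumed model (illustrative only):
        true effect      beta     ~ Normal(0, prior_variance)
        GWAS estimate    beta_hat ~ Normal(beta, standard_error ** 2)
    The posterior mean shrinks the estimate toward zero by a factor
    between 0 and 1 that depends on how noisy the estimate is.
    """
    if prior_variance < 0 or standard_error <= 0:
        raise ValueError("prior_variance must be >= 0 and standard_error > 0")
    shrinkage = prior_variance / (prior_variance + standard_error ** 2)
    return shrinkage * beta_hat


# Hypothetical numbers: a noisy estimate is shrunk by half.
shrunk = posterior_mean_weight(beta_hat=0.2, standard_error=0.1, prior_variance=0.01)
```

Estimates with large standard errors are shrunk strongly, while precisely estimated effects are left almost unchanged.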

Penalized regression

Penalized regression methods, such as LASSO and ridge regression, can also be used to improve the accuracy of polygenic scores. Penalized regression can be interpreted as placing informative prior probabilities on how many genetic variants are expected to affect a trait, and on the distribution of their effect sizes. In other words, these methods in effect "penalize" the large coefficients in a regression model and shrink them conservatively. Ridge regression accomplishes this by shrinking the coefficients with a term that penalizes the sum of the squared coefficients.[1] LASSO accomplishes something similar by penalizing the sum of absolute coefficients.[15] Bayesian counterparts exist for LASSO and ridge regression, and other priors have been suggested and used. They can perform better in some circumstances.[16] A multi-dataset, multi-method study[12] found that of 15 different methods compared across four datasets, minimum redundancy maximum relevance was the best performing method. Furthermore, variable selection methods tended to outperform other methods. Variable selection methods do not use all the available genomic variants present in a dataset, but attempt to select an optimal subset of variants to use. This leads to less overfitting but more bias (see bias-variance tradeoff).
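The two penalties described above can be written out explicitly. As a sketch, assume a quantitative trait $y_i$ measured on $n$ individuals, genotypes $X_{ij}$ at $m$ SNPs, and a tuning parameter $\lambda \ge 0$ that sets the strength of the penalty. This notation is introduced here for illustration and is not taken from the cited sources.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}
  \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} X_{ij}\,\beta_j \Big)^{2}
  + \lambda \sum_{j=1}^{m} \beta_j^{2}

\hat{\beta}^{\text{LASSO}} = \arg\min_{\beta}
  \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} X_{ij}\,\beta_j \Big)^{2}
  + \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert
```

The squared penalty shrinks all coefficients toward zero without generally setting any of them exactly to zero, whereas the absolute-value penalty can set some coefficients exactly to zero and so also acts as a form of variable selection.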

Predictive validity

The benefit of polygenic scores is that they can be used to predict future outcomes. This has large practical benefits for animal breeding because it increases the precision of selection and allows for shorter generations, both of which speed up evolution.[17][3] For humans, polygenic scores can be used to predict future disease susceptibility and for embryo selection.[4][18]

Some accuracy values are given below for comparison purposes. These are given in terms of correlations and have been converted from explained variance if given in that format in the source.
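The conversion mentioned above, from explained variance to a correlation, is a square root. A minimal sketch, with a hypothetical function name:

```python
import math


def variance_explained_to_r(variance_explained):
    """Convert a proportion of variance explained (R squared) to a correlation r.

    The input is a proportion between 0 and 1, not a percentage.
    The sign of the correlation is taken to be positive.
    """
    if not 0.0 <= variance_explained <= 1.0:
        raise ValueError("variance_explained must be a proportion in [0, 1]")
    return math.sqrt(variance_explained)


# A score explaining 9% of the variance corresponds to r of about 0.30.
example_r = variance_explained_to_r(0.09)
```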

In humans

  • In 2016, r ≈ 0.30 for educational attainment variation at age 16.[5] This polygenic score was based on a GWAS using data from 293,000 persons.[19]
  • In 2016, r ≈ 0.31 for case/control status for first-episode psychosis.[20]

In non-human animals

  • In 2016, r ≈ 0.30 for variation in milk fat%.[21]
  • In 2014, r ≈ 0.18 to 0.46 for various measures of meat yield, carcass value etc.[22]

In plants

  • In 2015, r ≈ 0.55 for total root length in maize (Zea mays L.).[23]
  • In 2014, r ≈ 0.03 to 0.99 across four traits in barley.[24]

References

  1. ^ a b de Vlaming, Ronald; Groenen, Patrick J. F. (2015). "The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics". BioMed Research International. 2015: 143712. doi:10.1155/2015/143712. PMC 4529984. PMID 26273586.
  2. ^ a b c Dudbridge, Frank (2013). "Power and predictive accuracy of polygenic risk scores". PLOS Genetics. 9 (3): e1003348. doi:10.1371/journal.pgen.1003348. ISSN 1553-7404. PMC 3605113. PMID 23555274.
  3. ^ a b Spindel, Jennifer E.; McCouch, Susan R. (2016-09-01). "When more is better: how data sharing would accelerate genomic selection of crop plants". New Phytologist. 212 (4): 814–826. doi:10.1111/nph.14174. ISSN 1469-8137. PMID 27716975.
  4. ^ a b Spiliopoulou, Athina; Nagy, Reka; Bermingham, Mairead L.; Huffman, Jennifer E.; Hayward, Caroline; Vitart, Veronique; Rudan, Igor; Campbell, Harry; Wright, Alan F. (2015-07-15). "Genomic prediction of complex human traits: relatedness, trait architecture and predictive meta-models". Human Molecular Genetics. 24 (14): 4167–4182. doi:10.1093/hmg/ddv145. ISSN 0964-6906. PMC 4476450. PMID 25918167.
  5. ^ a b Selzam, S.; Krapohl, E.; von Stumm, S.; O'Reilly, P. F.; Rimfeld, K.; Kovas, Y.; Dale, P. S.; Lee, J. J.; Plomin, R. (2016-07-19). "Predicting educational achievement from DNA". Molecular Psychiatry. 22 (2): 267–272. doi:10.1038/mp.2016.107. ISSN 1476-5578. PMC 5285461. PMID 27431296.
  6. ^ Lande, R.; Thompson, R. (1990). "Efficiency of marker-assisted selection in the improvement of quantitative traits". Genetics. 124 (3): 743–756. doi:10.1046/j.1365-2540.1998.00308.x.
  7. ^ a b Meuwissen, T. H. E.; Hayes, B. J.; Goddard, M. E. (2001). "Prediction of total genetic value using genome-wide dense marker maps". Genetics. 157 (4): 1819–1829.
  8. ^ Wray, N. R.; Goddard, M. E.; Visscher, P. M. (2007). "Prediction of individual genetic risk to disease from genome-wide association studies". Genome Research. 17 (10): 1520–1528. doi:10.1101/gr.6665407.
  9. ^ Purcell, S. M.; et al. (2009). "Common polygenic variation contributes to risk of schizophrenia and bipolar disorder". Nature. 460 (August): 748–752. doi:10.1038/nature08185.
  10. ^ Lee, J. J.; et al. (2018). "Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals". Nature Genetics. 50 (8): 1112–1121. doi:10.1038/s41588-018-0147-3.
  11. ^ James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. ISBN 978-1461471370.
  12. ^ a b Haws, David C.; Rish, Irina; Teyssedre, Simon; He, Dan; Lozano, Aurelie C.; Kambadur, Prabhanjan; Karaman, Zivan; Parida, Laxmi (2015-10-06). "Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods". PLOS ONE. 10 (10): e0138903. doi:10.1371/journal.pone.0138903. ISSN 1932-6203. PMC 4595020. PMID 26439851.
  13. ^ Ware, E. B.; et al. (2017). "Heterogeneity in polygenic scores for common human traits". BioRxiv. doi:10.1101/106062.
  14. ^ Vilhjálmsson, B. J.; et al. (2015). "Modeling linkage disequilibrium increases accuracy of polygenic risk scores". American Journal of Human Genetics. 97 (4): 576–592. doi:10.1016/j.ajhg.2015.09.001.
  15. ^ Vattikuti, S.; Lee, J. J.; Chang, C. C.; Hsu, S. D. H.; Chow, C. C.; et al. (2014). "Applying compressed sensing to genome-wide association studies". GigaScience. 3 (1). doi:10.1186/2047-217X-3-10.
  16. ^ Gianola & Rosa 2015, "One Hundred Years of Statistical Developments in Animal Breeding"
  17. ^ Heslot, Nicolas; Jannink, Jean-Luc; Sorrells, Mark E. (2015-01-02). "Perspectives for Genomic Selection Applications and Research in Plants". Crop Science. 55 (1): 1. doi:10.2135/cropsci2014.03.0249. ISSN 0011-183X.
  18. ^ Shulman, Carl; Bostrom, Nick (2014-02-01). "Embryo Selection for Cognitive Enhancement: Curiosity or Game-changer?". Global Policy. 5 (1): 85–92. CiteSeerX 10.1.1.428.8837. doi:10.1111/1758-5899.12123. ISSN 1758-5899.
  19. ^ Okbay, Aysu; Beauchamp, Jonathan P.; Fontana, Mark Alan; Lee, James J.; Pers, Tune H.; Rietveld, Cornelius A.; Turley, Patrick; Chen, Guo-Bo; Emilsson, Valur (2016). "Genome-wide association study identifies 74 loci associated with educational attainment". Nature. 533 (7604): 539–542. doi:10.1038/nature17671. PMC 4883595. PMID 27225129.
  20. ^ Vassos, Evangelos; Forti, Marta Di; Coleman, Jonathan; Iyegbe, Conrad; Prata, Diana; Euesden, Jack; O’Reilly, Paul; Curtis, Charles; Kolliakou, Anna (2017-03-15). "An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis". Biological Psychiatry. 81 (6): 470–477. doi:10.1016/j.biopsych.2016.06.028. PMID 27765268.
  21. ^ Hayr, M. K.; Druet, T.; Garrick, D. J. (2016-04-01). "027 Performance of genomic prediction using haplotypes in New Zealand dairy cattle". Journal of Animal Science. 94 (supplement2): 13. doi:10.2527/msasas2016-027. ISSN 1525-3163.
  22. ^ Chen, L.; Vinsky, M.; Li, C. (2015-02-01). "Accuracy of predicting genomic breeding values for carcass merit traits in Angus and Charolais beef cattle". Animal Genetics. 46 (1): 55–59. doi:10.1111/age.12238. ISSN 1365-2052. PMID 25393962.
  23. ^ Pace, Jordon; Yu, Xiaoqing; Lübberstedt, Thomas (2015-09-01). "Genomic prediction of seedling root length in maize (Zea mays L.)". The Plant Journal. 83 (5): 903–912. doi:10.1111/tpj.12937. ISSN 1365-313X. PMID 26189993.
  24. ^ Sallam, A. H.; Endelman, J. B.; Jannink, J.-L.; Smith, K. P. (2015-03-01). "Assessing Genomic Selection Prediction Accuracy in a Dynamic Barley Breeding Population". The Plant Genome. 8 (1): 0. doi:10.3835/plantgenome2014.05.0020. ISSN 1940-3372.

Further reading