Polygenic score: Difference between revisions

From Wikipedia, the free encyclopedia
Added some explanatory language and extended the introduction with machine learning and current state in the field; added sections "Validation methods", "Research on clinical applications", "Genetic Architecture"; added many references.
Line 5: Line 5:
{{short description|Numerical score aimed at predicting a trait based on variation in multiple genetic loci}}
[[File:PRS Illustration.png|thumb|301x301px|An illustration of the distribution and stratification ability of a polygenic risk score]]
In [[genetics]], a '''polygenic score''', also called a '''polygenic risk score (PRS)''', '''genetic risk score''', or '''genome-wide score''', is a number that summarises the estimated effect of many genetic variants on an individual's phenotype, typically calculated as a weighted sum of trait-associated alleles.<ref name="Dudbridge">{{cite journal | vauthors = Dudbridge F | title = Power and predictive accuracy of polygenic risk scores | journal = PLOS Genetics | volume = 9 | issue = 3 | pages = e1003348 | date = March 2013 | pmid = 23555274 | pmc = 3605113 | doi = 10.1371/journal.pgen.1003348 }}</ref><ref name=":4">{{cite journal | vauthors = Torkamani A, Wineinger NE, Topol EJ | s2cid = 46893131 | title = The personal and clinical utility of polygenic risk scores | journal = Nature Reviews. Genetics | volume = 19 | issue = 9 | pages = 581–590 | date = September 2018 | pmid = 29789686 | doi = 10.1038/s41576-018-0018-x }}</ref><ref>{{cite journal | vauthors = Lambert SA, Abraham G, Inouye M | title = Towards clinical utility of polygenic risk scores | journal = Human Molecular Genetics | volume = 28 | issue = R2 | pages = R133–R142 | date = November 2019 | pmid = 31363735 | doi = 10.1093/hmg/ddz187 | doi-access = free }}</ref> It reflects an individual's estimated genetic predisposition for a given trait and can be used as a predictor for that trait.<ref name="Vlaming2015">{{cite journal | vauthors = de Vlaming R, Groenen PJ | title = The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics | journal = BioMed Research International | volume = 2015 | pages = 143712 | date = 2015 | pmid = 26273586 | pmc = 4529984 | doi = 10.1155/2015/143712 }}</ref><ref name="pmid29132412">{{cite journal | vauthors = Lewis CM, Vassos E | title = Prospects for using risk scores in polygenic medicine | journal = Genome Medicine | volume = 9 | issue = 1 | pages = 96 | date = November 2017 | pmid = 29132412 | pmc = 5683372 | doi = 10.1186/s13073-017-0489-y 
}}</ref><ref name="pmid30104762">{{cite journal | vauthors = Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S | display-authors = 6 | title = Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations | journal = Nature Genetics | volume = 50 | issue = 9 | pages = 1219–1224 | date = September 2018 | pmid = 30104762 | pmc = 6128408 | doi = 10.1038/s41588-018-0183-z }}</ref><ref name="pmid31833054">{{cite journal | vauthors = Yanes T, Meiser B, Kaur R, Scheepers-Joynt M, McInerny S, Taylor S, Barlow-Stewart K, Antill Y, Salmon L, Smyth C, Young MA, James PA | display-authors = 6 | title = Uptake of polygenic risk information among women at increased risk of breast cancer | journal = Clinical Genetics | volume = 97 | issue = 3 | pages = 492–501 | date = March 2020 | pmid = 31833054 | doi = 10.1111/cge.13687 | s2cid = 209342044 | url = https://espace.library.uq.edu.au/view/UQ:5ce332e/UQ5ce332e_OA.pdf }}</ref><ref name="Vilhjálmsson_2015">{{cite journal | vauthors = Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, Genovese G, Loh PR, Bhatia G, Do R, Hayeck T, Won HH, Kathiresan S, Pato M, Pato C, Tamimi R, Stahl E, Zaitlen N, Pasaniuc B, Belbin G, Kenny EE, Schierup MH, De Jager P, Patsopoulos NA, McCarroll S, Daly M, Purcell S, Chasman D, Neale B, Goddard M, Visscher PM, Kraft P, Patterson N, Price AL | display-authors = 6 | title = Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores | journal = American Journal of Human Genetics | volume = 97 | issue = 4 | pages = 576–92 | date = October 2015 | pmid = 26430803 | pmc = 4596916 | doi = 10.1016/j.ajhg.2015.09.001 }}</ref> Polygenic scores are widely used in [[animal breeding]] and [[plant breeding]] (usually termed ''genomic prediction'' or ''genomic selection'') due to their efficacy in improving livestock breeding and crops.<ref name=":0">{{cite 
journal | vauthors = Spindel JE, McCouch SR | title = When more is better: how data sharing would accelerate genomic selection of crop plants | journal = The New Phytologist | volume = 212 | issue = 4 | pages = 814–826 | date = December 2016 | pmid = 27716975 | doi = 10.1111/nph.14174 | author2-link = Susan McCouch | doi-access = free }}</ref> They are also increasingly being used for risk prediction in humans for [[Genetic disorder#Multifactorial and polygenic .28complex.29 disorders|complex diseases]]<ref>{{cite web | first = Antonio | last = Regalado | name-list-style = vanc | date = 8 March 2019 | title = 23andMe thinks polygenic risk scores are ready for the masses, but experts aren't so sure|url=https://www.technologyreview.com/2019/03/08/136730/23andme-thinks-polygenic-risk-scores-are-ready-for-the-masses-but-experts-arent-so-sure/|access-date=2020-08-14|website=MIT Technology Review|language=en}}</ref> which are typically affected by many genetic variants that each confer a small effect on overall risk.<ref>{{cite journal | vauthors = Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J | title = 10 Years of GWAS Discovery: Biology, Function, and Translation | journal = American Journal of Human Genetics | volume = 101 | issue = 1 | pages = 5–22 | date = July 2017 | pmid = 28686856 | pmc = 5501872 | doi = 10.1016/j.ajhg.2017.06.005 }}</ref><ref name=":1">{{cite journal | vauthors = Spiliopoulou A, Nagy R, Bermingham ML, Huffman JE, Hayward C, Vitart V, Rudan I, Campbell H, Wright AF, Wilson JF, Pong-Wong R, Agakov F, Navarro P, Haley CS | display-authors = 6 | title = Genomic prediction of complex human traits: relatedness, trait architecture and predictive meta-models | journal = Human Molecular Genetics | volume = 24 | issue = 14 | pages = 4167–82 | date = July 2015 | pmid = 25918167 | pmc = 4476450 | doi = 10.1093/hmg/ddv145 }}</ref>
In [[genetics]], a '''polygenic score''', also called a '''polygenic risk score (PRS)''', '''genetic risk score''', or '''genome-wide score''', is a number that summarises the estimated effect of many genetic variants on an individual's [[phenotype]], typically calculated as a weighted sum of trait-associated [[allele|alleles]].<ref name="Dudbridge">{{cite journal | vauthors = Dudbridge F | title = Power and predictive accuracy of polygenic risk scores | journal = PLOS Genetics | volume = 9 | issue = 3 | pages = e1003348 | date = March 2013 | pmid = 23555274 | pmc = 3605113 | doi = 10.1371/journal.pgen.1003348 }}</ref><ref name=":4">{{cite journal | vauthors = Torkamani A, Wineinger NE, Topol EJ | s2cid = 46893131 | title = The personal and clinical utility of polygenic risk scores | journal = Nature Reviews. Genetics | volume = 19 | issue = 9 | pages = 581–590 | date = September 2018 | pmid = 29789686 | doi = 10.1038/s41576-018-0018-x }}</ref><ref>{{cite journal | vauthors = Lambert SA, Abraham G, Inouye M | title = Towards clinical utility of polygenic risk scores | journal = Human Molecular Genetics | volume = 28 | issue = R2 | pages = R133–R142 | date = November 2019 | pmid = 31363735 | doi = 10.1093/hmg/ddz187 | doi-access = free }}</ref> It reflects an individual's estimated genetic predisposition for a given trait and can be used as a predictor for that trait.<ref name="Vlaming2015">{{cite journal | vauthors = de Vlaming R, Groenen PJ | title = The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics | journal = BioMed Research International | volume = 2015 | pages = 143712 | date = 2015 | pmid = 26273586 | pmc = 4529984 | doi = 10.1155/2015/143712 }}</ref><ref name="pmid29132412">{{cite journal | vauthors = Lewis CM, Vassos E | title = Prospects for using risk scores in polygenic medicine | journal = Genome Medicine | volume = 9 | issue = 1 | pages = 96 | date = November 2017 | pmid = 29132412 | pmc = 5683372 | doi = 
10.1186/s13073-017-0489-y }}</ref><ref name="pmid30104762">{{cite journal | vauthors = Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S | display-authors = 6 | title = Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations | journal = Nature Genetics | volume = 50 | issue = 9 | pages = 1219–1224 | date = September 2018 | pmid = 30104762 | pmc = 6128408 | doi = 10.1038/s41588-018-0183-z }}</ref><ref name="pmid31833054">{{cite journal | vauthors = Yanes T, Meiser B, Kaur R, Scheepers-Joynt M, McInerny S, Taylor S, Barlow-Stewart K, Antill Y, Salmon L, Smyth C, Young MA, James PA | display-authors = 6 | title = Uptake of polygenic risk information among women at increased risk of breast cancer | journal = Clinical Genetics | volume = 97 | issue = 3 | pages = 492–501 | date = March 2020 | pmid = 31833054 | doi = 10.1111/cge.13687 | s2cid = 209342044 | url = https://espace.library.uq.edu.au/view/UQ:5ce332e/UQ5ce332e_OA.pdf }}</ref><ref name="Vilhjálmsson_2015">{{cite journal | vauthors = Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, Genovese G, Loh PR, Bhatia G, Do R, Hayeck T, Won HH, Kathiresan S, Pato M, Pato C, Tamimi R, Stahl E, Zaitlen N, Pasaniuc B, Belbin G, Kenny EE, Schierup MH, De Jager P, Patsopoulos NA, McCarroll S, Daly M, Purcell S, Chasman D, Neale B, Goddard M, Visscher PM, Kraft P, Patterson N, Price AL | display-authors = 6 | title = Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores | journal = American Journal of Human Genetics | volume = 97 | issue = 4 | pages = 576–92 | date = October 2015 | pmid = 26430803 | pmc = 4596916 | doi = 10.1016/j.ajhg.2015.09.001 }}</ref> In other words, it estimates how likely an individual is to have a given trait based on genetics alone, without taking environmental factors into account.
Polygenic scores are widely used in [[animal breeding]] and [[plant breeding]] (usually termed ''genomic prediction'' or ''genomic selection'') due to their efficacy in improving livestock breeding and crops.<ref name=":0">{{cite journal | vauthors = Spindel JE, McCouch SR | title = When more is better: how data sharing would accelerate genomic selection of crop plants | journal = The New Phytologist | volume = 212 | issue = 4 | pages = 814–826 | date = December 2016 | pmid = 27716975 | doi = 10.1111/nph.14174 | author2-link = Susan McCouch | doi-access = free }}</ref>
Recent progress in machine learning (ML) analysis of large genomic datasets has enabled the creation of polygenic predictors of complex human traits, including risk for many important [[Genetic disorder#Multifactorial and polygenic .28complex.29 disorders|complex diseases]],<ref>{{cite web | first = Antonio | last = Regalado | name-list-style = vanc | date = 8 March 2019 | title = 23andMe thinks polygenic risk scores are ready for the masses, but experts aren't so sure|url=https://www.technologyreview.com/2019/03/08/136730/23andme-thinks-polygenic-risk-scores-are-ready-for-the-masses-but-experts-arent-so-sure/|access-date=2020-08-14|website=MIT Technology Review|language=en}}</ref><ref name="Lello2019">{{cite journal
| vauthors = Lello L, Raben TG, Yong SY, Tellier LC, Hsu SD
| date = 2019-10-25
| title = Genomic Prediction of 16 Complex Disease Risks Including Heart Attack, Diabetes, Breast and Prostate Cancer
| url = https://www.nature.com/articles/s41598-019-51258-x
| journal = Scientific Reports
| volume = 9
| issue = 1
| pages = 15286
| doi = 10.1038/s41598-019-51258-x
| pmid = 31653892
| access-date = 2021-04-12
}}
</ref> which are typically affected by many genetic variants that each confer a small effect on overall risk.<ref>{{cite journal | vauthors = Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J | title = 10 Years of GWAS Discovery: Biology, Function, and Translation | journal = American Journal of Human Genetics | volume = 101 | issue = 1 | pages = 5–22 | date = July 2017 | pmid = 28686856 | pmc = 5501872 | doi = 10.1016/j.ajhg.2017.06.005 }}</ref><ref name=":1">{{cite journal | vauthors = Spiliopoulou A, Nagy R, Bermingham ML, Huffman JE, Hayward C, Vitart V, Rudan I, Campbell H, Wright AF, Wilson JF, Pong-Wong R, Agakov F, Navarro P, Haley CS | display-authors = 6 | title = Genomic prediction of complex human traits: relatedness, trait architecture and predictive meta-models | journal = Human Molecular Genetics | volume = 24 | issue = 14 | pages = 4167–82 | date = July 2015 | pmid = 25918167 | pmc = 4476450 | doi = 10.1093/hmg/ddv145 }}</ref> In a polygenic risk predictor, the lifetime (or age-range) risk for the disease is captured by a numerical function, the polygenic risk score (PRS), which depends on the states of thousands of individual genetic variants (i.e., [[Single-nucleotide polymorphism|single-nucleotide polymorphisms]], or SNPs).

Polygenic risk scores are an area of intense scientific investigation: hundreds of papers are written each year on topics such as learning algorithms for genomic prediction, training of new predictors, validation testing of predictors, and clinical application of PRS.<ref name="GeeWhiz">{{cite web
|url=https://www.economist.com/science-and-technology/2019/11/07/modern-genetics-will-improve-health-and-usher-in-designer-children
|title=Modern genetics will improve health and usher in “designer” children
|website=The Economist
|date=2019-11-09
|access-date=2021-04-12}}
</ref>
<ref>{{cite web
|url=https://www.theguardian.com/society/2018/oct/08/test-could-predict-risk-of-future-heart-disease-for-just-40
|title=Test could predict risk of future heart disease for just £40
|website=The Guardian
|date=2018-10-08
|access-date=2021-04-12
}}
</ref>
<ref name="Raben2021">{{cite arXiv
| vauthors = Raben TG, Lello L, Widen E, Hsu SD
| eprint = 2101.05870
| title = From Genotype to Phenotype: polygenic prediction of complex human traits
| class = q-bio
| date = 2021-01-14
}}
</ref>
<ref name="pmid30104762" /><ref name="Lello2019" /> In 2018, the American Heart Association named polygenic risk scores as one of the major breakthroughs in heart disease and stroke research.
<ref>{{cite web
|url=https://www.sciencedaily.com/releases/2019/06/190611081903.htm
|title=Big picture genetic scoring approach reliably predicts heart disease.
|website=Science Daily
|date=2019-06-11
|access-date=2021-04-12}}
</ref>



== History ==
Line 14: Line 62:


== Methods of construction ==
A polygenic score (PGS) is constructed from the "weights" derived from a [[genome-wide association study]] (GWAS). In a GWAS, a set of genetic markers (usually [[Single-nucleotide polymorphism|SNPs]]) is genotyped on a training sample, and effect sizes are estimated for each marker's association with the trait of interest. These weights are then used to assign individualized polygenic scores in an independent replication sample.<ref name="Dudbridge" /> The estimated score, <math>\hat{S}</math>, generally follows the form
A polygenic score (PGS) is constructed from the "weights" derived from a [[genome-wide association study]] (GWAS), or from some form of machine learning algorithm. In a GWAS, a set of genetic markers (usually [[Single-nucleotide polymorphism|SNPs]]) is genotyped on a training sample, and effect sizes are estimated for each marker's association with the trait of interest. These weights are then used to assign individualized polygenic scores in an independent replication sample.<ref name="Dudbridge" /> The estimated score, <math>\hat{S}</math>, generally follows the form
:<math>\hat{S} = \sum_{j=1}^{m} X_{j} \hat{\beta}_{j}</math>,
where the <math>\hat{S}</math> of an individual is equal to the weighted sum of the individual's marker genotypes, <math>X_j</math>, at <math>{m}</math> SNPs.<ref name="Dudbridge" /> Weights are estimated using some form of [[regression analysis]]. Because the number of genomic variants is usually larger than the sample size, one cannot use [[Ordinary least squares|OLS multiple regression]] (''p'' > ''n'' problem<ref>{{Cite book|url=http://www-bcf.usc.edu/~gareth/ISL/index.html|title=An Introduction to Statistical Learning: with Applications in R|last=James|first=Gareth | name-list-style = vanc |publisher=Springer|year=2013|isbn=978-1461471370}}</ref><ref name=":3">{{cite journal | vauthors = Haws DC, Rish I, Teyssedre S, He D, Lozano AC, Kambadur P, Karaman Z, Parida L | display-authors = 6 | title = Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods | journal = PLOS ONE | volume = 10 | issue = 10 | pages = e0138903 | date = 2015-10-06 | pmid = 26439851 | pmc = 4595020 | doi = 10.1371/journal.pone.0138903 | bibcode = 2015PLoSO..1038903H }}</ref>). Researchers have proposed various methodologies that deal with this problem as well as how to generate the weights of the SNPs, <math>\hat{\beta}_{j}</math>, and how to determine which <math>{m}</math> SNPs should be included.
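
The weighted sum above can be illustrated with a minimal Python sketch. It assumes that each genotype <math>X_j</math> is coded as the number of effect alleles carried (0, 1 or 2), which is a common convention but is not stated in the sources cited here. The function name <code>polygenic_score</code>, the example genotypes and the weights are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def polygenic_score(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Return the weighted sum S = sum_j X_j * beta_j over m SNPs."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must cover the same m SNPs")
    return sum(x * b for x, b in zip(genotypes, weights))


# Hypothetical individual genotyped at m = 4 SNPs (effect-allele counts, an assumed coding).
example_genotypes = [0, 1, 2, 1]
# Made-up effect-size estimates; real weights would come from a GWAS or another fitted model.
example_weights = [0.10, -0.05, 0.20, 0.00]

example_score = polygenic_score(example_genotypes, example_weights)
```

In this sketch the score depends only on the listed SNPs, so the choice of which <math>{m}</math> SNPs to include and how their weights are estimated determines the result.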
Line 26: Line 74:
=== Penalized regression ===
[[Regularized least squares|Penalized regression]] methods, such as [[Lasso (statistics)|LASSO]] and [[ridge regression]], can also be used to improve the accuracy of polygenic scores. Penalized regression can be interpreted as placing informative prior probabilities on how many genetic variants are expected to affect a trait, and the distribution of their effect sizes. In other words, these methods in effect "penalize" the large coefficients in a regression model and shrink them conservatively. Ridge regression accomplishes this by shrinking the prediction with a term that penalizes the sum of the squared coefficients.<ref name="Vlaming2015" /> LASSO accomplishes something similar by penalizing the sum of absolute coefficients.<ref>{{cite journal | vauthors = Vattikuti S, Lee JJ, Chang CC, Hsu SD, Chow CC | title = Applying compressed sensing to genome-wide association studies | journal = GigaScience | volume = 3 | issue = 1 | pages = 10 | year = 2014 | pmid = 25002967 | doi = 10.1186/2047-217X-3-10 | pmc = 4078394 | doi-access = free }}</ref> Bayesian counterparts exist for LASSO and ridge regression, and other priors have been suggested and used. They can perform better in some circumstances.<ref>{{cite journal | vauthors = Gianola D, Rosa GJ | title = One hundred years of statistical developments in animal breeding | journal = Annual Review of Animal Biosciences | volume = 3 | pages = 19–56 | year = 2015 | pmid = 25387231 | doi = 10.1146/annurev-animal-022114-110733 }}</ref> A multi-dataset, multi-method study<ref name=":3" /> found that of 15 different methods compared across four datasets, [[Minimum redundancy feature selection|minimum redundancy maximum relevance]] was the best performing method. Furthermore, [[Feature selection|variable selection]] methods tended to outperform other methods. Variable selection methods do not use all the available genomic variants present in a dataset, but attempt to select an optimal subset of variants to use. 
This leads to less overfitting but more bias (see [[Bias–variance tradeoff|bias-variance tradeoff]]).
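
As an illustrative sketch in the notation used above, the two penalties can be written as optimisation problems. The phenotype values <math>y_i</math>, the training sample size <math>n</math>, the per-individual genotypes <math>X_{ij}</math> and the tuning parameter <math>\lambda \ge 0</math> are symbols introduced here for illustration and are not taken from the cited studies:
:<math>\hat{\beta}^{\text{ridge}} = \underset{\beta}{\operatorname{arg\,min}} \left\{ \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} X_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{m} \beta_j^2 \right\}</math>
:<math>\hat{\beta}^{\text{LASSO}} = \underset{\beta}{\operatorname{arg\,min}} \left\{ \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} X_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{m} \left| \beta_j \right| \right\}</math>
Larger values of <math>\lambda</math> shrink the weights more strongly. With the absolute-value penalty some weights typically become exactly zero, so LASSO also acts as a form of variable selection.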

== Validation methods ==
Broadly speaking, two methods are used to validate a PRS.

# Test prediction quality in a new dataset containing individuals not used in the training of the predictor. This out-of-sample validation is now a standard requirement in peer review of new genomic predictors. Ideally, these individuals would have experienced a different environment from the training set (e.g., were born and raised in a different part of the world, or in different decades). Examples of large-scale out-of-sample validations include coronary artery disease (CAD) in French Canadians<ref>{{cite web
|url=https://www.ahajournals.org/doi/10.1161/CIRCGEN.119.002481
|title=Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians
|vauthors=Wünnemann F, Ken Sin L, Langford-Avelar A, Bussell D, Dubé MP, Tardif JC, Lettre G
|date=2019-06-11
|website=Circulation: Genomic and Precision Medicine
|publisher=AHA Journals
|access-date=2021-04-12}}
</ref>, breast cancer<ref>{{cite journal
|url=https://www.sciencedirect.com/science/article/pii/S0002929718304051
|title=Polygenic risk scores for prediction of breast cancer and breast cancer subtypes.
|vauthors=Mavaddat N, et al.
|date=2019-01-03
|journal=The American Journal of Human Genetics
|volume=104
|issue=1
|doi=10.1016/j.ajhg.2018.11.002
|pmid=30554720
|access-date=2021-04-12
}}
</ref>, blood and urine biomarkers<ref name="Widen2021">{{cite journal
|url=https://www.medrxiv.org/content/10.1101/2021.04.01.21254711v1
|title=Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank
|vauthors=Widen E, Raben TG, Lello L, Hsu SD
|date=2021-04-01
|journal=MedRxiv
|volume=preprint
|doi=10.1101/2021.04.01.21254711
|access-date=2021-04-12
}}
</ref>, among many more.
# Perhaps the most rigorous validation method is to compare siblings who have grown up together. It has been shown that PRS can predict which of two brothers or which of two sisters has a specific condition, such as heart disease or breast cancer. The predictors work almost as well in predicting sibling disease status as when comparing two random individuals from the general population who did not share family environments while growing up. This is strong evidence for causal genetic effects. These results also suggest that embryo selection using PRS can reduce disease risk for children born through IVF.<ref name="GeeWhiz" /><ref name="Treff2020">{{cite journal
| vauthors = Treff NR, Eccles J, Marin D, Messick E, Lello L, Gerber J, Jia X, Tellier LC
| date = 2020-06-12
| title = Preimplantation Genetic Testing for Polygenic Disease Relative Risk Reduction: Evaluation of Genomic Index Performance in 11,883 Adult Sibling Pairs.
| url = https://www.mdpi.com/2073-4425/11/6/648
| journal = Genes
| volume = 11
| issue = 6
| pages = 648
| doi = 10.3390/genes11060648
| pmid = 32545548
| access-date = 2021-04-12
}}
</ref><ref name="Lello2020Sibling">{{cite journal
| vauthors = Lello L, Raben TG, Hsu SD
| date = 2020-08-06
| title = Sibling validation of polygenic risk scores and complex trait prediction.
| url = https://www.nature.com/articles/s41598-020-69927-7
| journal = Scientific Reports
| volume = 10
| issue = 1
| doi = 10.1038/s41598-020-69927-7
| pmid = 32764582
| access-date = 2021-04-12
}}
</ref><ref>{{cite web
|url=https://www.ahajournals.org/doi/abs/10.1161/CIRCGEN.120.003262
|title=Concordance of a High Polygenic Score Among Relatives
|vauthors=Reid NJ, Brockman DG, Leonard CE, Pelletier R, Khera AV
|date=2021-04-02
|website=Circulation: Genomic and Precision Medicine
|publisher=AHA Journals
|access-date=2021-04-12
}}
</ref>
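
Both validation approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in the cited studies: out-of-sample quality is summarised here by the area under the ROC curve (AUC), and sibling validation by the fraction of discordant sibling pairs in which the affected sibling has the higher score. The function names and all score values are hypothetical.

```python
from typing import Sequence, Tuple


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sibling_pair_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the affected sibling has the higher score.

    Each pair is (score of affected sibling, score of unaffected sibling).
    """
    correct = 0.0
    for affected, unaffected in pairs:
        if affected > unaffected:
            correct += 1.0
        elif affected == unaffected:
            correct += 0.5
    return correct / len(pairs)


# Made-up scores for individuals who were NOT part of the training sample.
held_out_cases = [0.9, 0.4, 0.7]
held_out_controls = [0.1, 0.5, 0.3]
out_of_sample_auc = auc(held_out_cases, held_out_controls)

# Made-up discordant sibling pairs: (affected sibling, unaffected sibling).
sibling_pairs = [(0.8, 0.2), (0.3, 0.6), (0.5, 0.1), (0.7, 0.4)]
sibling_accuracy = sibling_pair_accuracy(sibling_pairs)
```

A value of 0.5 for either quantity corresponds to chance-level prediction, so a sibling-pair accuracy close to the population-level AUC is the pattern described in the second validation method.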



== Predictive performance ==
Line 55: Line 174:


The use of polygenic scores for [[embryo selection]] has been criticised due to ethical and safety issues as well as limited practical utility.<ref>{{Cite web|last=Birney|first=Ewan | name-list-style = vanc |date=|title=Why using genetic risk scores on embryos is wrong|url=http://ewanbirney.com/2019/11/why-using-genetic-risk-scores-on-embryos-is-wrong.html|url-status=live|archive-url=|archive-date=|access-date=2020-12-16|website=ewanbirney.com}}</ref><ref>{{cite journal | vauthors = Karavani E, Zuk O, Zeevi D, Barzilai N, Stefanis NC, Hatzimanolis A, Smyrnis N, Avramopoulos D, Kruglyak L, Atzmon G, Lam M, Lencz T, Carmi S | display-authors = 6 | title = Screening Human Embryos for Polygenic Traits Has Limited Utility | language = English | journal = Cell | volume = 179 | issue = 6 | pages = 1424–1435.e8 | date = November 2019 | pmid = 31761530 | doi = 10.1016/j.cell.2019.10.033 | pmc = 6957074 }}</ref><ref>{{cite journal | vauthors = Lázaro-Muñoz G, Pereira S, Carmi S, Lencz T | title = Screening embryos for polygenic conditions and traits: ethical considerations for an emerging technology | journal = Genetics in Medicine | pages = 1–3 | date = October 2020 | pmid = 33106616 | doi = 10.1038/s41436-020-01019-3 }}</ref> As of 2019, polygenic scores from well over a hundred phenotypes have been developed from genome-wide association statistics.<ref name="PGSCatalog">{{cite web |title=The Polygenic Score (PGS) Catalog | quote = An open database of polygenic scores and the relevant metadata required for accurate application and evaluation |url=http://www.pgscatalog.org |website=Polygenic Score (PGS) Catalog |access-date=29 April 2020}}</ref> These include scores that can be categorized as anthropometric, behavioural, cardiovascular, non-cancer illness, psychiatric/neurological, and response to treatment/medication.<ref name="atlas">{{cite journal | vauthors = Richardson TG, Harrison S, Hemani G, Davey Smith G | title = An atlas of polygenic risk score associations to 
highlight putative causal relationships across the human phenome | journal = eLife | volume = 8 | pages = e43657 | date = March 2019 | pmid = 30835202 | pmc = 6400585 | doi = 10.7554/eLife.43657 | doi-access = free }}</ref>


== Research on clinical applications ==
A 2019 Perspective article in ''The New England Journal of Medicine'' stated:<ref>{{cite journal
| vauthors = Hunter DJ, Drazen JM
| date = 2019-06-20
| title = Has the Genome Granted Our Wish Yet?
| url = https://www.nejm.org/doi/full/10.1056/NEJMp1904511
| journal = The New England Journal of Medicine
| volume = 380
| issue = 25
| pages = 2391–2393
| doi = 10.1056/NEJMp1904511
| pmid = 31091368
| access-date = 2021-04-12
}}
</ref>
<blockquote>
"It is likely that tailoring decisions about prescribing preventive medicines or screening practices will be the main future use of genetic risk scores. If a PRS adds to existing clinical predictors of risk such as the Framingham Risk Score or the Q index for heart disease, it could be incorporated into preventive care as readily as any other biomarker."

"There seems little doubt that interpretation of these scores will become an accepted part of clinical practice in the future..."
</blockquote>
As of 2019, the UK [[National Health Service]] planned to genotype 5 million individuals and to study the incorporation of PRS into standard clinical care.<ref>{{cite web
|url=https://www.theguardian.com/society/2019/mar/23/are-predictive-genetic-test-useful-to-predict-cancer-matt-hancock
|title=Are genetic tests useful to predict cancer?
|last=Devlin
|first=Hannah
|date=2019-03-23
|website=The Guardian
|access-date=2021-04-12
}}
</ref>

Commercial entities such as Myriad and Ambry provide polygenic breast cancer risk prediction, in addition to tests for monogenic risk alleles such as BRCA1 and BRCA2.<ref>{{cite web
|url=https://myriadmyrisk.com
|title=MYRIAD MyRisk
|website=MYRIAD
|access-date=2021-04-12}}
</ref>
<ref>{{cite web
|url=https://www.ambrygen.com/providers/genetic-testing/147/oncology/brcanext-expanded
|title=BRCANext-Expanded™
|website=Ambry Genetics
|access-date=2021-04-12}}
</ref>




== Non-predictive uses ==
In humans, polygenic scores were originally computed in an effort to predict the prevalence and etiology of complex, [[heritable]] diseases, which are typically affected by many genetic variants that individually confer a small effect on overall risk. A [[genome-wide association study]] (GWAS) of such a [[polygenic]] trait is able to identify these individual genetic loci of small effect in a large enough sample, and various methods of aggregating the results can be used to form a polygenic score.{{clarify|date=May 2020}} This score will typically explain at least a few percent of a phenotype's variance, and can therefore be assumed to effectively incorporate a significant fraction of the genetic variants affecting that phenotype. A polygenic score can be used in several different ways: as a lower bound to test whether heritability estimates may be biased; as a measure of genetic overlap of traits ([[genetic correlation]]), which might indicate e.g. shared genetic bases for groups of mental disorders; as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to [[natural selection]] indicative of a [[soft selective sweep]] (e.g. for intelligence, where the changes in frequency would be too small to detect on each individual hit but not on the overall polygenic score); in [[Mendelian randomization]] (assuming no [[pleiotropy]] with relevant traits); to detect and control for the presence of genetic confounds in outcomes (e.g. the correlation of schizophrenia with poverty); or to investigate [[gene–environment interaction]]s and [[gene–environment correlation|correlation]]s.

== Genetic architecture ==
It is possible to analyze the specific genetic variants (SNPs) utilized in human complex trait predictors, which can vary from hundreds to as many as thirty thousand. There are now dozens of well-validated PRS, for phenotypes including disease conditions (diabetes, heart disease, cancer) and quantitative traits (height, bone density, biomarkers).<ref name="Lello2019" />

The fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits—i.e., existing PRS cannot be computed from exome data alone.

The fraction of SNPs and of total variance that is in common between pairs of predictors is typically small. This is counter to previous intuitions concerning pleiotropy: it had been assumed that primarily protein-coding genic regions must be responsible for phenotype variation, and since the number of genes is limited (making up at most a few percent of the entire genome) any causal variant would be likely to affect multiple phenotypes or disease risks.<ref>{{cite book
|last=Judson
|first=Horace Freeland
|date=1996
|title=The Eighth Day of Creation: Makers of the Revolution in Biology
|location=Plainview, N.Y.
|publisher=Cold Spring Harbor Laboratory
|isbn=0-87969-478-5
}}
</ref> Previous reasoning concerning pleiotropy does not take into account the very high dimensionality of genomic information space. Once it is realized that causal variants can be located far from protein-coding genic regions, the space of possibilities becomes immensely larger.

Direct analysis of existing PRS shows that the DNA regions used in disease risk predictors seem to be largely disjoint, suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.<ref name="Lello2019" />

Given roughly 10 million common variants between any two individual humans, and typically a few thousand SNPs (in largely disjoint regions) used to capture most of the variance in a phenotype predictor, the dimensionality of the space of individual variation (phenotypes) has been theorized to be on the order of a few thousand.<ref name="Raben2021" />

Another aspect of genetic architecture that was not broadly anticipated is that additive, or linear, models are capable of capturing most of the expected phenotypic variation. Tests for nonlinear effects (e.g., interactions between alleles) have typically found only small effects.<ref>{{cite bioRxiv
|vauthors=Hivert V, Sidorenko J, Rohart F, Goddard ME, Yang J, Wray NR, Yengo L, Visscher PM
|date= 2020-11-09
|title= Estimation of non-additive genetic variance in human complex traits from a large sample of unrelated individuals
|biorxiv=10.1101/2020.11.09.375501}}
</ref><ref name="Raben2021"/> This approximate linearity also reduces the effects of pleiotropy (interactions between different genetic variants are smaller than expected), and increases confidence that PRS construction is tractable, with considerable improvements in the near term as datasets increase in size.<ref name="Raben2021" />



== References ==

Revision as of 23:48, 12 April 2021

An illustration of the distribution and stratification ability of a polygenic risk score

In genetics, a polygenic score, also called a polygenic risk score (PRS), genetic risk score, or genome-wide score, is a number that summarises the estimated effect of many genetic variants on an individual's phenotype, typically calculated as a weighted sum of trait-associated alleles.[1][2][3] It reflects an individual's estimated genetic predisposition for a given trait and can be used as a predictor for that trait.[4][5][6][7][8] In other words, it gives an estimate of how likely an individual is to have a given trait based only on genetics, without taking environmental factors into account. Polygenic scores are widely used in animal breeding and plant breeding (usually termed genomic prediction or genomic selection) due to their efficacy in improving livestock breeding and crops.[9]

Recent progress in machine learning (ML) analysis of large genomic datasets has enabled the creation of polygenic predictors of complex human traits, including risk for many important complex diseases,[10][11] which are typically affected by many genetic variants that each confer a small effect on overall risk.[12][13] In a polygenic risk predictor the lifetime (or age-range) risk for the disease is a numerical function (Polygenic Risk Score or PRS) which depends on the states of thousands of individual genetic variants (i.e., Single Nucleotide Polymorphisms, or SNPs).

Polygenic risk scores are an area of intense scientific investigation: hundreds of papers are written each year on topics such as learning algorithms for genomic prediction, new predictor training, validation testing of predictors, and clinical application of PRS.[14][15][16][6][11] In 2018 the American Heart Association named polygenic risk scores as one of the major breakthroughs in research in heart disease and stroke.[17]


History

An early (2006) example of a genetic risk score applied to Type 2 Diabetes in humans. Individuals with Type 2 diabetes (white bars) have a higher score than controls (black bars).[18]

One of the first precursors to the modern polygenic score was proposed under the term marker-assisted selection (MAS) in 1990.[19] According to MAS, breeders are able to increase the efficiency of artificial selection by estimating the regression coefficients of genetic markers that are correlated with differences in the trait of interest and assigning individual animals a "score" from this information. A major development of these fundamentals was proposed in 2001 by researchers who discovered that the use of a Bayesian prior could help to mitigate the problem of the number of markers being greater than the sample of animals.[20]

These methods were first applied to humans in the late 2000s, starting with a proposal in 2007 that these scores could be used in human genetics to identify individuals at high risk for disease.[21] This was successfully applied in empirical research for the first time in 2009 by researchers who organized a genome-wide association study (GWAS) of schizophrenia to construct scores of risk propensity. This study was also the first to use the term polygenic score for a prediction drawn from a linear combination of single-nucleotide polymorphism (SNP) genotypes, which was able to explain 3% of the variance in schizophrenia.[22]

Methods of construction

A polygenic score (PGS) is constructed from the "weights" derived from a genome-wide association study (GWAS), or from some form of machine learning algorithm. In a GWAS, a set of genetic markers (usually SNPs) is genotyped on a training sample, and effect sizes are estimated for each marker's association with the trait of interest. These weights are then used to assign individualized polygenic scores in an independent replication sample.[1] The estimated score, <math>\hat{S}</math>, generally follows the form

<math>\hat{S} = \sum_{j=1}^{m} X_j \hat{\beta}_j</math>,

where the <math>\hat{S}</math> of an individual is equal to the weighted sum of the individual's marker genotypes, <math>X_j</math>, at <math>m</math> SNPs.[1] Weights are estimated using some form of regression analysis. Because the number of genomic variants is usually larger than the sample size, one cannot use OLS multiple regression (p > n problem[23][24]). Researchers have proposed various methodologies that deal with this problem as well as how to generate the weights of the SNPs, <math>\hat{\beta}_j</math>, and how to determine which SNPs should be included.
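
The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, assuming genotypes are coded as allele counts (0, 1 or 2) and that a weight has already been estimated for each SNP; the function name and the example values are hypothetical and do not come from any published score.

```python
from typing import Sequence


def polygenic_score(genotypes: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of marker genotypes for one individual."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must cover the same SNPs")
    return sum(x * b for x, b in zip(genotypes, weights))


# Hypothetical individual: allele counts at m = 4 SNPs and made-up weights.
example_genotypes = [0, 1, 2, 1]
example_weights = [0.10, -0.05, 0.20, 0.00]
example_score = polygenic_score(example_genotypes, example_weights)
```

In practice the same sum is taken over thousands of SNPs, and the resulting score is usually interpreted relative to the distribution of scores in a reference population rather than as an absolute number.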

Pruning and thresholding

The simplest so-called "pruning and thresholding" method of construction sets weights equal to the coefficient estimates from a regression of the trait on each genetic variant. The included SNPs may be selected using an algorithm that attempts to ensure that each marker is approximately independent. Failing to account for non-random association of genetic variants will typically reduce the score's predictive accuracy. This is important because genetic variants are often correlated with other nearby variants, such that the weight of a causal variant will be attenuated if it is more strongly correlated with its neighbors than a null variant. This is called linkage disequilibrium, a common phenomenon that arises from the shared evolutionary history of neighboring genetic variants. Further restriction can be achieved by testing multiple sets of SNPs selected at various thresholds, such as all SNPs which are genome-wide statistically significant hits, all SNPs with p < 0.05, or all SNPs with p < 0.50, and using the set with the greatest performance for further analysis; especially for highly polygenic traits, the best polygenic score will tend to use most or all SNPs.[25]
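
The selection step can be illustrated with a short Python sketch. It assumes a greedy variant of pruning (sometimes called clumping) in which SNPs are visited from the smallest p-value upwards and a SNP is kept only if its squared correlation with every SNP already kept is below a cutoff. The SNP identifiers, effect sizes, p-values, LD values and the default cutoffs are all hypothetical.

```python
def prune_and_threshold(snps, ld_r2, p_threshold=0.05, r2_max=0.1):
    """Return {snp_id: weight} for SNPs passing the p-value threshold
    that are approximately independent of the SNPs already selected.

    snps  : list of dicts with keys "id", "beta" (marginal effect) and "p"
    ld_r2 : dict mapping frozenset({id_1, id_2}) to squared correlation;
            pairs that are absent are treated as uncorrelated
    """
    kept = []
    for snp in sorted(snps, key=lambda s: s["p"]):
        if snp["p"] >= p_threshold:
            break  # list is sorted, so every remaining SNP also fails
        in_ld = any(
            ld_r2.get(frozenset((snp["id"], other["id"])), 0.0) >= r2_max
            for other in kept
        )
        if not in_ld:
            kept.append(snp)
    return {s["id"]: s["beta"] for s in kept}


# Hypothetical summary statistics; rsA and rsB are in strong LD.
toy_snps = [
    {"id": "rsA", "beta": 0.20, "p": 1e-8},
    {"id": "rsB", "beta": 0.18, "p": 1e-6},
    {"id": "rsC", "beta": -0.05, "p": 0.01},
    {"id": "rsD", "beta": 0.01, "p": 0.40},
]
toy_ld = {frozenset(("rsA", "rsB")): 0.9}

weights_strict = prune_and_threshold(toy_snps, toy_ld, p_threshold=0.05)
weights_lenient = prune_and_threshold(toy_snps, toy_ld, p_threshold=0.50)
```

Running the function at several thresholds and comparing the resulting scores in held-out data corresponds to the threshold search described above.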

Bayesian methods

Bayesian approaches, originally pioneered in concept in 2001,[20] attempt to explicitly model preexisting genetic architecture, thereby accounting for the distribution of effect sizes with a prior that should improve the accuracy of a polygenic score. One of the most popular modern Bayesian methods uses "linkage disequilibrium prediction" (LDpred for short) to set the weight for each SNP equal to the average of its posterior distribution after linkage disequilibrium has been accounted for. LDpred tends to outperform simpler methods of pruning and thresholding, especially at large sample sizes; for example, its estimations have improved the predicted variance of a polygenic score for schizophrenia in a large data set from 20.1% to 25.3%.[8]
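
For intuition only, the effect of a prior can be shown in a simplified special case. The sketch below assumes unlinked markers, standardized genotypes and phenotype, and a Gaussian ("infinitesimal") prior in which every marker contributes equally to the heritability; under these assumptions the posterior mean is the marginal estimate multiplied by a single shrinkage factor. This is not the LDpred algorithm itself, which additionally models linkage disequilibrium; the function name and the numbers are hypothetical.

```python
def infinitesimal_shrinkage(beta_hat, h2, n_samples):
    """Shrink marginal effect estimates towards zero.

    Assumes unlinked markers and a Gaussian prior, so that the posterior
    mean is beta_hat * h2 / (h2 + m / n), with m the number of markers.
    """
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    m = len(beta_hat)
    factor = h2 / (h2 + m / n_samples)
    return [factor * b for b in beta_hat]


# Hypothetical inputs: 4 markers, heritability 0.5, 8 training samples,
# giving a shrinkage factor of 0.5 / (0.5 + 4 / 8) = 0.5.
shrunk = infinitesimal_shrinkage([0.2, -0.4, 0.0, 0.1], h2=0.5, n_samples=8)
```

The factor approaches 1 as the training sample grows relative to the number of markers, which is one way to see why such methods gain the most at large sample sizes.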

Penalized regression

Penalized regression methods, such as LASSO and ridge regression, can also be used to improve the accuracy of polygenic scores. Penalized regression can be interpreted as placing informative prior probabilities on how many genetic variants are expected to affect a trait, and the distribution of their effect sizes. In other words, these methods in effect "penalize" the large coefficients in a regression model and shrink them conservatively. Ridge regression accomplishes this by shrinking the prediction with a term that penalizes the sum of the squared coefficients.[4] LASSO accomplishes something similar by penalizing the sum of absolute coefficients.[26] Bayesian counterparts exist for LASSO and ridge regression, and other priors have been suggested and used. They can perform better in some circumstances.[27] A multi-dataset, multi-method study[24] found that of 15 different methods compared across four datasets, minimum redundancy maximum relevance was the best performing method. Furthermore, variable selection methods tended to outperform other methods. Variable selection methods do not use all the available genomic variants present in a dataset, but attempt to select an optimal subset of variants to use. This leads to less overfitting but more bias (see bias-variance tradeoff).
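
The difference between the two penalties is easiest to see in an idealized case. The sketch below assumes orthonormal predictors, which real SNP data do not satisfy because of linkage disequilibrium; under that assumption ridge regression rescales every ordinary least squares coefficient by the same factor, while LASSO subtracts a constant and sets small coefficients exactly to zero. Function names and numbers are hypothetical.

```python
import math


def ridge_shrink(beta_ols, lam):
    """Ridge solution for orthonormal predictors: proportional shrinkage."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """LASSO solution for orthonormal predictors: soft thresholding."""
    return [math.copysign(max(abs(b) - lam, 0.0), b) for b in beta_ols]


# Hypothetical least squares estimates for three variants.
ols = [1.0, -0.5, 0.05]
ridge_weights = ridge_shrink(ols, lam=1.0)   # every weight is halved
lasso_weights = lasso_shrink(ols, lam=0.1)   # the smallest weight becomes zero
```

Because LASSO removes variants entirely, it acts as a variable selection method in the sense used above, whereas ridge regression keeps every variant with a reduced weight.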

Validation methods

Broadly speaking there are two methods used for PRS validation.

  1. Test prediction quality in a new dataset containing individuals not used in the training of the predictor. This out-of-sample validation is now a standard requirement in peer review of new genomic predictors. Ideally these individuals would have experienced a different environment than the training set (e.g., were born and raised in a different part of the world, or in different decades). Examples of large-scale out-of-sample validations include coronary artery disease (CAD) in French Canadians,[28] breast cancer,[29] and blood and urine biomarkers,[30] among many more.
  2. Perhaps the most rigorous validation method is to compare siblings who have grown up together. It has been shown that PRS can predict which of two brothers or which of two sisters has a specific condition, such as heart disease or breast cancer. The predictors work almost as well in predicting sibling disease status as when comparing two random individuals from the general population who did not share family environments while growing up. This is strong evidence for causal genetic effects. These results also suggest that embryo selection using PRS can reduce disease risk for children born through IVF.[14][31][32][33]
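
Both checks can be expressed as simple summary statistics. The sketch below is illustrative only: it assumes that out-of-sample quality is summarized by the Pearson correlation between score and phenotype in held-out individuals, and that the sibling test is summarized by the fraction of sibling pairs with exactly one affected member in which the affected sibling has the higher score (ties counted as half). All data values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def sibling_pair_accuracy(pairs):
    """Fraction of pairs (score_affected, score_unaffected) ranked correctly."""
    if not pairs:
        raise ValueError("no sibling pairs given")
    correct = 0.0
    for affected, unaffected in pairs:
        if affected > unaffected:
            correct += 1.0
        elif affected == unaffected:
            correct += 0.5  # a tie is no better than a coin flip
    return correct / len(pairs)


# Hypothetical held-out data.
holdout_r = pearson_r([1, 2, 3], [2, 4, 6])
sibling_acc = sibling_pair_accuracy([(1.2, 0.3), (0.1, 0.4), (0.9, 0.9), (2.0, -1.0)])
```

A sibling accuracy close to the accuracy obtained for pairs of unrelated individuals is the pattern the second method looks for, since siblings share much of their family environment.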


Predictive performance

The benefit of polygenic scores is that they can be used to predict trait outcomes in crops, livestock, and humans alike. Although the same basic concepts underlie these areas of prediction, they face different challenges that require different methodologies. The ability to produce very large family sizes in nonhuman species, accompanied by deliberate selection, leads to a smaller effective population, higher degrees of linkage disequilibrium among individuals, and a higher average genetic relatedness among individuals within a population. For example, members of plant and animal breeds that humans have effectively created, such as modern maize or domestic cattle, are all technically "related". In human genomic prediction, by contrast, unrelated individuals in large populations are selected to estimate the effects of common SNPs. Because of the smaller effective population size in livestock, the mean coefficient of relationship between any two individuals is likely high, and common SNPs will tag causal variants at greater physical distance than for humans; this is the major reason for lower SNP-based heritability estimates for humans compared to livestock. In both cases, however, sample size is key for maximizing the accuracy of genomic prediction.[34]

While modern genomic prediction scoring in humans is generally referred to as a "polygenic score" (PGS) or a "polygenic risk score" (PRS), in livestock the more common term is "genomic estimated breeding value", or GEBV (similar to the more familiar "EBV", but with genotypic data). Conceptually, a GEBV is the same as a PGS: a linear function of genetic variants that are each weighted by the apparent effect of the variant. Despite this, polygenic prediction in livestock is useful for a fundamentally different reason than for humans. In humans, a PRS is used for the prediction of individual phenotype, while in livestock a GEBV is typically used to predict the offspring’s average value of a phenotype of interest in terms of the genetic material it inherited from a parent. In this way, a GEBV can be understood as the average of the offspring of an individual or pair of individual animals. GEBVs are also typically communicated in the units of the trait of interest. For example, the expected increase in milk production of the offspring of a specific parent compared to the offspring from a reference population might be a typical way of using a GEBV in dairy cow breeding and selection.[34]

Some accuracy values are given in the sections below for comparison purposes. These are given in terms of correlations and have been converted from explained variance if given in that format in the source.
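
The conversion mentioned above is a square root: a score that explains a fraction R² of the variance has a correlation of r = √R² with the trait. The short Python sketch below applies this to the 3% figure quoted in the History section; the function name is hypothetical, and the calculation ignores complications such as the scale on which variance is reported for case/control traits.

```python
import math


def variance_explained_to_r(r_squared):
    """Convert a fraction of variance explained into a correlation."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    return math.sqrt(r_squared)


# 3% of variance explained corresponds to a correlation of about 0.17.
r_from_three_percent = variance_explained_to_r(0.03)
```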

In plants

The predictive value of polygenic scoring has large practical benefits for plant and animal breeding because it increases the selection precision and allows for shorter generations, both of which speed up evolution.[35] Genomic prediction with some version of polygenic scoring has been used in experiments on maize, small grains such as barley, wheat, oats and rye, and rice biparental families. In many cases, these predictions have been so successful that researchers have advocated for their use in addressing the challenges of global population growth and climate change.[9]

  • In 2015, r ≈ 0.55 for total root length in maize.[36]
  • In 2014, r ≈ 0.03 to 0.99 across four traits in barley.[37]

In non-human animals

In humans

For humans, while most polygenic scores are not predictive enough to diagnose disease, they could potentially be used in addition to other covariates (such as age, BMI, smoking status) to improve estimates of disease susceptibility.[41][2][13] Although issues such as systematically poorer performance in individuals of non-European ancestry limit widespread ethical and practical use,[42] several authors have noted that many causal variants that underlie common genetic variation in Europeans are shared across different continents, for example for BMI and type 2 diabetes in African populations[43] as well as schizophrenia in Chinese populations.[44] Other researchers recognize that polygenic underprediction in non-European populations should galvanize new GWAS that prioritize greater genetic diversity in order to maximize the potential health benefits brought about by predictive polygenic scores.[45]

  • In 2016, r ≈ 0.30 for educational attainment variation at age 16.[46] This polygenic score was based on a GWAS using data from 293,000 people.[47]
  • In 2016, r ≈ 0.31 for case/control status for first-episode psychosis.[48]
  • In 2017, r ≈ 0.29 for case/control status for schizophrenia in combined European and Chinese samples.[44]
  • In 2018, r ≈ 0.67 for height variation in adulthood, resulting in prediction within ~3 cm for most individuals in the study.[49]
  • In 2018, r ≈ 0.23 for intelligence from samples of 269,867 Europeans.[50]
  • In 2018, r ≈ 0.33 to 0.36 for educational attainment and r ≈ 0.26 to 0.32 for intelligence from over 1.1 million Europeans.[51]

The use of polygenic scores for embryo selection has been criticised due to ethical and safety issues as well as limited practical utility.[52][53][54] As of 2019, polygenic scores from well over a hundred phenotypes have been developed from genome-wide association statistics.[55] These include scores that can be categorized as anthropometric, behavioural, cardiovascular, non-cancer illness, psychiatric/neurological, and response to treatment/medication.[56]


Research on clinical applications

A 2019 New England Journal of Medicine Perspective stated:[57]

"It is likely that tailoring decisions about prescribing preventive medicines or screening practices will be the main future use of genetic risk scores. If a PRS adds to existing clinical predictors of risk such as the Framingham Risk Score or the Q index for heart disease, it could be incorporated into preventive care as readily as any other biomarker."

"There seems little doubt that interpretation of these scores will become an accepted part of clinical practice in the future..."

As of 2019, the UK National Health Service planned to genotype 5 million individuals and to study the incorporation of PRS into standard clinical care.[58]

Commercial entities such as Myriad and Ambry provide polygenic breast cancer risk prediction, in addition to tests for monogenic risk alleles such as BRCA1 and BRCA2.[59][60]


Non-predictive uses

In humans, polygenic scores were originally computed in an effort to predict the prevalence and etiology of complex, heritable diseases, which are typically affected by many genetic variants that individually confer a small effect on overall risk. A genome-wide association study (GWAS) of such a polygenic trait is able to identify these individual genetic loci of small effect in a large enough sample, and various methods of aggregating the results can be used to form a polygenic score.[clarification needed] This score will typically explain at least a few percent of a phenotype's variance, and can therefore be assumed to effectively incorporate a significant fraction of the genetic variants affecting that phenotype. A polygenic score can be used in several different ways: as a lower bound to test whether heritability estimates may be biased; as a measure of genetic overlap of traits (genetic correlation), which might indicate e.g. shared genetic bases for groups of mental disorders; as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to natural selection indicative of a soft selective sweep (e.g. for intelligence, where the changes in frequency would be too small to detect on each individual hit but not on the overall polygenic score); in Mendelian randomization (assuming no pleiotropy with relevant traits); to detect and control for the presence of genetic confounds in outcomes (e.g. the correlation of schizophrenia with poverty); or to investigate gene–environment interactions and correlations.

Genetic architecture

It is possible to analyze the specific genetic variants (SNPs) utilized in human complex trait predictors, which can vary from hundreds to as many as thirty thousand. There are now dozens of well-validated PRS, for phenotypes including disease conditions (diabetes, heart disease, cancer) and quantitative traits (height, bone density, biomarkers).[11]

The fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits—i.e., existing PRS cannot be computed from exome data alone.

The fraction of SNPs and of total variance that is in common between pairs of predictors is typically small. This is counter to previous intuitions concerning pleiotropy: it had been assumed that primarily protein-coding genic regions must be responsible for phenotype variation, and since the number of genes is limited (making up at most a few percent of the entire genome) any causal variant would be likely to affect multiple phenotypes or disease risks.[61] Previous reasoning concerning pleiotropy does not take into account the very high dimensionality of genomic information space. Once it is realized that causal variants can be located far from protein-coding genic regions, the space of possibilities becomes immensely larger.

Direct analysis of existing PRS shows that the DNA regions used in disease risk predictors seem to be largely disjoint, suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.[11]
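
The kind of comparison described here can be sketched as a pairwise overlap calculation. The example below is a simplification: it treats each predictor as a set of SNP identifiers and reports, for every pair, the share of the smaller predictor's SNPs that also appear in the other, whereas a real analysis would compare genomic regions and variance explained. The trait names and SNP identifiers are hypothetical.

```python
from itertools import combinations


def overlap_fraction(snps_a, snps_b):
    """Share of the smaller SNP set that is also present in the other set."""
    a, b = set(snps_a), set(snps_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical predictors, each reduced to the SNPs it uses.
predictors = {
    "trait_A": {"rs1", "rs2", "rs3", "rs4"},
    "trait_B": {"rs4", "rs5", "rs6"},
    "trait_C": {"rs7", "rs8"},
}

pairwise_overlap = {
    (p, q): overlap_fraction(predictors[p], predictors[q])
    for p, q in combinations(sorted(predictors), 2)
}
```

Small values across all pairs would correspond to the largely disjoint predictors described above.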

Given roughly 10 million common variants between any two individual humans, and typically a few thousand SNPs (in largely disjoint regions) used to capture most of the variance in a phenotype predictor, the dimensionality of the space of individual variation (phenotypes) has been theorized to be on the order of a few thousand.[16]

Another aspect of genetic architecture that was not broadly anticipated is that additive, or linear, models are capable of capturing most of the expected phenotypic variation. Tests for nonlinear effects (e.g., interactions between alleles) have typically found only small effects.[62][16] This approximate linearity also reduces the effects of pleiotropy (interactions between different genetic variants are smaller than expected), and increases confidence that PRS construction is tractable, with considerable improvements in the near term as datasets increase in size.[16]


References

  1. ^ a b c Dudbridge F (March 2013). "Power and predictive accuracy of polygenic risk scores". PLOS Genetics. 9 (3): e1003348. doi:10.1371/journal.pgen.1003348. PMC 3605113. PMID 23555274.
  2. ^ a b Torkamani A, Wineinger NE, Topol EJ (September 2018). "The personal and clinical utility of polygenic risk scores". Nature Reviews. Genetics. 19 (9): 581–590. doi:10.1038/s41576-018-0018-x. PMID 29789686. S2CID 46893131.
  3. ^ Lambert SA, Abraham G, Inouye M (November 2019). "Towards clinical utility of polygenic risk scores". Human Molecular Genetics. 28 (R2): R133–R142. doi:10.1093/hmg/ddz187. PMID 31363735.
  4. ^ a b de Vlaming R, Groenen PJ (2015). "The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics". BioMed Research International. 2015: 143712. doi:10.1155/2015/143712. PMC 4529984. PMID 26273586.
  5. ^ Lewis CM, Vassos E (November 2017). "Prospects for using risk scores in polygenic medicine". Genome Medicine. 9 (1): 96. doi:10.1186/s13073-017-0489-y. PMC 5683372. PMID 29132412.
  6. ^ a b Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. (September 2018). "Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations". Nature Genetics. 50 (9): 1219–1224. doi:10.1038/s41588-018-0183-z. PMC 6128408. PMID 30104762.
  7. ^ Yanes T, Meiser B, Kaur R, Scheepers-Joynt M, McInerny S, Taylor S, et al. (March 2020). "Uptake of polygenic risk information among women at increased risk of breast cancer" (PDF). Clinical Genetics. 97 (3): 492–501. doi:10.1111/cge.13687. PMID 31833054. S2CID 209342044.
  8. ^ a b Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. (October 2015). "Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores". American Journal of Human Genetics. 97 (4): 576–92. doi:10.1016/j.ajhg.2015.09.001. PMC 4596916. PMID 26430803.
  9. ^ a b Spindel JE, McCouch SR (December 2016). "When more is better: how data sharing would accelerate genomic selection of crop plants". The New Phytologist. 212 (4): 814–826. doi:10.1111/nph.14174. PMID 27716975.
  10. ^ Regalado A (8 March 2019). "23andMe thinks polygenic risk scores are ready for the masses, but experts aren't so sure". MIT Technology Review. Retrieved 2020-08-14.
  11. ^ a b c d Lello L, Raben TG, Yong SY, Tellier LC, Hsu SD (2019-10-25). "Genomic Prediction of 16 Complex Disease Risks Including Heart Attack, Diabetes, Breast and Prostate Cancer". Scientific Reports. 9 (1): 15286. doi:10.1038/s41598-019-51258-x. PMID 31653892. Retrieved 2021-04-12.
  12. ^ Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J (July 2017). "10 Years of GWAS Discovery: Biology, Function, and Translation". American Journal of Human Genetics. 101 (1): 5–22. doi:10.1016/j.ajhg.2017.06.005. PMC 5501872. PMID 28686856.
  13. ^ a b Spiliopoulou A, Nagy R, Bermingham ML, Huffman JE, Hayward C, Vitart V, et al. (July 2015). "Genomic prediction of complex human traits: relatedness, trait architecture and predictive meta-models". Human Molecular Genetics. 24 (14): 4167–82. doi:10.1093/hmg/ddv145. PMC 4476450. PMID 25918167.
  14. ^ a b "Modern genetics will improve health and usher in "designer" children". The Economist. 2019-11-09. Retrieved 2021-04-12.
  15. ^ "Test could predict risk of future heart disease for just £40". The Guardian. 2018-10-08. Retrieved 2021-04-12.
  16. ^ a b c d Raben TG, Lello L, Widen E, Hsu SD (2021-01-14). "From Genotype to Phenotype: polygenic prediction of complex human traits". arXiv:2101.05870 [q-bio].
  17. ^ "Big picture genetic scoring approach reliably predicts heart disease". Science Daily. 2019-06-11. Retrieved 2021-04-12.
  18. ^ Weedon MN, McCarthy MI, Hitman G, Walker M, Groves CJ, Zeggini E, et al. (October 2006). "Combining information from common type 2 diabetes risk polymorphisms improves disease prediction". PLOS Medicine. 3 (10): e374. doi:10.1371/journal.pmed.0030374. PMC 1584415. PMID 17020404.
  19. ^ Xie C, Xu S (April 1998). "Efficiency of multistage marker-assisted selection in the improvement of multiple quantitative traits". Heredity. 80 ( Pt 4) (3): 489–98. doi:10.1046/j.1365-2540.1998.00308.x. PMID 9618913.
  20. ^ a b Meuwissen TH, Hayes BJ, Goddard ME (April 2001). "Prediction of total genetic value using genome-wide dense marker maps". Genetics. 157 (4): 1819–29. PMC 1461589. PMID 11290733.
  21. ^ Wray NR, Goddard ME, Visscher PM (October 2007). "Prediction of individual genetic risk to disease from genome-wide association studies". Genome Research. 17 (10): 1520–8. doi:10.1101/gr.6665407. PMC 1987352. PMID 17785532.
  22. ^ Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, Sklar P (August 2009). "Common polygenic variation contributes to risk of schizophrenia and bipolar disorder". Nature. 460 (7256): 748–52. Bibcode:2009Natur.460..748P. doi:10.1038/nature08185. PMC 3912837. PMID 19571811.
  23. ^ James G, Witten D, Hastie T, Tibshirani R (2013). An Introduction to Statistical Learning: with Applications in R. Springer. ISBN 978-1461471370.
  24. ^ a b Haws DC, Rish I, Teyssedre S, He D, Lozano AC, Kambadur P, et al. (2015-10-06). "Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods". PLOS ONE. 10 (10): e0138903. Bibcode:2015PLoSO..1038903H. doi:10.1371/journal.pone.0138903. PMC 4595020. PMID 26439851.
  25. ^ Ware EB, Schmitz LL, Faul J, Gard A, Mitchell C, Smith JA, Zhao W, Weir D, Kardia SL (January 2017). "Heterogeneity in polygenic scores for common human traits". bioRxiv: 106062. doi:10.1101/106062.
  26. ^ Vattikuti S, Lee JJ, Chang CC, Hsu SD, Chow CC (2014). "Applying compressed sensing to genome-wide association studies". GigaScience. 3 (1): 10. doi:10.1186/2047-217X-3-10. PMC 4078394. PMID 25002967.
  27. ^ Gianola D, Rosa GJ (2015). "One hundred years of statistical developments in animal breeding". Annual Review of Animal Biosciences. 3: 19–56. doi:10.1146/annurev-animal-022114-110733. PMID 25387231.
  28. ^ Wünnemann F, Ken Sin L, Langford-Avelar A, Bussell D, Dubé MP, Tardif JC, Lettre G (2019-06-11). "Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians". Circulation: Genomic and Precision Medicine. AHA Journals. Retrieved 2021-04-12.
  29. ^ Mavaddat N, et al. (2019-01-03). "Polygenic risk scores for prediction of breast cancer and breast cancer subtypes". The American Journal of Human Genetics. 104 (1). doi:10.1016/j.ajhg.2018.11.002. PMID 30554720. Retrieved 2021-04-12.
  30. ^ Widen E, Raben TG, Lello L, Hsu SD (2021-04-01). "Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank". medRxiv (preprint). doi:10.1101/2021.04.01.21254711. Retrieved 2021-04-12.
  31. ^ Treff NR, Eccles J, Marin D, Messick E, Lello L, Gerber J, Jia X, Tellier LC (2020-06-12). "Preimplantation Genetic Testing for Polygenic Disease Relative Risk Reduction: Evaluation of Genomic Index Performance in 11,883 Adult Sibling Pairs". Genes. 11 (6): 648. doi:10.3390/genes11060648. PMID 32545548. Retrieved 2021-04-12.
  32. ^ Lello L, Raben TG, Hsu SD (2020-08-06). "Sibling validation of polygenic risk scores and complex trait prediction". Scientific Reports. 10 (1). doi:10.1038/s41598-020-69927-7. PMID 32764582. Retrieved 2021-04-12.
  33. ^ Reid NJ, Brockman DG, Leonard CE, Pelletier R, Khera AV (2021-04-02). "Concordance of a High Polygenic Score Among Relatives". Circulation: Genomic and Precision Medicine. AHA Journals. Retrieved 2021-04-12.
  34. ^ a b Wray NR, Kemper KE, Hayes BJ, Goddard ME, Visscher PM (April 2019). "Complex Trait Prediction from Genome Data: Contrasting EBV in Livestock to PRS in Humans: Genomic Prediction". Genetics. 211 (4): 1131–1141. doi:10.1534/genetics.119.301859. PMC 6456317. PMID 30967442.
  35. ^ Heslot N, Jannink JL, Sorrells ME (January 2015). "Perspectives for Genomic Selection Applications and Research in Plants". Crop Science. 55 (1): 1–12. doi:10.2135/cropsci2014.03.0249. ISSN 0011-183X.
  36. ^ Pace J, Yu X, Lübberstedt T (September 2015). "Genomic prediction of seedling root length in maize (Zea mays L.)". The Plant Journal. 83 (5): 903–12. doi:10.1111/tpj.12937. PMID 26189993.
  37. ^ Sallam AH, Endelman JB, Jannink JL, Smith KP (2015-03-01). "Assessing Genomic Selection Prediction Accuracy in a Dynamic Barley Breeding Population". The Plant Genome. 8 (1): 0. doi:10.3835/plantgenome2014.05.0020. ISSN 1940-3372. PMID 33228279.
  38. ^ Hayr MK, Druet T, Garrick DJ (2016-04-01). "027 Performance of genomic prediction using haplotypes in New Zealand dairy cattle". Journal of Animal Science. 94 (Supplement 2): 13. doi:10.2527/msasas2016-027. ISSN 1525-3163.
  39. ^ Chen L, Vinsky M, Li C (February 2015). "Accuracy of predicting genomic breeding values for carcass merit traits in Angus and Charolais beef cattle". Animal Genetics. 46 (1): 55–9. doi:10.1111/age.12238. PMID 25393962.
  40. ^ Liu T, Qu H, Luo C, Shu D, Wang J, Lund MS, Su G (October 2014). "Accuracy of genomic prediction for growth and carcass traits in Chinese triple-yellow chickens". BMC Genetics. 15 (110): 110. doi:10.1186/s12863-014-0110-y. PMC 4201679. PMID 25316160.
  41. ^ Lewis CM, Vassos E (May 2020). "Polygenic risk scores: from research tools to clinical instruments". Genome Medicine. 12 (1): 44. doi:10.1186/s13073-020-00742-5. PMC 7236300. PMID 32423490.
  42. ^ Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. (July 2019). "Analysis of polygenic risk score usage and performance in diverse human populations". Nature Communications. 10 (1): 3328. Bibcode:2019NatCo..10.3328D. doi:10.1038/s41467-019-11112-0. PMC 6658471. PMID 31346163.
  43. ^ Wang Y, Guo J, Ni G, Yang J, Visscher PM, Yengo L (July 2020). "Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations". Nature Communications. 11 (1): 3865. Bibcode:2020NatCo..11.3865W. doi:10.1038/s41467-020-17719-y. PMC 7395791. PMID 32737319.
  44. ^ a b Li Z, Chen J, Yu H, He L, Xu Y, Zhang D, et al. (November 2017). "Genome-wide association analysis identifies 30 new susceptibility loci for schizophrenia" (PDF). Nature Genetics. 49 (11): 1576–1583. doi:10.1038/ng.3973. PMID 28991256. S2CID 205355668.
  45. ^ Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ (April 2019). "Clinical use of current polygenic risk scores may exacerbate health disparities". Nature Genetics. 51 (4): 584–591. doi:10.1038/s41588-019-0379-x. PMC 6563838. PMID 30926966.
  46. ^ Selzam S, Krapohl E, von Stumm S, O'Reilly PF, Rimfeld K, Kovas Y, et al. (February 2017). "Predicting educational achievement from DNA". Molecular Psychiatry. 22 (2): 267–272. doi:10.1038/mp.2016.107. PMC 5285461. PMID 27431296.
  47. ^ Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, et al. (May 2016). "Genome-wide association study identifies 74 loci associated with educational attainment". Nature. 533 (7604): 539–42. Bibcode:2016Natur.533..539O. doi:10.1038/nature17671. PMC 4883595. PMID 27225129.
  48. ^ Vassos E, Di Forti M, Coleman J, Iyegbe C, Prata D, Euesden J, et al. (March 2017). "An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis". Biological Psychiatry. 81 (6): 470–477. doi:10.1016/j.biopsych.2016.06.028. PMID 27765268.
  49. ^ Lello L, Avery SG, Tellier L, Vazquez AI, de Los Campos G, Hsu SD (October 2018). "Accurate Genomic Prediction of Human Height". Genetics. 210 (2): 477–497. doi:10.1534/genetics.118.301267. PMC 6216598. PMID 30150289.
  50. ^ Savage JE, Jansen PR, Stringer S, Watanabe K, Bryois J, de Leeuw CA, et al. (2018). "Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence". Nature Genetics. 50 (7): 912–19. doi:10.1038/s41588-018-0152-6. PMC 6411041. PMID 29942086.
  51. ^ Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, et al. (2018). "Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals". Nature Genetics. 50 (8): 1112–1121. doi:10.1038/s41588-018-0147-3. PMC 6393768. PMID 30038396.
  52. ^ Birney E. "Why using genetic risk scores on embryos is wrong". ewanbirney.com. Retrieved 2020-12-16.
  53. ^ Karavani E, Zuk O, Zeevi D, Barzilai N, Stefanis NC, Hatzimanolis A, et al. (November 2019). "Screening Human Embryos for Polygenic Traits Has Limited Utility". Cell. 179 (6): 1424–1435.e8. doi:10.1016/j.cell.2019.10.033. PMC 6957074. PMID 31761530.
  54. ^ Lázaro-Muñoz G, Pereira S, Carmi S, Lencz T (October 2020). "Screening embryos for polygenic conditions and traits: ethical considerations for an emerging technology". Genetics in Medicine: 1–3. doi:10.1038/s41436-020-01019-3. PMID 33106616.
  55. ^ "The Polygenic Score (PGS) Catalog". Polygenic Score (PGS) Catalog. Retrieved 29 April 2020. An open database of polygenic scores and the relevant metadata required for accurate application and evaluation
  56. ^ Richardson TG, Harrison S, Hemani G, Davey Smith G (March 2019). "An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome". eLife. 8: e43657. doi:10.7554/eLife.43657. PMC 6400585. PMID 30835202.
  57. ^ Hunter DJ, Drazen JM (2019-06-20). "Has the Genome Granted Our Wish Yet?". The New England Journal of Medicine. 380 (25): 2391–2393. doi:10.1056/NEJMp1904511. PMID 31091368. Retrieved 2021-04-12.
  58. ^ Devlin, Hannah (2019-03-23). "Are genetic tests useful to predict cancer?". The Guardian. Retrieved 2021-04-12.
  59. ^ "MYRIAD MyRisk". MYRIAD. Retrieved 2021-04-12.
  60. ^ "BRCANext-Expanded™". Ambry Genetics. Retrieved 2021-04-12.
  61. ^ Judson, Horace Freeland (1996). The Eighth Day of Creation: Makers of the Revolution in Biology. Plainview, N.Y.: Cold Spring Harbor Laboratory. ISBN 0-87969-478-5.
  62. ^ Hivert V, Sidorenko J, Rohart F, Goddard ME, Yang J, Wray NR, Yengo L, Visscher PM (2020-11-09). "Estimation of non-additive genetic variance in human complex traits from a large sample of unrelated individuals". bioRxiv 10.1101/2020.11.09.375501.

External links