A polygenic score, also called a polygenic risk score, genetic risk score, or genome-wide score, is a number computed from an individual's variation at multiple genetic loci, with each locus weighted by its estimated effect (see regression analysis). It serves as the best prediction of the trait that can be made from variation in those genetic variants.
Polygenic scores are widely employed in animal, plant, and behavioral genetics for predicting traits and understanding genetic architectures. In a genome-wide association study (GWAS), a polygenic score that predicts substantially better than the genome-wide statistically significant hits alone indicates that the trait is affected by more variants than just those hits, and that larger sample sizes will yield more hits. A combination of low variance explained and high heritability, as measured by GCTA, twin studies, or other methods, indicates that a trait may be massively polygenic and affected by thousands of variants. Once a polygenic score explains at least a few percent of a phenotype's variance, and can therefore be assumed to capture a substantial fraction of the genetic variants affecting that phenotype, it can be used in several ways: as a lower bound to test whether heritability estimates may be biased; as a measure of genetic overlap between traits (genetic correlation), which might indicate, for example, shared genetic bases for groups of mental disorders; as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to natural selection that are indicative of a soft selective sweep (for example for intelligence, where the changes in frequency would be too small to detect at each individual hit but are detectable in the overall polygenic score); in Mendelian randomization (assuming no pleiotropy with relevant traits); to detect and control for genetic confounds in outcomes (for example, the correlation of schizophrenia with poverty); or to investigate gene–environment interactions and correlations.
Polygenic scores are widely used in animal breeding and plant breeding (usually termed genomic prediction or genomic selection) due to their efficacy in improving livestock breeding and crops. Their use in human studies is increasing.
One of the first precursors to the modern polygenic score was proposed under the term marker-assisted selection (MAS) in 1990. According to MAS, breeders are able to increase the efficiency of artificial selection by estimating the regression coefficients of genetic markers that are correlated with differences in the trait of interest and assigning individual animals a "score" from this information. A major development of these fundamentals was proposed in 2001 by researchers who discovered that the use of a Bayesian prior could help to mitigate the problem of the number of markers being greater than the sample of animals.
These methods were first applied to humans in the late 2000s, starting with a proposal in 2007 that these scores could be used in human genetics to identify individuals at high risk for disease. This was successfully applied in empirical research for the first time in 2009 by researchers who organized a genome-wide association study (GWAS) of schizophrenia to construct scores of risk propensity. This study was also the first to use the term polygenic score for a prediction drawn from a linear combination of single-nucleotide polymorphism (SNP) genotypes, which was able to explain 3% of the variance in schizophrenia.
Years of education was the first human cognitive phenotype to be successfully studied in a GWAS. The most recent study of this phenotype was the largest GWAS yet conducted as of 2018, with polygenic scores constructed from 1.1 million participants able to predict upwards of 10% of the variance in various cognitive traits.
Methods of construction
A polygenic score (PGS) is constructed from the "weights" derived from a genome-wide association study (GWAS). In a GWAS, a set of genetic markers (usually SNPs) is genotyped on a training sample, and effect sizes are estimated for each marker's association with the trait of interest. These weights are then used to assign individualized polygenic scores in an independent replication sample. The estimated score, $\hat{S}$, generally follows the form

$$\hat{S} = \sum_{j=1}^{m} X_j \hat{\beta}_j$$

where the $\hat{S}$ of an individual is equal to the weighted sum of the individual's marker genotypes, $X_j$, at $m$ SNPs. Weights are estimated using some form of regression analysis. Because the number of genomic variants is usually larger than the sample size, one cannot use OLS multiple regression (p > n problem). Researchers have proposed various methodologies that deal with this problem as well as how to generate the weights of the SNPs, $\hat{\beta}_j$, and how to determine which SNPs should be included.
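The weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pipeline: the function name `polygenic_score`, the toy genotype matrix, and the weights are all assumptions made for the example, and genotypes are assumed to be coded as allele counts (0, 1, or 2).

```python
import numpy as np


def polygenic_score(genotypes, weights):
    """Return one score per individual: the weighted sum of allele counts.

    genotypes: (n_individuals, m_snps) array of allele counts (assumed 0/1/2).
    weights:   (m_snps,) array of per-SNP effect estimates from a GWAS.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if genotypes.ndim != 2 or genotypes.shape[1] != weights.shape[0]:
        raise ValueError("exactly one weight is required per SNP column")
    return genotypes @ weights


# Toy data (hypothetical): two individuals genotyped at three SNPs.
X = np.array([[0, 1, 2],
              [2, 0, 1]])
beta_hat = np.array([0.10, -0.05, 0.20])

scores = polygenic_score(X, beta_hat)
# Individual 1: 0*0.10 + 1*(-0.05) + 2*0.20 = 0.35
# Individual 2: 2*0.10 + 0*(-0.05) + 1*0.20 = 0.40
```

In practice the weights would come from a training sample and the genotypes from an independent sample, as the text describes.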
The simplest so-called "naïve" method of construction sets weights equal to the coefficient estimates from a regression of the trait on each genetic variant. The included SNPs may be selected using an algorithm that attempts to ensure that each marker is approximately independent. Failing to account for non-random association of genetic variants will typically reduce the score's predictive accuracy. This is important because genetic variants are often correlated with other nearby variants, such that the weight of a causal variant will be attenuated if it is more strongly correlated with its neighbors than a null variant. This is called linkage disequilibrium, a common phenomenon that arises from the shared evolutionary history of neighboring genetic variants. Further restriction can be achieved by testing multiple sets of SNPs selected at various thresholds, such as all SNPs that are genome-wide statistically significant hits, all SNPs with p < 0.05, or all SNPs with p < 0.50, and using the set with the greatest predictive performance for further analysis. Especially for highly polygenic traits, the best polygenic score will tend to use most or all SNPs.
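The selection step described above, often called pruning (or clumping) and thresholding, can be sketched as a greedy filter. This is a simplified illustration under stated assumptions: the function name `clump_and_threshold`, the default cutoffs, and the toy data are hypothetical; linkage disequilibrium is approximated by the squared Pearson correlation between genotype columns; every SNP is assumed to vary in the sample; and real tools also restrict comparisons to SNPs within a physical window, which is omitted here.

```python
import numpy as np


def clump_and_threshold(genotypes, p_values, p_threshold=0.05, r2_max=0.2):
    """Greedily keep the most significant SNPs that are roughly independent.

    A SNP is kept if its p-value is below p_threshold and its squared
    correlation with every SNP already kept is below r2_max.
    Returns the sorted column indices of the retained SNPs.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    # Pairwise squared correlation between SNP columns (a proxy for LD).
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    kept = []
    for j in np.argsort(p_values):  # most significant first
        if p_values[j] >= p_threshold:
            break  # every remaining SNP also fails the threshold
        if all(r2[j, k] < r2_max for k in kept):
            kept.append(int(j))
    return sorted(kept)


# Toy data (hypothetical): SNP 1 is a perfect copy of SNP 0,
# SNP 2 is uncorrelated with both, and SNP 3 has a large p-value.
X = np.array([[0, 0, 1, 0],
              [1, 1, 0, 2],
              [2, 2, 1, 1],
              [0, 0, 1, 2],
              [1, 1, 2, 0],
              [2, 2, 1, 1]])
p = np.array([0.001, 0.0005, 0.03, 0.6])

selected = clump_and_threshold(X, p)
# SNP 1 is kept, SNP 0 is dropped as redundant, SNP 2 is kept,
# and SNP 3 fails the p-value threshold, so selected == [1, 2].
```

Repeating the call over several values of `p_threshold` and comparing predictive performance in a validation sample corresponds to the multiple-threshold search the text mentions.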
Bayesian approaches, originally pioneered in concept in 2001, attempt to explicitly model preexisting genetic architecture, thereby accounting for the distribution of effect sizes with a prior that should improve the accuracy of a polygenic score. One of the most popular modern Bayesian methods uses "linkage disequilibrium prediction" (LDpred for short) to set the weight for each SNP equal to the mean of its posterior distribution after linkage disequilibrium has been accounted for. LDpred tends to outperform simpler methods of pruning and thresholding, especially at large sample sizes; for example, it increased the variance explained by a polygenic score for schizophrenia in a large data set from 20.1% to 25.3%.
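To give a feel for what a posterior-mean weight looks like, the sketch below covers only the simplest special case: a Gaussian (infinitesimal) prior in which every SNP has a small effect, and no linkage disequilibrium between SNPs. Under those assumptions, and with standardized genotypes and phenotype, the posterior mean is the marginal estimate multiplied by a constant shrinkage factor. This is not the LDpred algorithm itself, which additionally models LD and non-infinitesimal priors; the function name and the example numbers are hypothetical.

```python
import numpy as np


def infinitesimal_posterior_mean(beta_marginal, h2, n_samples):
    """Shrink marginal effect estimates toward zero under a Gaussian prior.

    Assumptions (simplifying, labeled as such): standardized genotypes and
    phenotype, independent SNPs (no LD), prior beta_j ~ N(0, h2 / m), and
    sampling noise of variance 1 / n_samples on each marginal estimate.
    """
    beta = np.asarray(beta_marginal, dtype=float)
    m = beta.size
    # Ratio of prior variance to total variance: (h2/m) / (h2/m + 1/n).
    shrink = h2 / (h2 + m / n_samples)
    return shrink * beta


# Toy numbers (hypothetical): 4 SNPs, heritability 0.5, 8 individuals.
beta_hat = np.array([0.4, -0.2, 0.1, 0.0])
posterior = infinitesimal_posterior_mean(beta_hat, h2=0.5, n_samples=8)
# shrink = 0.5 / (0.5 + 4/8) = 0.5, so every estimate is halved.
```

The shrinkage factor approaches 1 as the training sample grows relative to the number of SNPs, which is one way to see why larger samples make the raw estimates more trustworthy.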
Penalized regression methods, such as LASSO and ridge regression, can also be used to improve the accuracy of polygenic scores. Penalized regression can be interpreted as placing informative prior probabilities on how many genetic variants are expected to affect a trait, and the distribution of their effect sizes. In other words, these methods in effect "penalize" the large coefficients in a regression model and shrink them conservatively. Ridge regression accomplishes this by shrinking the prediction with a term that penalizes the sum of the squared coefficients. LASSO accomplishes something similar by penalizing the sum of absolute coefficients. Bayesian counterparts exist for LASSO and ridge regression, and other priors have been suggested and used. They can perform better in some circumstances. A multi-dataset, multi-method study found that of 15 different methods compared across four datasets, minimum redundancy maximum relevance was the best performing method. Furthermore, variable selection methods tended to outperform other methods. Variable selection methods do not use all the available genomic variants present in a dataset, but attempt to select an optimal subset of variants to use. This leads to less overfitting but more bias (see bias-variance tradeoff).
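The two penalties can be illustrated with a short NumPy sketch. Ridge regression has a closed-form solution that remains well defined even when there are more SNPs than individuals. For LASSO, the sketch shows only the soft-thresholding rule that solves the single-coefficient problem of minimizing ½(β − b)² + λ|β|; a full LASSO fit over correlated SNPs needs an iterative solver, which is omitted. Function names, penalty values, and the simulated data are assumptions for illustration.

```python
import numpy as np


def ridge_weights(X, y, penalty):
    """Closed-form ridge estimate: solve (X'X + penalty * I) w = X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(p), X.T @ y)


def soft_threshold(b, penalty):
    """LASSO solution for a single coefficient with unpenalized estimate b.

    Shrinks b toward zero by `penalty` and sets it exactly to zero when
    |b| <= penalty, which is how LASSO performs variable selection.
    """
    b = np.asarray(b, dtype=float)
    return np.sign(b) * np.maximum(np.abs(b) - penalty, 0.0)


# Simulated toy data (hypothetical): 5 individuals, 20 SNP-like predictors,
# so X'X is singular and ordinary least squares has no unique solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

w_small = ridge_weights(X, y, penalty=0.1)
w_large = ridge_weights(X, y, penalty=10.0)
# A larger penalty shrinks the coefficient vector more strongly.

lasso_like = soft_threshold([0.3, -0.1, 0.05], penalty=0.1)
# The first coefficient shrinks to about 0.2; the other two become zero.
```

The contrast is visible in the outputs: ridge shrinks every weight but leaves them nonzero, while soft-thresholding removes small weights entirely.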
The benefit of polygenic scores is that they can be used for prediction in crops, livestock, and humans alike. Although the same basic concepts underlie these areas of prediction, they face different challenges that require different methodologies. The ability to produce very large families in nonhuman species, accompanied by deliberate selection, leads to a smaller effective population size, higher degrees of linkage disequilibrium among individuals, and a higher average genetic relatedness among individuals within a population. For example, members of plant and animal breeds that humans have effectively created, such as modern maize or domestic cattle, are all technically "related". In human genomic prediction, by contrast, unrelated individuals in large populations are selected to estimate the effects of common SNPs. Because of the smaller effective population size in livestock, the mean coefficient of relationship between any two individuals is likely high, and common SNPs will tag causal variants at greater physical distance than for humans; this is the major reason for lower SNP-based heritability estimates for humans compared to livestock. In both cases, however, sample size is key for maximizing the accuracy of genomic prediction.
While modern genomic prediction scoring in humans is generally referred to as a "polygenic score" (PGS) or a "polygenic risk score" (PRS), in livestock the more common term is "genomic estimated breeding value", or GEBV (similar to the more familiar "EBV", but with genotypic data). Conceptually, a GEBV is the same as a PGS: a linear function of genetic variants that are each weighted by the apparent effect of the variant. Despite this, polygenic prediction in livestock is useful for a fundamentally different reason than for humans. In humans, a PRS is used for the prediction of individual phenotype, while in livestock a GEBV is typically used to predict the offspring’s average value of a phenotype of interest in terms of the genetic material it inherited from a parent. In this way, a GEBV can be understood as the average of the offspring of an individual or pair of individual animals. GEBVs are also typically communicated in the units of the trait of interest. For example, the expected increase in milk production of the offspring of a specific parent compared to the offspring from a reference population might be a typical way of using a GEBV in dairy cow breeding and selection.
Some accuracy values are given in the sections below for comparison purposes. These are given in terms of correlations and have been converted from explained variance if given in that format in the source.
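The conversion mentioned above is a square root: a score that explains a proportion R² of the variance corresponds to a correlation of r = √R² between predicted and observed values. A minimal sketch follows, with a hypothetical function name and the assumption that the score is positively correlated with the trait.

```python
import math


def variance_explained_to_correlation(r_squared):
    """Convert a proportion of variance explained (0 to 1) to a correlation.

    Assumes the prediction is positively correlated with the outcome,
    so the non-negative square root is returned.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("variance explained must lie between 0 and 1")
    return math.sqrt(r_squared)


# 3% of variance explained corresponds to r of roughly 0.17,
# and 10% corresponds to r of roughly 0.32.
r_three_percent = variance_explained_to_correlation(0.03)
r_ten_percent = variance_explained_to_correlation(0.10)
```

Correlations therefore look larger than the corresponding variance-explained figures, which is worth keeping in mind when comparing values reported in different formats.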
The predictive value of polygenic scoring has large practical benefits for plant and animal breeding because it increases the selection precision and allows for shorter generations, both of which speed up evolution. Genomic prediction with some version of polygenic scoring has been used in experiments on maize, small grains such as barley, wheat, oats and rye, and rice biparental families. In many cases, these predictions have been so successful that researchers have advocated for their use in combating global population growth and climate change.
- In 2015, r ≈ 0.55 for total root length in maize.
- In 2014, r ≈ 0.03 to 0.99 across four traits in barley.
In non-human animals
- In 2016, r ≈ 0.30 for variation in milk fat percentage in three breeds of New Zealand dairy cattle.
- In 2014, r ≈ 0.18 to 0.46 for various measures of meat yield, carcass weight, and fat marbling in two breeds of beef cattle.
- In 2014, r ≈ 0.45 to 0.54 for growth traits in Chinese triple-yellow broiler chickens.
In humans
- In 2016, r ≈ 0.30 for educational attainment variation at age 16. This polygenic score was based on a GWAS using data from 293,000 persons.
- In 2016, r ≈ 0.31 for case/control status for first-episode psychosis.