polygenic score
Barton N, Hermisson J & Nordborg M 2019 Why structure matters. eLife 8:e45380.
- the first GWAS for height found a small number of SNPs that jointly explained only a tiny fraction of the variation
- this was in contrast with the high heritability seen in twin studies
- this discrepancy was dubbed ‘the missing heritability problem’
- it was suggested that the problem was simply due to a lack of statistical power to detect polymorphisms of small effect
- larger studies have since identified more and more loci, but most of the variation remains ‘unmappable’
- presumably this is because sample sizes on the order of a million are still not large enough
- one way in which the unmappable component of genetic variation can be included in a statistical measure is via so-called polygenic scores
- these scores sum the estimated contributions to the trait across many SNPs, including those whose effects, on their own, are not statistically significant
- polygenic scores thus represent a shift from the goal of identifying major genes to predicting phenotype from genotype
- when a GWAS is carried out to identify major genes, it is relatively simple to avoid false positives by eliminating associations outside major loci
- if the goal is to make predictions, or to understand differences among populations (such as the latitudinal cline in height), we need accurate and unbiased estimates for all SNPs
- accomplishing this is extremely challenging
- it is also difficult to know whether one has succeeded
- one way to check is to compare the population-based estimates with estimates taken from sibling data, which should be relatively unbiased by environmental differences
- using effect-size estimates from the UK Biobank instead of GIANT, Berg et al. and Sohail et al. independently found that the evidence for selection vanishes – along with the evidence for a genetic cline in height across Europe
- the previously published results were instead due to the cumulative effects of slight biases in the effect-size estimates in the GIANT data
- they also found evidence for confounding in the sibling data used as a control by Robinson et al. and Field et al.
- we still do not know whether genetics and selection are responsible for the pattern of height differences seen across Europe
- there is no perfect way to control for complex population structure and environmental heterogeneity
- biases at individual loci may be tiny
- but summed across thousands of loci, as is done in polygenic scores, these biases become highly significant
- standard methods to control for these biases, such as principal component analysis, may work well in simulations but are often insufficient when confronted with real data
- even the data in the UK Biobank seems to contain significant structure
- quantitative genetics has proved highly successful in plant and animal breeding
- this success has been based on large pedigrees, well-controlled environments, and short-term prediction
- when these methods have been applied to natural populations, even the most basic predictions fail, in large part due to poorly understood environmental factors
- natural populations are never homogeneous
- it is therefore misleading to imply there is a qualitative difference between ‘within-population’ and ‘between-population’ comparisons
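
Two of the points above lend themselves to a small numerical illustration: a polygenic score is a sum of estimated per-SNP contributions, and tiny per-locus biases can add up across thousands of loci. The sketch below is not taken from the paper. Every name and number in it is an assumption chosen for illustration: the function names, 2000 independent SNPs, 200 individuals per population, the size of the frequency shifts, and a bias of 0.001 per allele. It simulates two populations whose allele frequencies differ slightly at each SNP, then scores the same individuals twice, once with the "true" effect sizes and once with estimates that carry a small bias whose sign follows the frequency shift (a crude stand-in for residual stratification).

```python
import random


def polygenic_score(genotype, effect_sizes):
    """Sum the estimated per-SNP contributions for one individual.

    genotype: allele counts (0, 1 or 2) at each SNP
    effect_sizes: estimated effect of one allele copy at each SNP
    """
    return sum(g * b for g, b in zip(genotype, effect_sizes))


def sample_genotype(freqs, rng):
    """Draw allele counts for one individual, assuming independent SNPs."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]


def mean_score(population, effect_sizes):
    """Average polygenic score over a list of genotypes."""
    total = sum(polygenic_score(g, effect_sizes) for g in population)
    return total / len(population)


# All values below are illustrative assumptions, not estimates from real data.
rng = random.Random(0)
N_SNPS, N_IND, BIAS = 2000, 200, 0.001

# Two hypothetical populations with slightly different allele frequencies.
freq_a = [rng.uniform(0.2, 0.8) for _ in range(N_SNPS)]
shift = [rng.gauss(0.0, 0.02) for _ in range(N_SNPS)]
freq_b = [min(max(p + d, 0.01), 0.99) for p, d in zip(freq_a, shift)]

# True effects are drawn independently of the frequency shifts.
true_beta = [rng.gauss(0.0, 0.01) for _ in range(N_SNPS)]

# Biased estimates: a small error whose sign follows the frequency shift.
biased_beta = [b + BIAS * (1 if d > 0 else -1)
               for b, d in zip(true_beta, shift)]

pop_a = [sample_genotype(freq_a, rng) for _ in range(N_IND)]
pop_b = [sample_genotype(freq_b, rng) for _ in range(N_IND)]

gap_true = mean_score(pop_b, true_beta) - mean_score(pop_a, true_beta)
gap_biased = mean_score(pop_b, biased_beta) - mean_score(pop_a, biased_beta)

# Part of the between-population gap that comes from the bias alone.
excess_gap = gap_biased - gap_true
print("gap with true effects:  ", gap_true)
print("gap with biased effects:", gap_biased)
print("gap due to bias alone:  ", excess_gap)
```

In this toy setup the per-locus bias is about a tenth of a typical effect size, yet because its sign is aligned with the frequency differences, its contributions all push the between-population gap in the same direction instead of cancelling. Real confounding is far less tidy than this, so the sketch only shows the mechanism, not its magnitude in any actual dataset.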