polygenic adaptation

Thornton KR 2019 Polygenic adaptation to an environmental shift: temporal dynamics of variation under Gaussian stabilizing selection and additive effects on a single trait. Genetics 213:1513-1530.

  • detectable "hitchhiking" patterns are only apparent if
  • (i) the optimum shifts are large with respect to equilibrium variation for the trait
  • (ii) mutation rates to large-effect mutations are low
  • (iii) large-effect mutations rapidly increase in frequency and eventually reach fixation, which typically occurs after the population reaches the new optimum
  • partial sweeps do not appreciably affect patterns of linked variation, even when the mutations are strongly selected
  • populations reach the new optimum prior to the completion of any sweeps
  • the times to fixation are longer for this model than for standard models of directional selection
  • the model of Gaussian stabilizing selection around an optimal trait value differs from the standard model in that mutations affect fitness indirectly via their effects on trait values
  • for the additive model of gene action considered here, and considering a single segregating mutation affecting the trait, the mode of selection is under- or overdominant in a frequency-dependent manner (Robertson 1956; Kimura 1981)
  • this model has been extended to multiple mutations in linkage equilibrium by several authors (Barton 1986; de Vladar and Barton 2014; Jain and Stephan 2015, 2017b)
  • the equilibrium conditions of models of Gaussian stabilizing selection on traits have been studied extensively
  • the dynamics are quite complicated, with many possible equilibria existing for the case of no linkage disequilibrium
  • recent theoretical work has attempted to clarify when sweeps should happen and when adaptation should proceed primarily via subtle allele frequency shifts
  • after the directional phase, selection becomes disruptive, and mutations affecting fitness are fixed or lost to reduce the genetic load of the population
  • the work described above identifies the conditions where sweeps are expected
  • we do not have a picture of the dynamics of linked selection during adaptation to an optimum shift
  • the difficulty of analyzing models of continuous phenotypes with partial linkage among sites has been an impediment to a theoretical description of the process
  • Höllinger et al. (2019) were able to accommodate partial linkage by simplifying how mutations affect phenotype and focusing on the dynamics up until a particular mean trait value was first reached
  • in their simplest model, an individual is either mutant or nonmutant
  • there are only two phenotypes possible
  • I describe the physical distances over which hitchhiking during polygenic adaptation leaves detectable signatures
  • the key conceptual difference is that the model of adaptation is changed from constant directional selection to the sudden optimum shift models involving a continuous trait considered in de Vladar and Barton (2014) and Jain and Stephan (2015, 2017b)
  • I modeled a single trait under real stabilizing selection (Johnson and Barton 2005)
  • mutations affecting trait values arise at rate μ per haploid genome per generation according to an infinitely many sites scheme (Kimura 1969)
  • I evolved populations of size N = 5,000 diploids
  • mutations affecting trait values occur uniformly (at rate μ) in a continuous genomic interval in which recombination breakpoints arise according to a uniform Poisson process with a mean of 0.5 recombination breakpoints per diploid
  • the mutation rates used were 2.5 × 10⁻⁴, 10⁻³, and 5 × 10⁻³
  • with N = 5,000, these mutation rates correspond to Θ = 4Nμ values of 5, 20, and 100, for which the expectations are complete sweeps, mixes of partial and complete sweeps, and adaptation primarily via subtle allele frequency shifts, respectively, as the population approaches the new optimum
  • at mutation-selection equilibrium, these parameters result in an equilibrium genetic variance given by the "House of Cards" approximation, which is ≈ 4μ for the definition of mutation rate and the VS used here, and ignoring the contribution of genetic drift
  • expected genetic variance is therefore small
  • new mutations are more likely to have large effects relative to standing variation
  • I simulated all traits with VS = 1 and did not explicitly model random effects on trait values
  • the evolutionary dynamics would be unaffected because the contribution of the environmental variance to VS would be small
  • these simulations may be viewed as similar to the numerical calculations in de Vladar and Barton (2014) and Jain and Stephan (2017b), but with loose linkage between selected variants
  • previous studies assumed linkage equilibrium
  • I allowed for new mutation after the optimum shift
  • they differ from the approach of Höllinger et al. (2019) in that I simulated continuous traits and did not stop evolution once a specific mean fitness was first reached
  • the mean trait value z̄ typically reached the new optimum zo before the first fixation had occurred
  • mutations with large effects on trait value fix first, as predicted by Robertson (1956)
  • fixations of large effect typically have origin times close to zero
  • large-effect mutations only exist for a relatively brief period after the optimum shift, after which most segregating variants reaching appreciable derived allele frequencies have relatively small effects
  • for a short time following the optimum shift, several intermediate-frequency mutations with large effects on trait values may be segregating
  • many of these variants are adaptive (γ > 0) but will only make short-term contributions to adaptation prior to their loss
  • the dynamics of these mutations recapitulate results from de Vladar and Barton (2014)
  • due to epistatic effects on fitness, some mutations that are initially beneficial later become deleterious and are removed
  • fixation times are rather long, on the order of N generations even for mutations with large γ
  • the numbers of sweeps from new mutations and from standing variants are similar
  • fixations of smaller-effect standing variants are more common in simulations with higher μ
  • large-effect standing variants that fixed after the optimum shift were rare at the time of the shift
  • small-effect mutations were also typically rare at mutation-selection balance
  • fixations from variants that are common at the time of the optimum shift have small effects on trait values
  • the fixation of such mutations is unlikely to generate the patterns of haplotype diversity associated with "soft sweeps"
  • such patterns require strong selection on mutations at intermediate frequencies
  • as the mutation rate increases, the genetic background of these fixing variants becomes more polygenic
  • the initial rate of frequency change of the fixing variant lessens because other mutations are involved in the response to the optimum shift, some of which may contribute to adaptation but not fix in the long term
  • for all replicates, the fixations are at different loci (separated by ≥ 50 cM) with one exception
  • the partial sweeps occurring at intermediate mutation rates (middle column of Figure 6) are not associated with strong signals of hitchhiking, at least when the sample size is relatively small
  • the time when a given statistic shows its maximum departure from equilibrium values differs for each statistic
  • the maximum departure may occur ≈ 100 generations after the time to adaptation
  • Figure 6 and Figure 7 suggest that patterns of strong hitchhiking are more likely at loci where large-effect mutations fix
  • such mutations must arise on average before the mean time to adaptation
  • patterns of variation due to strong sweeps from standing variation overlap considerably with those of older sweeps from new mutations
  • the conditions for a selective sweep are consistent with predictions made using theoretical results from Jain and Stephan (2017b) and Höllinger et al. (2019)
  • the simulations presented here are comparable to the "most effects are large" case from Jain and Stephan (2017b)
  • the trait variance increases during adaptation [also see de Vladar and Barton (2014)] due to large-effect mutations moving from low to intermediate frequency
  • mutations with large effects on trait values at the time of the optimum shift are most likely to rise in frequency
  • mutations that eventually fix are not necessarily those with the largest effect size
  • when several large-effect mutations cosegregate, those with the highest initial frequencies tend to reach fixation
  • faster sweeps are more likely at lower mutation rates
  • regimes where the genetic variance decreases during adaptation are not possible for any of the simulations presented here
  • when considering the pattern of hitchhiking at a locus, the presence or absence of a large-effect fixation at a locus is a reliable predictor of the magnitude of hitchhiking patterns
  • such fixations are more common when the mutation rate is smaller
  • thus strong departures from equilibrium patterns of variation are not expected for more polygenic traits
  • for the optimum shift model considered here, the strength of selection is not constant over time
  • genotypes containing variants that were initially strongly favored by selection are subject to much weaker selection by the time the population has reached the new optimum
  • this weakening of selection increases fixation times to the order of the population size
  • the partial linkage among sites in this work leads to some negative linkage disequilibrium (Figure S18), which is a signal of interference
  • this interference has little effect on the mean time to adaptation, but fixation times are increased
  • once the population is close to the new optimum, selection on individual genotypes is much weaker (Figure 5), setting up the conditions for interference to affect fixation times
  • the stabilizing selection around the initial optimum keeps large-effect mutations rare
  • sweeps from such standing variants start at low frequencies
  • it is not possible to tune the model parameters to obtain sweeps from large-effect, but common, variants with high probability
  • it is tempting to invoke a need for pleiotropic effects in order to have large-effect mutations segregating at intermediate frequencies at the time of the optimum shift
  • I also allowed for partial linkage among sites, which is a key difference from the work based on the Barton (1986) framework, which assumes free recombination
  • partial linkage affects the long-term dynamics of selected mutations
  • the only test statistic based on patterns of SNP variation for detecting polygenic adaptation that I am aware of is the singleton density score (Field et al. 2016)
  • I have not explored this statistic here
  • it would be more fruitful to do so using simulations of much larger genomic regions applying tree sequence recording (Kelleher et al. 2018)
  • a more thorough understanding of the dynamics of linked selection during polygenic adaptation will require investigation of models with pleiotropic effects
  • the role that large-effect mutations may play in a pleiotropic model remains an unresolved question
  • acknowledging the focus on the standard additive model, the current work is best viewed as an investigation of a central concern in molecular population genetics (the effect of natural selection on linked neutral variation) having replaced the standard model of that subdiscipline with the standard model of evolutionary quantitative genetics
  • there are considerable theoretical and empirical challenges remaining in the understanding of the genetics of rapid adaptation
  • for models of phenotypic adaptation, our standard "tests of selection" are likely to fail, being highly underpowered even when the assumptions of the phenotype model are close to those of the standard model
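The frequency-dependent under/overdominance noted above (Robertson 1956; Kimura 1981) can be sketched with a deterministic one-locus recursion under Gaussian stabilizing selection after an optimum shift. All parameter values below are illustrative choices, not Thornton's simulation settings:

```python
import math

# One additive locus under Gaussian stabilizing selection,
# w(z) = exp(-(z - zo)^2 / (2 VS)); genotype trait values bb = 0, Bb = a, BB = 2a.
# Illustrative parameters (assumed, not from the paper):
VS = 1.0   # strength of stabilizing selection
a = 0.5    # additive effect of the derived allele B
zo = 0.6   # new optimum after the shift (old optimum: 0)

w = {g: math.exp(-((g * a - zo) ** 2) / (2 * VS)) for g in (0, 1, 2)}

p = 0.01  # starting frequency of B
for _ in range(5000):
    wB = p * w[2] + (1 - p) * w[1]   # marginal fitness of allele B
    wb = p * w[1] + (1 - p) * w[0]   # marginal fitness of allele b
    p = p * wB / (p * wB + (1 - p) * wb)

# Because zo lies between a and 2a, the heterozygote sits closest to the optimum:
# selection is effectively overdominant and B stalls at the standard interior
# equilibrium instead of sweeping to fixation.
p_hat = (w[1] - w[0]) / ((w[1] - w[0]) + (w[1] - w[2]))
print(round(p, 3), round(p_hat, 3))
```

With an optimum beyond 2a the same recursion gives a directional (but decelerating) sweep, which is the regime where fixation times stretch toward order N.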

omnigenicity

Wray NR, Wijmenga C, Sullivan PF, Yang J & Visscher PM 2018 Common disease is more complex than implied by the core gene omnigenic model. Cell 173:1573-1580.

  • Boyle et al. (2017b) [...] introduce the term "omnigenic," (omni = "all") in acknowledgment of the very large number of genetic loci contributing to disease risk
  • a key feature of the omnigenic model is the classification of genes as "peripheral" (which are generally regulatory in cellular networks and contribute to risk for many diseases and therefore to pleiotropy) or "core" (which are more disease specific with biologically interpretable roles)
  • a defining feature of the omnigenic hypothesis is that only a modest number of genes or pathways have specific roles in the etiology of a specific disease
  • these core genes, if mutated or deleted, have the strongest functional effects
  • the key point of distinction of the omnigenic hypothesis is the emphasis on the importance of core genes
  • types of genes detected in rare variant studies—which can detect highly deleterious variants with large effect sizes—play more direct roles in complex disease than do genes identified from GWASs based on common variants
  • a consequence of the model is to focus experimental designs on discovery of rare variants
  • this conclusion implies a simpler gene-disease biology than we have empirical evidence for
  • one conclusion from sequencing genomes from healthy individuals was the high level of redundancy/robustness in the human genome
  • most apparently normal humans have ~100 loss-of-function mutations (MacArthur et al., 2012)
  • the core/peripheral properties align closely with those of older conceptualizations considering the relative importance of rare/common variants (Pritchard, 2001; Pritchard and Cox, 2002)
  • the omnigenic model is partly a reframing of older ideas while trying to accommodate the empirical evidence that confirms polygenicity and a role of risk variants from across the allelic spectrum
  • a closer look at the definition of the core gene is warranted
  • for common disorders, the largest WES studies conducted to date have not been sufficiently powered to detect the effect sizes that exist in nature
  • for type 2 diabetes, the conclusion from analysis of WES (7,380 cases) was "large-scale sequencing does not support the idea that lower-frequency coding variants have a major role in predisposition"
  • understanding the consequences of polygenicity for individuals also links into an understanding of epistasis, the interacting effects of risk loci
  • expectations for the role of epistasis in complex genetic disease are confusing and confused
  • molecular biology studies provide unequivocal evidence that gene-gene interactions are common and impart a strong desire to undertake studies to detect epistatic associations
  • quantitative genetic theory suggests that contributions from non-additive effects to phenotypic variation in the population and differences between people are small
  • these differing viewpoints are accommodated under a polygenic genetic architecture
  • the only way to reconcile disease that impacts only a small fraction of the population with a genetic architecture of many risk loci is to have a highly non-linear relationship between probability of disease and burden of risk alleles (Slatkin, 2008)
  • the statistical genetics community prefers to say that complex disease is underpinned by genetic effects working additively in liability to risk
  • for a molecular geneticist studying samples from diseased and healthy individuals, interactions between genetic effects are indeed implied, but on a scale that is challenging to study, both in terms of number of contributing genes and uniqueness of individuals
  • whether the goal is discovery of rare variants or common variants, sample sizes are a key limiting factor for furthering our understanding of polygenic diseases
  • increasing sample size remains a research priority
  • the Boyle et al. omnigenic core gene Perspective has been widely interpreted as a call for a research focus on cell-specific gene regulatory networks
  • to assume that a limited number of core genes are key to our understanding of common disease may underestimate the true biological complexity, which is better represented by systems genetics and network approaches
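The non-linear relationship between risk-allele burden and disease probability (Slatkin, 2008) can be made concrete with a liability-threshold sketch: effects add on a latent liability scale, yet observed risk is a steep function of burden. The numbers below (heritability, prevalence) are illustrative assumptions, not estimates from Wray et al. or any study:

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Liability = G + E, with Var(G) = h2 and Var(E) = 1 - h2 (total variance 1);
# disease occurs when liability exceeds a threshold set for ~1% prevalence.
h2 = 0.5           # assumed liability-scale heritability
threshold = 2.326  # ~99th percentile of a standard normal

def disease_prob(g):
    """P(disease | genetic liability g), integrating over E ~ N(0, 1 - h2)."""
    return 1.0 - normal_cdf((threshold - g) / math.sqrt(1.0 - h2))

sd_g = math.sqrt(h2)
risk_avg = disease_prob(0.0)          # average burden of risk alleles
risk_high = disease_prob(2.0 * sd_g)  # burden 2 SD above the mean
print(risk_avg, risk_high, risk_high / risk_avg)
```

A two-SD shift in additive genetic burden multiplies risk by two orders of magnitude here, which is the sense in which additivity on the liability scale implies strong interaction on the disease scale.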

omnigenicity

Boyle EA, Li YI & Pritchard JK 2017 The omnigenic model: response from the authors. J Psychiat Brain Sci 2:S8.

  • one of our goals in writing this paper was to highlight an apparent paradox in human genetics
  • most of the heritability for a typical complex trait is driven by genetic variation at loci that seem unrelated to the trait in question
  • the lack of a clear explanation for this seeming paradox is a major conceptual gap in modern human genetics
  • prior to the GWAS era, many researchers conceptualized complex traits in a very similar paradigm
  • they expected that complex traits would be driven by variants in multiple genes, each with proportionally smaller effect sizes
  • there was a clear expectation that if those genes could be found, they would lead directly to disease-relevant biology
  • typical complex traits are hugely polygenic, such that
  • (1) the largest-effect variants confer only modest risk and, together, explain only a small fraction of the heritability
  • (2) huge numbers of variants make non-negligible contributions to heritability
  • (3) the signal is spread surprisingly broadly across the genome
  • for example, most 100kb windows contain variants that measurably affect height
  • (4) there is only weak enrichment of heritability in genes with putatively relevant gene functions
  • (5) while signals are strongly enriched in chromatin that is active in relevant cell types, there is little difference between the enrichment of cell type-specific chromatin vs. generically active chromatin such as constitutive promoters of housekeeping genes
  • since around 2006, our shared understanding of the architecture of complex traits has been completely transformed
  • essentially any gene with regulatory variants in at least one tissue that contributes to disease pathogenesis is likely to have nontrivial effects on risk for that disease
  • since core genes are hugely outnumbered by peripheral genes, a large fraction of the total genetic contribution to disease comes from peripheral genes that do not play direct roles in disease
  • "polygenic" means different things to different people
  • "gene regulatory networks are sufficiently interconnected such that all genes expressed in disease-relevant cells are liable to affect the functions of core disease-related genes"
  • we tried to leave the definition of core genes open
  • it seems unlikely to us that a single definition can cover all cases in all complex traits
  • if a precise definition is needed, we suggested that core genes may be defined as the (minimal) set of genes such that "conditional on the genotype and expression levels of all core genes, the genotypes and expression levels of peripheral genes no longer matter"
  • in the near future, the combination of GWAS data and expression data in large case-control samples will enable tests to distinguish core and peripheral genes by this definition

missing heritability

Liu X, Li YI & Pritchard JK 2019 Trans effects on gene expression can drive omnigenic inheritance. Cell 177:1022-1034.

  • we provide a formal model in which genetic contributions to complex traits are partitioned into direct effects from core genes and indirect effects from peripheral genes acting in trans
  • if the core genes for a trait tend to be co-regulated, then the effects of peripheral variation can be amplified
  • nearly all of the genetic variance is driven by weak trans effects
  • most of the missing heritability is due to large numbers of small-effect common variants that are not significant at current sample sizes
  • between 71% and 100% of 1 megabase (Mb) windows in the genome are estimated to contribute to the heritability of schizophrenia
  • much of the trait variance is mediated through genes that are not directly involved in the trait in question
  • these observations appear at odds with conventional ways of understanding the links from genotype to phenotype
  • much of the progress in classical genetics has come from detailed molecular work to dissect the biological mechanisms of individual mutations
  • why does such a large portion of the genome contribute to heritability?
  • why do the lead hits for a typical trait contribute so little to heritability?
  • what factors determine the effect sizes of SNPs on traits?
  • it is essential for the field to develop conceptual models for understanding complex trait architecture
  • the model proposed here is a step in that direction
  • genes not expressed in relevant cell types do not contribute significantly to heritability
  • the per-SNP heritability in tissue-specific regulatory elements is only modestly increased relative to SNPs in broadly active regulatory elements, provided that they are active in relevant tissues
  • the heritability of a typical complex trait is driven by variation in a large number of regulatory elements and genes, spread widely across the genome, and mediated through a wide range of gene functional categories
  • rare variants are generally not major contributors to the overall phenotypic variance
  • protein-coding variants are relatively rare in the genome and thus contribute only a small fraction of heritability
  • the heritability is generally dominated by noncoding variants, especially variants in gene regulatory regions
  • there is strong enrichment of both cis- and trans-eQTLs among GWAS hits, albeit still a considerable gap in linking all hits to eQTLs
  • some genes (and their regulatory networks) are functionally proximate to disease risk
  • these genes tend to produce the biggest signals in common- and rare-variant association studies
  • they tend to be the most illuminating from the point of view of understanding disease etiology
  • they are responsible for only a small fraction of the genetic variance in disease risk
  • the bulk of the heritability is mediated through genes that have a wide variety of functions, many of which have no obvious functional connection to disease
  • most of the GWAS hits are in noncoding, putatively regulatory regions of the genome
  • the primary links between genetic variation and complex disease are via gene regulation
  • the omnigenic model partitions genes into core genes and peripheral genes
  • core genes can affect disease risk directly
  • peripheral genes can only affect risk indirectly through trans-regulatory effects on core genes
  • two key proposals of the omnigenic model are
  • (1) that most, if not all, genes expressed in trans-relevant cells have the potential to affect core-gene regulation
  • (2) that for typical traits, nearly all of the heritability is determined by variation near peripheral genes
  • core genes are the key drivers of disease
  • it is the cumulative effects of many peripheral gene variants that determine polygenic risk
  • "omnigenic" has a more precise meaning than the term "polygenic"
  • polygenic can be used to describe the involvement of anything from tens of loci to every variant in the genome and would include omnigenic as a special case, toward the high end of the polygenic spectrum
  • we also use the term "omnigenic model" to refer to our specific model of complex trait architecture in which heritability is mainly driven by peripheral genes that trans-regulate core genes
  • it is also worth distinguishing our model from Fisher's classic infinitesimal model
  • it does not tell us how many causal variants to expect in practice nor about the molecular mechanisms linking genetic variation to phenotypes
  • we define a gene as a "core gene" if and only if the gene product (protein, or RNA for a noncoding gene) has a direct effect—not mediated through regulation of another gene—on cellular and organismal processes leading to a change in the expected value of a particular phenotype
  • all other genes expressed in relevant cell types are considered "peripheral genes" and can only affect the phenotype indirectly through regulatory effects on core genes
  • genes that are "unexpressed" in trait-relevant tissues are assumed not to contribute to heritability
  • most peripheral genes make small contributions to heritability
  • some peripheral genes, such as transcription factors and protein regulators, play important roles because they regulate multiple core genes
  • Equation 3 illustrates the key factors determining how cis- and trans-eQTL effects on core genes impact complex trait heritability
  • the first two groups of terms on the right-hand side of this expression depend on the relative importance of cis and trans effects in determining expression heritability of core genes
  • the third group of terms depends on genetic covariances between pairs of core genes
  • these genetic covariances must arise from trans effects
  • there are many more pairs of core genes (nearly M²) than core genes (M)
  • these terms may dominate the heritability for most traits
  • most studies are hugely underpowered to detect trans-eQTLs
  • estimates of trans heritability must rely on statistical methods that aggregate weak signals
  • the literature is reassuringly consistent across a range of study designs, indicating that around 60%–90% of genetic variance in expression is due to trans-acting variation
  • trans-eQTLs are notoriously difficult to find in humans
  • this is partly due to the extra multiple testing burden on trans-eQTLs but is mainly due to the small effect sizes of trans-eQTLs
  • trans effects are uniformly small compared to cis effects, with only a handful reaching significance
  • typical genes must have very large numbers of weak trans-eQTLs
  • this model starts to explain why so much of the genome contributes heritability for typical traits
  • suppose instead that a considerable fraction of core genes are either co-regulated with shared directions of effects or negatively co-regulated with opposite directions of effects
  • the sum of covariance terms can dominate the genetic variance for trait Y
  • because covariances are primarily driven by trans effects, co-regulated networks could potentially act as strong amplifiers for trans-acting variants that are shared among core genes in those networks
  • there has been little work so far on measuring the genetic basis of gene expression correlations
  • the work to date shows that expression covariance is substantially driven by genetic factors
  • Goldinger et al. (2013) studied heritability of principal components (PCs) in a dataset of whole-blood gene expression from 335 individuals
  • they reported a strong genetic component in the lead PCs, with an average heritability of 0.39 for the first 50 PCs
  • Lukowski et al. (2017) tested for genetic covariance between gene pairs and identified 15,000 gene pairs (0.5% of all gene pairs) with significantly nonzero genetic covariance at 5% false discovery rate
  • for the 10% of gene pairs with the highest phenotypic correlation, the average genetic correlation is 0.12
  • this magnitude is potentially large enough to make an important contribution to heritability
  • if core genes are often co-regulated, with shared directions of effects, as seems likely, then nearly all heritability would be due to trans effects
  • cis-eQTLs usually have much larger effect sizes than trans-eQTLs
  • many of the biggest signals in GWASs are cis regulators of core genes
  • peripheral gene-regulatory variants may become notable hits if they are trans-eQTLs for many core genes with correlated directions of effect
  • the bulk of trait heritability is driven by a huge number of peripheral variants that are weak trans-eQTLs for core genes
  • the 57 genome-wide significant loci explain ~20% of the heritability
  • all variation tagged in current GWASs together explains ~80%
  • 54% of 1 Mb windows in the genome contribute to the heritability of extreme lipid levels
  • we have clear evidence for the involvement of core genes, yet they contribute only a small fraction of the genetic variance in the trait
  • much of the remaining variance is due to the combined contributions of many small trans effects being funneled through the core genes
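The structure of the argument around Equation 3 — M per-gene variance terms versus nearly M² pairwise covariance terms — can be sketched numerically. The variance components and co-regulation correlation below are invented for illustration, not the paper's estimates:

```python
# Trait Y = sum of expression of M core genes; each gene's expression has an
# independent cis component plus a trans component partially shared across
# genes through co-regulation. Illustrative (assumed) values:
v_cis, v_trans = 0.3, 0.7   # per-gene genetic variance in expression
r = 0.2                      # genetic correlation between trans components of gene pairs

def variance_partition(M):
    """Split Var(Y) into M diagonal (per-gene) terms and M(M-1) covariance terms."""
    diag = M * (v_cis + v_trans)           # per-gene variance terms
    offdiag = M * (M - 1) * r * v_trans    # pairwise covariance terms (trans-driven)
    return diag, offdiag

for M in (10, 100, 1000):
    diag, offdiag = variance_partition(M)
    print(M, round(offdiag / (diag + offdiag), 3))
```

Even a modest pairwise genetic correlation lets the covariance terms dominate once M is large, which is the sense in which co-regulated core genes amplify weak trans effects.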

soft sweep

Zheng Y & Wiehe T 2019 Adaptation in structured populations and fuzzy boundaries between hard and soft sweeps. PLoS Comput Biol 15:e1007426.

  • train- and test-sets must have the same, or at least similar, demographic parameters so that demographic effects will not be mis-identified as selection signals
  • real population histories may lie outside of the tested parameter space
  • it has been claimed that over 90% of the recent adaptation events in Homo sapiens have been soft sweeps, making hard sweeps the exception rather than the rule [35]
  • this finding is consistent with an earlier study reporting that "classic selective sweeps," i.e., those characterized by a sharp reduction of diversity around an adaptive locus, are rare in human populations
  • recent Drosophila adaptations are largely attributed to hard sweeps
  • we assume a co-dominant fitness scheme
  • wild-type homozygotes have fitness 1
  • heterozygotes have fitness 1 + s
  • mutant homozygotes have fitness (1 + s)² ≈ 1 + 2s
  • we assume further s = 0.02
  • a total of 5,000 parallel samples (100 independent populations and 50 samples from each) were produced for each scenario, each deme, and each time point
  • haplotype-based methods excel with ongoing and early-stage sweeps
  • frequency-spectrum-based ones are more powerful for completed sweeps
  • this is consistent with the fact that methods such as iHS are designed for ongoing sweeps rather than for completed ones
  • generally, we observe a tendency to mis-classify hard sweeps as soft when there is a stage mismatch between train- and test-sets of a predictor
  • we call this effect temporal softening
  • misclassification as soft sweeps is common at both ends of the timeline but rare at time stages close to fixation
  • classification of sweeps as "hard" and "soft" often relies on ideal assumptions such as known time stage and genomic location of the selection site, as well as demographic assumptions such as a panmictic population of constant size
  • in regard to the location-based effect known as the "soft shoulder", potential solutions include explicitly modeling regions linked to hard sweeps as well as classifying sweeps based on signal peaks only
  • our "temporal softening" is caused by an early-stage hard sweep mimicking the signal of later-stage soft sweeps
  • multiple haplotypes at the locus
  • weaker reduction of genetic diversity
  • one-peak patterns for statistics like Fay and Wu's H or linkage-based ones (two-peak patterns occur for fixed hard sweeps)
  • the peaks refer to the shape of the statistics along the chromosomes, surrounding the site of the adaptive allele
  • when a machine-learning algorithm is trained with sweeps of only one time stage, or a statistic (especially a likelihood-ratio test) is created based on only ongoing or fixed sweeps, it can be unable to recognize patterns for other stages
  • most previous studies have focused on only ongoing [62] or fixed [21] sweeps
  • so far, little attention has been paid to the question of how robust the tools are with respect to stage mismatch and how much false positive and negative rates may be inflated by this problem
  • we thus argue that searches for sweeps in genomic data, especially those that also try to distinguish hard and soft sweeps, need to explicitly account for the different stages (ongoing, recent or ancient) in the models and (if applicable) machine-learning training sets
  • it is possible that the large number of "soft sweeps" discovered in the human genome [36], [35] are "sweeps by proxy"
  • i.e. hard sweeps occurring in other populations imported by migration
  • mixed samples intensify the "spatial softening" effect in local adaptation scenarios
  • temporal misclassification, including softening and hardening, refers to classification of hard sweeps as soft or vice versa, because the training model mismatches with the tested data in time stage
  • spatial softening can cause hard sweeps in neighboring demes to be falsely detected as soft
  • if a panmictic population model is used in data analysis but the real situation involves occasional migration, false positive sweeps (mainly classified as soft) may ensue
  • the claim that human populations have overwhelmingly soft sweeps as the mode of adaptation may be a result of biased classification
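
Several bullets above turn on SFS-based statistics such as Fay and Wu's H. As a minimal illustrative sketch (not code from the paper; `fay_wu_h` is a hypothetical helper), the statistic can be computed from an unfolded site frequency spectrum as θ_π − θ_H:

```python
def fay_wu_h(sfs, n):
    """Fay and Wu's H = theta_pi - theta_H from an unfolded SFS.

    sfs[i-1] is the number of sites at which the derived allele is
    carried by i of the n sampled chromosomes (i = 1..n-1).  Strongly
    negative H flags an excess of high-frequency derived alleles,
    the classic post-hitchhiking signal.
    """
    if len(sfs) != n - 1:
        raise ValueError("unfolded SFS must have n - 1 entries")
    pairs = n * (n - 1) / 2  # C(n, 2), the number of sample pairs
    theta_pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs
    theta_h = sum(i * i * x for i, x in enumerate(sfs, start=1)) / pairs
    return theta_pi - theta_h

# toy data: n = 4 chromosomes, three singletons and one doubleton
print(fay_wu_h([3, 1, 0], 4))  # theta_pi = 13/6, theta_H = 7/6
```

Scanning such a statistic in windows along the chromosome is what produces the one-peak versus two-peak profiles the notes describe.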

Tajima's coalescent

Palacios JA, Véber A, Cappello L, Wang Z, Wakeley J & Ramachandran S 2019 Bayesian estimation of population size changes by sampling Tajima's trees. Genetics 213:967-986.

  • our objective in the implementation of BESTT is to estimate the posterior distribution of model parameters by replacing Kingman's genealogy with Tajima's genealogy g^T
  • replacing Kingman's genealogy by Tajima's genealogy in our posterior distribution exponentially reduces the size of the state space of genealogies
  • Tajima's genealogies
  • our method of computing the probability of the recoded data, Y_{h×m}, uses ranked tree shapes rather than fully labeled histories
  • we refer to these ranked tree shapes as Tajima's genealogies
  • they have also been called unlabeled rooted trees (Griffiths and Tavaré 1995) and evolutionary relationships (Tajima 1983)
  • in Tajima's genealogies, only the internal nodes are labeled and they are labeled by their order in time
  • Tajima's genealogies encode the minimum information needed to compute the probability of the data Y_{h×m}, which consists of nested sets of mutations, without any approximations
  • no other labels matter because individuals are exchangeable in the population model we assume
  • this represents a dramatic coarsening of tree space compared to the classical leaf-labeled binary trees of Kingman's coalescent
  • this provides a much more efficient way to integrate over the key hidden variable, the unknown gene genealogy of the sample, when computing likelihoods
  • we model this hidden variable using the vintaged and sized coalescent (Sainudiin et al. 2015), which corresponds exactly to this coarsening of Kingman's coalescent
  • the main computational bottleneck of coalescent-based inference of evolutionary histories lies in the large cardinality of the hidden state space of genealogies
  • in the standard Kingman coalescent, a genealogy is a random labeled bifurcating tree that models the set of ancestral relationships of the samples
  • a lower-resolution coalescent model on genealogies, Tajima's coalescent, can be used as an alternative to the standard Kingman coalescent model
  • the Tajima coalescent model provides a feasible alternative that integrates over a smaller state space than the standard Kingman model
  • the main advantage of Tajima's coalescent is that it models the ranked tree topology rather than the fully labeled tree topology of Kingman's coalescent
  • our method does not model recombination, population structure, or selection
  • it assumes completely linked and neutral segments from individuals from a single population, and the infinite sites mutation model
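
The "exponential reduction" of the state space can be made concrete by counting. A minimal sketch, not from the paper: ranked leaf-labeled genealogies (labeled histories) number n!(n−1)!/2^(n−1), while ranked tree shapes are counted by the Euler zigzag numbers (OEIS A000111) — that correspondence is a combinatorial identity assumed here, not a claim of the paper:

```python
from math import factorial

def labeled_histories(n):
    """Ranked, leaf-labeled binary trees on n leaves (the state space of
    Kingman genealogy topologies): n! * (n-1)! / 2^(n-1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)

def zigzag(m):
    """Euler zigzag number A(m) (OEIS A000111), via the Entringer recursion."""
    row = [1]
    for i in range(1, m + 1):
        new = [0]
        for k in range(1, i + 1):
            new.append(new[k - 1] + row[i - k])
        row = new
    return row[-1]

def ranked_tree_shapes(n):
    """Ranked but leaf-unlabeled trees on n leaves (Tajima's state space);
    counted by the (n-1)-th zigzag number (assumed identity)."""
    return zigzag(n - 1)

for n in (4, 8, 12):
    print(n, labeled_histories(n), ranked_tree_shapes(n))
```

Already at n = 8 there are 1,587,600 labeled histories but only 272 ranked tree shapes, and the gap widens super-exponentially with n, which is what makes integrating over Tajima's genealogies feasible.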

Tajima's coalescent

Palacios JA, Wakeley J & Ramachandran S 2015 Bayesian nonparametric inference of population size changes from sequential genealogies. Genetics 201:281-304.

  • we address a key problem for inference of population size trajectories under sequentially Markov coalescent models
  • we express the transition densities of local genealogies in terms of local ranked tree shapes (Tajima 1983) and coalescent times and show that these quantities are statistically sufficient for inferring population size trajectories either from sequence data directly or from the set of local genealogies
  • the use of ranked tree shapes allows us to exploit the state process of local genealogies efficiently since the space of ranked tree shapes has a smaller cardinality than the space of labeled topologies (Sainudiin et al. 2014)
  • sequential Tajima's genealogies are sufficient statistics under the SMC'
  • the sufficient statistics for inferring N(t) under the SMC' model are the coalescent times, when taken together with local ranked tree shapes (tree with no labels but ranked coalescent events)
  • for a single locus, the set of coalescent times together with the ranked tree shape corresponds to a realization of Tajima's n-coalescent
  • the set of local Tajima's genealogies has sufficient statistics for inferring N(t) under the SMC' model
  • our model can be easily modified to model a variable recombination rate along chromosomal segments and to jointly infer variable recombination rates and N(t)
  • under the SMC' model, local ranked tree shapes and coalescent times correspond to a set of local Tajima's genealogies
  • these Tajima's genealogies are sufficient statistics for inferring N(t)
  • under the SMC' model, the state space needed for inferring population size trajectories from sequence data is that of a sequence of local Tajima's genealogies
  • this lumping, or reduction of the original SMC' process, will allow more efficient inference from sequence data directly
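
Since coalescent times (with ranked tree shapes) are the sufficient statistics for N(t), the forward direction — how a given N(t) shapes the times — is easy to sketch. A minimal illustration, not code from the paper, assuming a piecewise-constant N(t) and using the standard time-rescaling of the Kingman coalescent:

```python
import bisect
import random

def sample_coal_times(n, breakpoints, sizes, rng=None):
    """Sample the n-1 coalescent times of a Kingman genealogy under a
    piecewise-constant population size trajectory N(t).

    breakpoints: sorted epoch start times [0.0, t1, t2, ...];
    sizes[i] is N(t) on [breakpoints[i], breakpoints[i+1]).
    With k lineages extant the coalescent intensity is k(k-1)/2 / N(t),
    so we draw an Exp(1) variate and invert the cumulative intensity
    epoch by epoch.
    """
    rng = rng or random.Random(1)
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2        # pairwise coalescent rate
        e = rng.expovariate(1.0)      # target cumulative intensity
        while True:
            i = bisect.bisect_right(breakpoints, t) - 1
            end = breakpoints[i + 1] if i + 1 < len(breakpoints) else float("inf")
            cap = rate * (end - t) / sizes[i]  # intensity left in this epoch
            if e <= cap:
                t += e * sizes[i] / rate       # coalescence within this epoch
                break
            e -= cap                           # spend the epoch, move on
            t = end
        times.append(t)
    return times
```

Inference runs this logic in reverse: the same epoch-by-epoch intensities define the likelihood of the observed coalescent times as a function of N(t), which is what the lumped SMC' state space makes cheap to evaluate.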