soft sweep

Harris RB, Sackman A & Jensen JD 2018 On the unfounded enthusiasm for soft selective sweeps II: examining recent evidence from humans, flies, and viruses. PLoS Genet 14:e1007859.

  • all three examples are prone to extremely high false-positive rates, incorrectly identifying soft sweeps under both hard sweep and neutral models
  • well-fit demographic histories combined with rare hard sweeps serve as the more parsimonious explanation
  • these findings represent a necessary response to the growing tendency of invoking parameter-heavy, assumption-laden models of pervasive positive selection, and neglecting best practices regarding the construction of proper demographic null models
  • these genomic scans were initially focused around a hard sweep model, in which selection acts upon rare, newly arising beneficial mutations
  • recent years have seen the description of sweeps occurring from both standing and rapidly recurring beneficial mutations, collectively known as soft sweeps
  • there is a recent and troubling tendency to neglect these demographic considerations
  • we reanalyze these findings and demonstrate that a more careful consideration of neutral processes results in highly differing conclusions
  • for decades, identifying beneficial mutations based on genomic patterns of linked polymorphism has remained a topic of keen interest theoretically, methodologically, and empirically in the field of population genetics
  • initial efforts were largely focused around a hard selective sweep model—that is, one in which positive selection acts upon a newly arising beneficial mutation and brings it to fixation in the population
  • owing both to theoretical developments (e.g., [3,4]) and to a lack of evidence for widespread hard sweeps in the genomes of commonly studied organisms (e.g., [5]), alternative models have gained attention over the past decade
  • the notion of soft selective sweeps encompasses at least two very different models:
  • a) selection on standing variation—in which positive selection begins acting upon a mutation only once it is already at appreciable frequency in the population
  • b) multiple de novo beneficial mutations—in which positive selection acts upon independently-arising and simultaneously-segregating copies of (a) beneficial mutation(s)
  • despite being relevant in two very different parameter spaces, the commonality between these two models and the reason for their common grouping as soft sweeps is that both models may result in multiple high frequency haplotypes at the time of fixation
  • this reflects the fact that equivalent copies of beneficial mutations are carried on different haplotypes at the onset of selection
  • these haplotype backgrounds may hitchhike to intermediate frequencies
  • with regard to the above prediction of multiple high frequency haplotypes, Schrider et al. [6] noted that the presence of high frequency haplotypes is also a widely utilized prediction of a hard sweep model with recombination
  • the authors described a so-called ‘soft-shoulder effect’ in which regions flanking a hard sweep may be mis-characterized as being the target of a soft sweep
  • Orr and Betancourt [7] demonstrated that the likelihood of a hard sweep reaching fixation is not simply determined by whether positive selection begins acting when the beneficial mutation is present in a single vs. multiple copies in the population
  • they showed that there is a wide parameter space under which selection on standing variation will still result in a hard sweep
  • we examined the impact of the population's demographic history on the inferred mode of selection
  • owing to the wide range of expected patterns of variation produced under models of soft sweeps, these new methods frequently classify both neutral demographic histories and hard sweeps as soft sweeps
  • soft selective sweeps appear as the default conclusion under any model that results in intermediate frequency haplotypes
  • recent claims for widespread soft sweeps are highly tenuous
  • well-fit demographic models combined with rare hard sweeps stand as a strong alternative explanation for observed data
  • pervasive soft sweeps are a highly unlikely explanation in all instances
  • these examples were chosen to span organisms of very different underlying population parameters
  • the results and considerations described below are therefore applicable to the much broader soft selective sweep literature
  • given the rapid sojourn time of a beneficial mutation, the assumption that all sweeps are ongoing is peculiar indeed
  • Vy et al. demonstrated that using a flexible window-size approach results in hard-sweep classifications for loci identified as soft sweeps by Garud et al.
  • these observations are fully consistent with scanning whole-genome data and ascertaining the most extreme regions
  • as such there is no need to invoke anything other than population history in order to explain empirical observations
  • the H2/H1 statistic utilized to discern the mode of selection has poor discriminatory power
  • the empirical p-values demonstrate that the top 50 peaks claimed to be the result of soft sweeps all have H2/H1 values that fall well within the distribution of hard sweeps (p = 0.1–0.33)
  • when considering a demographic model fit to the DGRP data, it is apparent that the top 50 outlier regions ascertained by Garud et al. are largely consistent with neutrality
  • S/HIC has a high true positive rate when the true demographic model is known a priori
  • this is never the case in practice
  • the classifier was trained using an equilibrium demographic model
  • the true/test data were simulated under the non-equilibrium model estimated for African human populations by Tennessen et al. [16]
  • these 'true-positives' consist of hard sweeps that have been incorrectly identified as soft
  • their true positives consist nearly entirely of false positives
  • this mis-classification is almost universally in the direction of falsely identifying soft selective sweeps
  • as with the Drosophila analysis, the claim of soft sweeps in human populations is based on the observation of an excess of intermediate frequency haplotypes across the genome relative to neutral equilibrium expectations
  • the accurate performance of the statistic relies on a priori knowledge of both the distribution of fitness effects and the demographic history of the population in question
  • neither is ever accurately known in practice
  • the mis-specification of either results in pervasive mis-classification
  • hard sweeps of weakly beneficial mutations will be classified as soft sweeps
  • neutral demographic models which result in haplotype structures similar to soft sweeps (including mild bottlenecks, structured populations, and migration) will be classified as 'soft'
  • severe population bottlenecks may result in genomic patterns of variation which appear 'hard'
  • the repeated claim that S/HIC is robust to demography [11,12,21] is unwarranted
  • the finding of genome-wide soft sweeps appears consistent with mis-inference owing to both the highly non-equilibrium history of these human populations and the underlying assumption of large selection coefficients
  • Feder et al. [13] do not propose a novel method, but instead rely on the expectation that hard sweeps will reduce variation much more strongly than soft sweeps
  • though not considered/modeled in Feder et al. [13], it is evident that effective treatment strategies translate to a strong reduction in viral population sizes, whereas ineffective treatments do not
  • variation in the severity of bottlenecks during treatment must therefore result in differing levels of neutral genetic variability within populations exposed to treatments of differing efficacy
  • populations exposed to more effective treatments that exhibit longer periods of virologic suppression will on average spend longer periods of time at reduced population size
  • the lack of sufficient model testing and statistical performance analyses underlying these claims of recurrent soft sweeps appears to have led to inaccurate views of the evolutionary processes and trajectories governing these organisms under study
  • the generalization of these results has resulted in misleading answers to decades-old questions in population genetics, with some suggesting a dominant role for positive selection in shaping patterns of genomic variation (e.g., [21], though see the response of [35])
  • studies which seek to characterize the frequency and impact of selective sweeps using population genomic data, but begin with the assumption that positive selection is the pervasive and dominant force shaping genome-wide patterns of variability, are circular to the point of being futile
  • our work highlights the importance of first considering the demographic history of the population under study when performing genomic analyses
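
Sketch: the haplotype statistics behind the H2/H1 bullets above. This is a minimal illustration, not the pipeline used by Harris et al. or Garud et al. It assumes the usual haplotype-homozygosity definitions (H1 = sum of squared haplotype frequencies, H12 = H1 with the two most common haplotypes pooled, H2 = H1 without the most common haplotype). The function names `haplotype_stats` and `empirical_p_value` are hypothetical, and the one-sided upper-tail p-value is an assumption about how an observed H2/H1 would be compared with values simulated under a hard-sweep or neutral demographic model.

```python
from collections import Counter


def haplotype_stats(haplotypes):
    """Return (H1, H12, H2/H1) for the haplotypes sampled in one window.

    `haplotypes` is any sequence of hashable haplotype labels,
    e.g. strings of alleles; identical labels count as one haplotype.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    # Haplotype frequencies, most common first.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    h1 = sum(p * p for p in freqs)   # haplotype homozygosity
    h12 = h1 + 2 * p1 * p2           # equals (p1 + p2)^2 + sum of remaining p_i^2
    h2 = h1 - p1 * p1                # homozygosity excluding the top haplotype
    return h1, h12, h2 / h1


def empirical_p_value(observed, simulated):
    """Fraction of simulated values >= observed (assumed upper tail)."""
    if not simulated:
        raise ValueError("need at least one simulated value")
    return sum(1 for s in simulated if s >= observed) / len(simulated)


if __name__ == "__main__":
    one_dominant = ["A"] * 9 + ["B"]               # single high-frequency haplotype
    two_common = ["A"] * 5 + ["B"] * 4 + ["C"]     # two high-frequency haplotypes
    print(haplotype_stats(one_dominant))
    print(haplotype_stats(two_common))
```

In this toy input the sample with one dominant haplotype gives H2/H1 of about 0.01, and the sample with two common haplotypes gives about 0.40. The notes' point is that such a difference is only interpretable against values simulated under a well-fit demographic null, which is where `empirical_p_value` would come in.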