isolation with migration

Wilkinson-Herbots HM 2012 The distribution of the coalescence time and the number of pairwise nucleotide differences in a model of population divergence or speciation with an initial period of gene flow. Theor Popul Biol 82:92-108.

  • estimates obtained with the computer programs IM (see references above) and MIMAR (Becquet and Przeworski, 2007) are often biased if the assumptions of the IM model are violated
  • these methods are highly sensitive to the assumption of constant gene flow since the split
  • IMa and IMa2 (Hey, 2010) have been used to estimate the times when migration events occurred, and to try to distinguish between scenarios of speciation with gene flow and scenarios of introgression (where gene flow occurs through secondary contact after a period of complete isolation)
  • such inferences about the timing of gene flow, drawn from IMa or IMa2 output, are not valid
  • commonly used computational methods based on the IM model were originally designed to handle the more traditional type of data set consisting of a large number of DNA sequences at each of a small number of independent loci and are computationally demanding (relying on MCMC simulation to compute the likelihood)
  • Takahata et al. (1995) developed a maximum likelihood (ML) method for estimating demographic parameters using one pair of sequences at each of a large number of independent loci
  • their method is based on the exact, analytical expression for the likelihood for such pairwise difference data
  • there is currently considerable interest in fast computational methods which can handle data from large numbers of loci obtained from only a few genomes
  • likelihoods for pairs of genes at independent loci are far less complicated than likelihoods based on a larger sample at each locus
  • for simple models, coalescent theory can be applied to find the likelihood explicitly rather than having to rely on MCMC or numerical approximations
  • a data set consisting of a small number of sequences at each of a large number of independent loci may actually be more informative about the parameters of interest than a data set consisting of a large number of sequences at each of a small number of independent loci
  • in a panmictic population of constant size, sequences at the same locus are highly positively correlated as they are all part of the same underlying genealogical tree
  • adding more sequences at the same locus typically adds relatively little to the total length of the genealogical tree at that locus
  • for the related but somewhat different problem of reconstructing species trees, Maddison and Knowles (2006) concluded on the basis of simulation results that sequencing more individuals can be more informative than sequencing more loci (for fairly small numbers of individuals and loci) when the species diverged relatively recently
  • the ML method of Takahata et al. (1995) has been extended in a number of ways, including by Yang (1997) to incorporate variation of evolutionary rates among loci, and by Innan and Watanabe (2006) to a model of gradual population divergence where, looking backward in time, the migration rate between two currently isolated populations increases as a linear function of time, from the time of complete isolation until panmixia (or a "quasi-panmictic" state) is reached
  • although Innan and Watanabe's model is more sophisticated than the "isolation with initial migration" model considered in this paper, their calculation of the likelihood relies on numerical computation of the probability density function of the coalescence time of a pair of genes using recursion equations, which can be very time-consuming
  • the accuracy of their likelihood calculation depends on the number of time points at which the pdf of the coalescence time is computed
  • Lohse et al. (2011) demonstrated how generating functions can be used to calculate likelihoods, and implemented an extension of Takahata et al.'s ML method to an IM model, based on the exact likelihood for triplets of DNA sequences sampled at each locus
  • they also discuss how this can be extended to somewhat larger samples at each locus and how recombination can be incorporated
  • in many cases the IM model (with migration continuing at a constant rate until the present) would seem unrealistic in this context
  • it may be tempting to try to interpret estimates of migration rates obtained from such IM analyses as average levels of gene flow over time
  • Teshima and Tajima (2002) considered various scenarios of divergence with varying migration rates over time and found that their results depended on "the sum of the migration rate for the period of migration" or, equivalently, the expected total number of migration events per gene (or ancestral lineage) during the entire period of migration
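  • even though the expected total number of migrants over the entire period of migration is the same in each of these scenarios, the distribution of S12 (the number of nucleotide differences between a pair of sequences, one from each of two populations) can be quite different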
  • parameter estimates obtained by applying an IM model to DNA sequence data for species which are now completely isolated may well be inaccurate and should be treated with caution
  • the number of descendant populations in the "island model" stage of the IM model, n, should be the total number of populations exchanging migrants, including any unsampled populations
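
To make the points above about explicit pairwise likelihoods concrete, here is a minimal sketch for the simplest case only: two populations that split and have been completely isolated since, with one pair of sequences per locus. It is not the "isolation with initial migration" likelihood derived in the paper, and it is not claimed to match the parameterisation of Takahata et al. (1995). Assumptions made for illustration: the infinite-sites model, no recombination within a locus, the same scaled mutation rate theta = 4*N*mu at every locus, and time tau measured in units of 2N generations (N being the ancestral population size). Under these assumptions the number of differences is a Poisson(theta*tau) count accumulated since the split plus a geometric count from the ancestral population. All function names and the data are hypothetical.

```python
import math


def geometric_pmf(k, theta):
    """P(S = k) for two sequences from one panmictic population of constant size.

    Coalescence time T ~ Exp(1) and S | T ~ Poisson(theta * T), which gives
    a geometric distribution with success probability 1 / (1 + theta).
    """
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** k


def isolation_pmf(k, theta, tau):
    """P(S = k) for one sequence from each of two populations isolated since time tau.

    S is the sum of a Poisson(theta * tau) number of mutations since the split
    and a geometric number of mutations in the ancestral population.
    """
    lam = theta * tau
    poisson = math.exp(-lam)  # Poisson probability of j = 0 mutations since the split
    total = 0.0
    for j in range(k + 1):
        total += poisson * geometric_pmf(k - j, theta)
        poisson *= lam / (j + 1)  # step from P(j) to P(j + 1)
    return total


def log_likelihood(diffs, theta, tau):
    """Log-likelihood of per-locus pairwise differences, loci treated as independent."""
    return sum(math.log(isolation_pmf(k, theta, tau)) for k in diffs)


# Made-up pairwise differences at eight loci, for illustration only.
diffs = [3, 7, 5, 2, 9, 4, 6, 5]

# Crude grid search; a real analysis would use a numerical optimiser.
grid = [(th, ta) for th in (1.0, 2.0, 3.0) for ta in (0.0, 0.5, 1.0, 2.0)]
best_theta, best_tau = max(grid, key=lambda p: log_likelihood(diffs, p[0], p[1]))
print(best_theta, best_tau)
```

Because each locus contributes a closed-form term, the likelihood for thousands of loci costs only a sum of such terms, with no MCMC. Setting tau = 0 recovers the single-population geometric distribution.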