normality

Nicholson G, Smith AV, Jónsson F, Gústafsson Ó, Stefánsson K & Donnelly P 2002 Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc B 64:695-715.

  • in the Balding-Nichols (BN) model, allele frequencies at a particular locus in each population are assumed to vary independently about 'mean' values, according to a symmetric Dirichlet distribution (beta for biallelic systems like SNPs), with an additional parameter for each population specifying the variance structure
  • we believe that the different perspective here (explicitly transient) is appropriate for the different genetic systems (SNPs) on which we focus
  • there is a literature (reviewed in Beaumont (2001)) on distinguishing between the BN model and the exact pure drift model for real data
  • the model that is introduced in this paper agrees with the BN model to first and second moments
  • an important difference in principle is that the use of a normal, but not of a beta, distribution allows for variation to be lost (population allele frequencies of 0 or 1) in some contemporaneous populations, because the normal distribution has mass outside [0, 1]
  • a feature that is apparently important in the second data set below
  • we model the dependence structure among the population allele frequencies α as follows
  • first introduce another collection of unobserved quantities, one for each locus
  • πi, i = 1, 2,..., L
  • in the population genetics discussion below these will play the role of the allele frequencies in a population ancestral to those sampled
  • in addition, introduce parameters cj, j = 1, 2,..., P, one for each population
  • these will be the parameters that we aim to estimate
  • they specify how far (in the sense of variance) each population's allele frequencies tend to be from typical values
  • formally, conditionally on π, c,
  • αij ~ normal{πi, cjπi(1 − πi)} ... (2)
  • here and throughout we shall sometimes specify continuous distributions for quantities which are restricted to [0, 1]
  • by this we mean the distribution whose density on (0, 1) is the relevant density, with atoms at 0 and 1 whose size is the total mass of the relevant distribution on (−∞, 0) and (1, ∞) respectively
  • to complete the hierarchy we put independent priors on π and c
  • π1,..., πL are independent and identically distributed with density f ... (3)
  • c1,..., cP are independent and identically distributed with density g ... (4)
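
The hierarchy in expressions (2)-(4) can be sketched as a small simulation. This is only an illustration under stated assumptions: the density f is taken to be uniform on (0.05, 0.95), the c values are fixed by hand rather than drawn from a density g, and the function names are hypothetical. Clipping a normal draw to [0, 1] reproduces the convention described after expression (2), since all mass below 0 or above 1 becomes an atom at the boundary.

```python
import random

def draw_alpha(pi_i, c_j, rng):
    """One draw of alpha_ij ~ normal{pi_i, c_j * pi_i * (1 - pi_i)}, restricted to [0, 1]."""
    sd = (c_j * pi_i * (1.0 - pi_i)) ** 0.5
    raw = rng.gauss(pi_i, sd)
    # Clipping moves the mass on (-inf, 0) and (1, inf) to atoms at 0 and 1.
    return min(1.0, max(0.0, raw))

def simulate(n_loci, c, rng):
    """Return (pi, alpha), where alpha[i][j] is the frequency at locus i in population j."""
    # Assumption: f is uniform on (0.05, 0.95); the notes leave f unspecified.
    pi = [rng.uniform(0.05, 0.95) for _ in range(n_loci)]
    alpha = [[draw_alpha(p, c_j, rng) for c_j in c] for p in pi]
    return pi, alpha

rng = random.Random(0)
c = [0.01, 0.05, 0.2]  # assumption: illustrative values, one per population
pi, alpha = simulate(1000, c, rng)

# Fraction of loci at which each population has lost variation (alpha equal to 0 or 1).
fixed = [sum(row[j] in (0.0, 1.0) for row in alpha) / len(alpha) for j in range(len(c))]
```

Under these assumptions, populations with larger c should show a larger fraction of loci fixed at 0 or 1, which is the behaviour that a beta distribution cannot represent.
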
  • the version of the Wright–Fisher diffusion in question has infinitesimal mean 0 and infinitesimal variance at z given by z(1 − z)
  • provided that τ is small, the increment in the diffusion, from starting value π, over time τ, is approximately normally distributed, in this case with mean 0 and variance τπ(1 − π)
  • α ~ normal{π, τπ(1 − π)} ... (7)
  • which is consistent with the marginal distribution implied by equation (2)
  • both the natural Markov model and the diffusion approximation have absorbing boundaries corresponding to either all or none of the chromosomes in the population carrying the variant in question
  • provided that the distributional statement (7) is interpreted as described after expression (2), then expression (7) remains a reasonable approximation
  • see Beaumont (2001) and references therein for an exact treatment
  • in this setting, the parameter cj for population j has a very natural interpretation as the time, on the diffusion timescale, for which the population has undergone genetic drift
  • in the special case in which the major effect is the bottleneck at founding, then cj can be interpreted as the inverse of the size of the bottleneck
  • expression (7) is effectively just the normal approximation to the binomial distribution for the number of copies of the allele of interest after the bottleneck
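
The bottleneck interpretation can be checked numerically. The sketch below assumes the bottleneck size is counted in chromosomes and that the founders are a binomial sample from an ancestral frequency π. All names and parameter values are illustrative. It compares the variance of the post-bottleneck frequency with π(1 − π) to recover an implied c.

```python
import random

def bottleneck_frequencies(pi, n_chrom, reps, rng):
    """Allele frequency among n_chrom founder chromosomes, repeated reps times."""
    freqs = []
    for _ in range(reps):
        # Binomial(n_chrom, pi) count, built from independent Bernoulli draws.
        copies = sum(rng.random() < pi for _ in range(n_chrom))
        freqs.append(copies / n_chrom)
    return freqs

rng = random.Random(1)
pi, n_chrom = 0.3, 50  # assumption: illustrative ancestral frequency and bottleneck size
freqs = bottleneck_frequencies(pi, n_chrom, 5000, rng)

mean = sum(freqs) / len(freqs)
var = sum((x - mean) ** 2 for x in freqs) / len(freqs)

# Under expression (7), var is approximately c * pi * (1 - pi), so this ratio estimates c.
c_implied = var / (pi * (1.0 - pi))
```

The binomial variance is π(1 − π)/n_chrom, so `c_implied` should be close to 1/n_chrom, up to simulation noise.
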
  • Wright (1951) described FST as 'the correlation between random gametes, drawn from the same subpopulation, relative to the total'
  • unfortunately this definition is not precise
  • some of the subsequent confusion in the literature stems from different interpretations
  • one conceptual dichotomy between approaches arises from differences (usually implicit rather than explicit) in what is being conditioned on
  • a common definition is
  • FST = (Q2 − Q3) / (1 − Q3) ... (9)
  • where Q2 and Q3 are the probabilities that two copies of the region on different chromosomes are the same, when the chromosomes are sampled from within a population and from different populations respectively
  • in what Balding has called the descriptive approach, often associated with Nei and colleagues, these probabilities are thought of as relating only to the sampling process by which the chromosomes are chosen
  • they are defined conditionally on the population allele frequencies
  • as a consequence, FST is a function of the population allele frequencies
  • in practice it is often 'evaluated' simply by replacing these by the obvious estimates from the sample data
  • other approaches (model based in Balding's terminology) interpret the probabilities in equation (9) as relating to repetitions of the entire evolutionary process, rather than simply to repetitions of sampling from the extant populations
  • FST would be thought of as a statistical parameter
  • the goal is to estimate it from data, and/or to relate it (by probability calculations) to parameters which directly specify the evolutionary model
  • the most common estimation procedure (see for example Weir (1996)) is often formulated by analogy with analysis of variance
  • it is equivalent (Rousset, 2001) to a method-of-moments approach
  • the probabilities Q2 and Q3 in equation (9) are estimated by the frequencies of identical pairs of chromosomes at the locus, within and between populations respectively, in the sample, and the estimates substituted into equation (9)
  • writing α for the allele frequency in a population, a rearrangement of equation (9) gives
  • FST = var(α) / (E(α) {1 − E(α)}) ... (10)
  • in the descriptive framework the randomness relates simply to which population may happen to be chosen
  • in a model-based framework the expectations are over the evolutionary model
  • note the standard multiplicative parameterization of var(α) relative to E(α) {1 − E(α)} to which we referred earlier
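
The rearrangement from equation (9) to equation (10) can be illustrated for a biallelic locus. The sketch below works in the descriptive framework and makes two assumptions that the notes do not spell out: Q2 averages the within-population identity probability α² + (1 − α)² over populations, and Q3 treats the two chromosomes as coming from independently chosen populations, so that Q3 = E(α)² + {1 − E(α)}². Then Q2 − Q3 = 2 var(α) and 1 − Q3 = 2 E(α){1 − E(α)}, which gives equation (10). Function names are hypothetical.

```python
def fst_from_identity(alphas):
    """Equation (9): (Q2 - Q3) / (1 - Q3) for a biallelic locus."""
    n = len(alphas)
    # Q2: two chromosomes from the same population carry the same allele.
    q2 = sum(a * a + (1.0 - a) * (1.0 - a) for a in alphas) / n
    # Q3: two chromosomes from independently chosen populations carry the same allele.
    mean = sum(alphas) / n
    q3 = mean * mean + (1.0 - mean) * (1.0 - mean)
    return (q2 - q3) / (1.0 - q3)

def fst_from_moments(alphas):
    """Equation (10): var(alpha) / (E(alpha) * (1 - E(alpha)))."""
    n = len(alphas)
    mean = sum(alphas) / n
    var = sum((a - mean) ** 2 for a in alphas) / n
    # Undefined when the mean is 0 or 1 (no variation in the total).
    return var / (mean * (1.0 - mean))

alphas = [0.2, 0.5, 0.8]  # assumption: made-up population allele frequencies
fst_identity = fst_from_identity(alphas)
fst_moments = fst_from_moments(alphas)
```

For these frequencies both forms give 0.24: the mean is 0.5, the variance is 0.06, and 0.06 / 0.25 = 0.24.
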
  • formula (10) bears a close similarity to the marginal variance structure implied by our model (2)
  • if we were to insist on a common value of c across populations, then equation (10) would obtain, with FST replaced by c, provided that it was interpreted as being conditional on π
  • in this sense, particularly as different approaches involve different conditioning anyway, our parameters cj might be thought of as analogous to FST-values, but with one for each population
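
The analogy between c and FST can be seen in a short simulation. This is a sketch under assumptions, not the paper's estimation procedure: many populations share one hypothetical value of c, the ancestral frequency π is held fixed, and deviations are measured around π rather than around the sample mean because the statement is conditional on π.

```python
import random

def c_from_simulation(pi, c, n_pops, rng):
    """Ratio of the mean squared deviation of alpha about pi to pi * (1 - pi)."""
    sd = (c * pi * (1.0 - pi)) ** 0.5
    # Draws follow model (2), restricted to [0, 1] with atoms at the boundaries.
    alphas = [min(1.0, max(0.0, rng.gauss(pi, sd))) for _ in range(n_pops)]
    msd = sum((a - pi) ** 2 for a in alphas) / n_pops
    return msd / (pi * (1.0 - pi))

rng = random.Random(2)
# Assumption: illustrative values; c is small enough that truncation is negligible.
estimate = c_from_simulation(pi=0.3, c=0.02, n_pops=20000, rng=rng)
```

With these values the ratio should be close to 0.02, matching equation (10) with FST replaced by c.
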