haplotype inference

Scheet P & Stephens M 2006 A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78:629-644.

  • fastPHASE
  • another model that also aims to flexibly capture patterns of LD is the PAC model of Li and Stephens (2003), which partially underlies the PHASE software for haplotype inference and estimation of recombination rates
  • one way to view the model we present here is as an attempt to combine the computational convenience of cluster-based models with the flexibility of the PAC model
  • in terms of computational convenience, our model is substantially more attractive than the PAC model: unknown haplotypic phases are integrated out analytically rather than via a time-consuming and tedious-to-implement Markov chain Monte Carlo scheme, such as that used by PHASE
  • the price we pay for this computational convenience is that our model is purely predictive
  • in common with the block-based models mentioned above but in contrast to the PAC model, our model does not attempt to directly relate observed genetic variation to underlying demographic or evolutionary processes
  • the model is nonetheless suited to two other applications that we consider here: inferring unknown ("missing") genotypes and inferring haplotypes from unphased genotype data
  • this model (11) is reminiscent of the "linkage" model of Falush et al. (2003), who modeled genotype data at loosely linked markers in structured populations
  • one difference between their model and ours is that they allowed α (q in their notation) to vary among individuals but fixed α across markers
  • we allow α to vary across markers but assume it to be fixed across individuals
  • the interpretation of these parameters is very different in the two applications
  • in the model of Falush et al. (2003), this parameter controls each individual's proportion of ancestry in each subpopulation (which would be expected to differ across individuals)
  • here it controls the relative frequency of the common haplotypes (which would be expected to differ in different genomic regions)
  • Falush et al. (2003) also restricted r to be constant
  • we allow it to vary in each marker interval
  • the assumption of HWE will not hold exactly for real populations
  • models based on HWE can perform well at haplotype inference and missing-data imputation, even when there are clear and substantial deviations from HWE (e.g., Fallin and Schork 2000; Stephens and Scheet 2005)
  • the model underlying PHASE is based on the PAC model of Li and Stephens (2003), which shares the flexibility of the model we present here but is considerably more costly to compute
  • given the similarities between our model and those of Pritchard et al. (2000) and Falush et al. (2003), it seems natural to consider extending our model to the case in which the subpopulation(s) of origin of each individual is unknown, effectively producing a method for clustering individuals that can deal with sets of tightly linked markers
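
A minimal sketch of how unknown phase can be summed out exactly in a cluster-based hidden Markov model of the kind described in these notes, with α varying across markers and r varying by marker interval. Several things here are assumptions not stated in the notes: markers are biallelic and coded 0/1, genotypes are coded as counts of the 1 allele, the per-interval jump probability is taken to be 1 - exp(-r_m d_m), the two haplotypes of an individual are treated as independent (HWE), and every function and variable name is hypothetical. The naive forward pass below costs O(K^4) per marker and is meant only to illustrate the idea; it is not the fastPHASE implementation.

```python
import math
from itertools import product


def transition_matrix(alpha_m, r_m, d_m):
    """K x K matrix with P[k][k2] = P(cluster k2 at marker m | cluster k at m-1).

    Assumed form: with probability exp(-r_m * d_m) the cluster is kept;
    otherwise a new cluster is drawn with marker-specific weights alpha_m.
    """
    stay = math.exp(-r_m * d_m)
    K = len(alpha_m)
    return [[(stay if k == k2 else 0.0) + (1.0 - stay) * alpha_m[k2]
             for k2 in range(K)] for k in range(K)]


def genotype_emission(g, t1, t2):
    """P(genotype g | allele-1 frequencies t1, t2 of the two clusters).

    The heterozygote case adds both phase orderings, which is where
    phase is integrated out analytically.
    """
    if g == 0:
        return (1.0 - t1) * (1.0 - t2)
    if g == 2:
        return t1 * t2
    return t1 * (1.0 - t2) + t2 * (1.0 - t1)


def genotype_likelihood(geno, alpha, theta, r, d):
    """Forward algorithm over unordered-phase genotypes.

    geno[m]     : genotype (0, 1 or 2) at marker m
    alpha[m][k] : weight of cluster k at marker m
    theta[m][k] : frequency of allele 1 in cluster k at marker m
    r[m], d[m]  : jump rate and distance for the interval (m-1, m);
                  index 0 is unused
    """
    K = len(alpha[0])
    pairs = list(product(range(K), repeat=2))
    # Initial step: both haplotypes pick a cluster independently.
    f = {(a, b): alpha[0][a] * alpha[0][b]
                 * genotype_emission(geno[0], theta[0][a], theta[0][b])
         for a, b in pairs}
    for m in range(1, len(geno)):
        P = transition_matrix(alpha[m], r[m], d[m])
        new_f = {}
        for a2, b2 in pairs:
            # Sum over all previous cluster pairs; the two haplotypes
            # move independently of each other.
            total = sum(f[(a, b)] * P[a][a2] * P[b][b2] for a, b in pairs)
            new_f[(a2, b2)] = total * genotype_emission(
                geno[m], theta[m][a2], theta[m][b2])
        f = new_f
    return sum(f.values())
```

Calling genotype_likelihood([1, 0, 2], alpha, theta, r, d) with three markers and K clusters returns the probability of that unphased genotype vector under the assumed parameters; summed over all possible genotype vectors, these probabilities equal 1.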