Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2012 May 28;28(15):2008–2015. doi: 10.1093/bioinformatics/bts314

Blockwise HMM computation for large-scale population genomic inference

Joshua S Paul 1, Yun S Song 1,2,*
PMCID: PMC3400961  PMID: 22641715

Abstract

Motivation: A promising class of methods for large-scale population genomic inference use the conditional sampling distribution (CSD), which approximates the probability of sampling an individual with a particular DNA sequence, given that a collection of sequences from the population has already been observed. The CSD has a wide range of applications, including imputing missing sequence data, estimating recombination rates, inferring human colonization history and identifying tracts of distinct ancestry in admixed populations. Most well-used CSDs are based on hidden Markov models (HMMs). Although computationally efficient in principle, methods resulting from the common implementation of the relevant HMM techniques remain intractable for large genomic datasets.

Results: To address this issue, a set of algorithmic improvements for performing the exact HMM computation is introduced here, by exploiting the particular structure of the CSD and typical characteristics of genomic data. It is empirically demonstrated that these improvements result in a speedup of several orders of magnitude for large datasets and that the speedup continues to increase with the number of sequences. The optimized algorithms can be adopted in methods for various applications, including the ones mentioned above and make previously impracticable analyses possible.

Availability: Software available upon request.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Contact: [email protected]

1 INTRODUCTION

With the cost of genomic sequencing rapidly decreasing, there is a growing need for statistical methodologies that can efficiently accommodate genomic-scale data for many individuals while accounting for complex patterns of variation (e.g. linkage disequilibrium) caused by evolutionary processes such as mutation and recombination. The applicable underlying statistical model is the coalescent with recombination, which describes the distribution of genealogical histories for a collection of individuals. Known methods for inference under this model are generally intractable at the genomic scale, so practicable methods must realize a balance between computational efficiency and approximating key properties of the coalescent with recombination. A promising class of such methods use the conditional sampling distribution (CSD).

A CSD approximates the probability (under the coalescent with recombination) of sampling an individual with a particular DNA sequence, given that a collection of sequences from the population has already been observed. Intuitively, recombination partitions the newly sampled sequence into segments, each of which is a copy of the corresponding segment in a previously sampled sequence, with imperfections introduced by mutation. For computational efficiency, this construction is often cast as a hidden Markov model (HMM). The hidden state at a site indicates the previously sampled sequence being copied and the associated observed state the allele of the new sample.

Even within this framework there are alternatives, as it is possible to trade-off fidelity to the underlying coalescent process for computational efficiency. The CSD of Paul et al. (2011), for example, is the most accurate but is a constant factor slower than CSDs proposed by Fearnhead and Donnelly (2001) and by Li and Stephens (2003), with the latter being the fastest, but least accurate. CSDs for more complex models, incorporating gene conversion (Gay et al., 2007; Yin et al., 2009), diploidy (Marchini et al., 2007), demography (Davison et al., 2009; Hellenthal et al., 2008) and admixture (Price et al., 2009; Sundquist et al., 2008), have also been proposed.

Methods incorporating the CSD generally fall into one of several categories. Likelihoods can be approximated using CSD-based importance sampling (De Iorio and Griffiths, 2004a; De Iorio and Griffiths, 2004b; Fearnhead and Donnelly, 2001; Griffiths et al., 2008; Stephens and Donnelly, 2000) coupled with composite methods (Fearnhead and Donnelly, 2002; Hudson, 2001) or directly as a product of CSDs (Li and Stephens, 2003). In conjunction with expectation–maximization or Markov chain Monte Carlo, these methods have been used for estimation of fine-scale recombination rates (Crawford et al., 2004; Fearnhead and Smith, 2005; Li and Stephens, 2003; McVean et al., 2004), gene conversion parameters (Gay et al., 2007; Yin et al., 2009), population demography (Davison et al., 2009; Hellenthal et al., 2008) and population structure (Lawson et al., 2012). It is also possible to infer and use the hidden states in the HMM CSD computation. This has been used for admixture inference (Price et al., 2009; Sundquist et al., 2008; Wegmann et al., 2011), in which genomic segments corresponding to ancestral populations are identified and also within a pseudo-Gibbs sampling framework to phase genotype sequence data into haplotype sequence data and to impute missing data (Howie et al., 2009; Li et al., 2010; Marchini et al., 2007; Stephens and Scheet, 2005).

Nearly, all of these methods rely on iterative Monte Carlo or expectation–maximization techniques. As a result, they are computationally intensive, often requiring several hours, or, in some cases, days, to produce a result, even for modest non-genomic datasets (Howie et al., 2009); directly extending the methods to large genomic datasets is thus often impractical. Moreover, nearly all of the running time is expended on CSD computation, and so the choice of CSD is often made on the basis of efficiency and (arguably) at the expense of accuracy (Browning and Browning, 2007; Howie et al., 2009; Li and Stephens, 2003; Scheet and Stephens, 2006; Stephens and Scheet, 2005).

In this article, we help to overcome these limitations by proposing two related optimizations to the relevant HMM-based CSD computations. Consider sampling a large number of sequences from a population. If the sampled sequences are very long, it is likely that nearly all of them will be unique. However, for most relatively short regions, the number of unique subsequences will be reduced. This intuition forms the basis of the first optimization, which locally reduces the complexity of the CSD computation, thereby improving efficiency. The collection of locally unique subsequences on which this optimization depends are formalized as a partition Inline graphic of the sampled sequences; we characterize the optimal partition given the sampled sequences and provide a fast algorithm for approximating this optimum.

A second common feature of the sampled sequences is an abundance of non-polymorphic sites. These sites are informative—for example, a local over-abundance of non-polymorphic sites indicates a recent common ancestor, which in turn indicates a low propensity for recombination—and should be included in the analysis. Indeed, Li and Durbin (2011) used the physical distribution of polymorphic and non-polymorphic sites between a pair of sequences to infer past population sizes of humans. Using the fact that non-polymorphic sites do not differentiate the sequences, we show that it is possible to reduce the complexity of the CSD computation at non-polymorphic sites. We stress that our solution is different from simply ignoring non-polymorphic sites; we are proposing algorithmic improvements to incorporate non-polymorphic sites into the analysis in an efficient way.

In formally describing and evaluating our optimizations, we restrict attention to the most accurate HMM-based CSD, Inline graphic, proposed by Paul et al. (2011) and consider the problem of computing the conditional sampling probability (CSP), denoted Inline graphic, of a particular individual α given a collection n of previously sampled individuals. (Incidentally, in the case the size n of the previously observed sample n is 1, the HMM underlying Inline graphic is equivalent to the HMM used in the aforementioned work of Li and Durbin; we anticipate that Inline graphic provides one way of extending this line of work to many sequences). On simulated data, our algorithmic improvement leads to a speedup of about 550× for n=5000 previously sampled individuals; by making regularity assumptions on mutation and recombination rates, this speedup increases to about 1850×. Importantly, we show that the empirically observed speed-up increases with the number n of previously sampled individuals.

Although we describe our optimizations in the context of computing Inline graphic, they are more generally applicable. We provide two sufficient conditions for our optimizations and use them to show the applicability to other HMM-based CSDs, including those of Fearnhead and Donnelly (2001) and Li and Stephens (2003), as well as CSDs for more complex demographic models and population genetic HMMs. Also, in the Supplementary Material, we describe extending our algorithms to allow for efficient inference of hidden states, often termed posterior decoding.

We stress that the work presented here is fundamentally different from previous works on ‘approximating’ CSD-based population genetic inference. Howie et al. (2009) consider a fixed-size subset of the haplotype configuration n, chosen using a measure of ‘closeness’ to the sampled haplotype α, in order to reduce the state space of the HMM-based CSD and speed up computation. Similarly, Scheet and Stephens (2006) and Browning and Browning (2007) consider an HMM with reduced state space by compacting the configuration n into a substantially smaller haplotype model. More recently, Delaneau et al. (2012) have proposed an approximate HMM formulation relying on a partition of the sampled sequences similar to that proposed herein (Section 2.6). As described earlier, using such heuristics improves computational efficiency, but ultimately at the expense of accuracy. The purpose of this article is to provide highly optimized ‘and exact’ computation for a large class of approximate CSDs, rather than to introduce additional approximations to the underlying models.

2 METHODS

Herein, we describe the HMM formulation of Inline graphic, the algorithms that are currently being used to compute Inline graphic and the optimizations we are proposing to improve the running time.

We remark that the theoretical analysis of our algorithms is limited to asymptotic time (and space) complexity. As a measure of real-world performance, asymptotic analyses often leave much to be desired. Consider, for example, a sample in which 1 out of every 1000 sites is polymorphic. If we denote by k the total number of sites and kp the number of polymorphic sites, then formally O(k)=O(kp). Nevertheless, we would like to distinguish between an algorithm that operates on each of the k sites and an algorithm that operates only on the kp polymorphic sites, as the the latter will be some 1000× faster; we thus write the complexities for the two algorithms as O(k) and O(kp), respectively.

2.1 Notation

We consider haplotypes in the finite-locus finite-alleles setting. Throughout, we suppose that there are k loci, and that recombination may occur between any adjacent pair of loci (ℓ, ℓ+1) where 1≤ℓ<k.

The space of all such haplotypes is denoted by Inline graphic, and given a haplotype hInline graphic, the allele at locus ℓ is denoted by h[ℓ] and the sub-haplotype for a range of loci ℓ≤ℓ′ is denoted by h[ℓ:ℓ′]. A sample configuration of haplotypes is specified by a vector n=(nh)hInline graphic, with nh being the number of haplotypes of type h in the sample. The set of ‘unique’ haplotypes associated with configuration n is denoted by Inline graphicn={hInline graphic:nh>0}. Finally, the total number of haplotypes is denoted by |n|=n, the number of unique haplotypes by |Inline graphicn |=nu and the number of polymorphic loci, which generally depends on the sample n, by kp.

2.2 A brief description of Inline graphic

Suppose that, conditioned on having already observed a haplotype configuration n, we wish to sample a new haplotype α. By generalizing the technique of Griffiths et al. (2008) based on the diffusion process, Paul and Song (2010) introduced the CSD Inline graphic intended to approximate key properties of the coalescent with recombination, the model under which inference is to be performed. The central idea of Inline graphic is to fix the unknown ancestry of n to be the ‘trunk genealogyInline graphic*(n), in which lineages associated with the haplotypes do not mutate, recombine, or coalesce with one another, but rather extend infinitely into the past. Having fixed the ancestry of n, a conditional genealogy Inline graphic associated with haplotype α is sampled; within Inline graphic, lineages evolve backwards in time subject to mutation, recombination, coalescence and ‘absorption’ into one of the lineages of Inline graphic*(n). When every lineage of Inline graphic has been absorbed, the process terminates; the type of every lineage of Inline graphic is now determined and a sample for α is generated.

Although a recursion for computing the CSP Inline graphic is known, it is computationally intractable for all but the smallest datasets. To remedy this limitation, Paul et al. (2011) adopted a sequentially Markov framework McVean and Cardin (2005) for the conditional genealogy Inline graphic. The central idea is to consider the ‘marginal’ conditional genealogy s associated with each locus ℓ, which is described, disregarding mutation events, by the pair s=(t, h), where t∈[0, ∞) is the absorption time and hInline graphicn is the absorption haplotype. The conditional genealogy Inline graphic can thus be represented as a sequence of marginal conditional genealogies {s}. See Figure 1 for an illustration.

Fig. 1.

Fig. 1.

Illustration of a sequentially Markov framework for the conditional genealogy Inline graphic. The trunk genealogy Inline graphic*(n) is indicated. The three loci of each haplotype are each represented by a circle, with the shading indicating the allelic type at that locus. Time is represented vertically, with the present (time 0) at the bottom. The marginal genealogies at the first, second and third loci are shown as dotted, dashed and solid lines, respectively. Mutation event, along with resulting allele, is indicated by small arrow. Absorption events at each locus, and the corresponding absorption time (t(a) and t(b)) and haplotype (h(a) and h(b), respectively), are indicated by horizontal lines

In general, the random sequence of marginal conditional genealogies is not Markov, due to the potential for coalescence events within the conditional genealogy. Nonetheless, it is possible to ‘approximate’ this sequence as Markov by using a two-locus transition distribution. Mutation can then be realized at each locus independently as a Poisson process on the marginal conditional genealogy, thereby generating a sample for α. The resulting CSD is denoted by Inline graphic. Paul et al. (2011) found that the Markov approximation underlying Inline graphic has minimal effect on accuracy compared with Inline graphic, coinciding with findings in similar domains (McVean and Cardin, 2005; Marjoram and Wall, 2006).

Owing to its Markov structure, the CSD Inline graphic can be cast as an HMM wherein the ℓth hidden state is the marginal conditional genealogy s=(t, h) and the ℓth observed state the conditionally sampled allele α[ℓ]. In order to use standard dynamic programming methodologies for inference, the state space of the HMM must be finite, and so absorption time is discretized by partitioning [0, ∞) into a (possibly large) finite number m of intervals I and considering an absorption interval, denoted by dI, rather than an absorption time. The discretized hidden state at the ℓth locus is then s=(d, h), and the initial, transition and emission distributions for the discretized Inline graphic HMM are denoted by ζ(·), Φ(· | ·) and ξ(· | ·), respectively. Interested readers should consult Paul and Song (2010) and Paul et al. (2011) for details.

2.3 Computation with Inline graphic

The CSP Inline graphic under the discrete-space HMM can be efficiently computed using the ‘forward algorithm’, a dynamic program associated with the HMM forward recursion:

graphic file with name bts314m1.jpg (1)

where

graphic file with name bts314m2.jpg (2)

with base case

graphic file with name bts314m3.jpg (3)

Importantly, this recursion, and the associated dynamic program with time complexity O(k(num)2), is applicable to a general HMM. In the following, we examine the transition and emission distributions more carefully and obtain a series of refined recursions and the associated dynamic programs.

2.4 Improving efficiency through the transition distribution

Consider the description of Inline graphic given above and more rigorously defined in Paul et al. (2011). If a recombination does not occur between loci ℓ−1 and ℓ, then (dℓ−1, hℓ−1)=(d, h); moreover, if recombination does occur, the absorbing haplotype h is independent of hℓ−1 and uniformly distributed. As a result, we have the following property (We remark that Property 1 is a sufficient, though not necessary, condition for the algorithmic optimizations described in this and subsequent sections. For example, as stated, the transition distribution imposes a uniform distribution on the absorbing haplotype in the case of a recombination; in fact, the algorithms can be generalized to accommodate an arbitrary distribution over haplotypes that does not depend on d′ or h′. In Section 4, we discuss the applicability of these optimizations to more general (and more specialized) classes of approximate CSDs.):

Property 1. —

The initial and transition probabilities ζ and ϕ, take the following functional form:

graphic file with name bts314um1.jpg

where x(d), y(d′) and z(d′,d) are known analytic expressions.

Using Property 1 in conjunction with definitions

graphic file with name bts314m4.jpg (4)

we can express equations (1)–(3) as

graphic file with name bts314m5.jpg (5)

where

graphic file with name bts314m6.jpg (6)

with base case

graphic file with name bts314m7.jpg (7)

Using these recursions, the dynamic program in Algorithm 1 can be used to compute Inline graphic. Within the pseudocode description, the time complexity of Lines 6, 7 and 8 are O(m), O(nu) and O(nu), respectively. As a result, the time complexity of Lines 5–9, and for the algorithm as a whole, is O(km(m+nu)). This represents a substantial improvement over the quadratic dependence on nu in the naive forward algorithm for HMMs. This simple optimization has already been generally adopted (Fearnhead and Donnelly, 2001; Li and Stephens, 2003; Paul et al., 2011) and serves as a baseline for improvement.

graphic file with name bts314i21.jpg

2.5 Improving efficiency through the emission distribution

The state of the Inline graphic HMM at locus ℓ is a tuple (d, h). However, the emission probability of α[ℓ] is governed only by the time interval d and the allele h[ℓ]. As a result, the following property holds:

Property 2. —

Consider a subset Inline graphicInline graphicn such that there exists an allele a with h[ℓ]=a for all hInline graphic. Then, for each absorption interval dI, the emission distribution ξ(· | d, h) is identical for all hInline graphic. We indicate this fact by writing ξ(·|d, h)=ξ(·|d, Inline graphic) for all hInline graphic.

With this in mind, define a ‘partitionInline graphic of the haplotype configuration n to be a collection of blocks of the form (Inline graphic, ℓs, ℓe), such that

  • For every (Inline graphic, ℓs, ℓe)∈Inline graphic, there exists a sub-haplotype x such that h[ℓs:ℓe]=x for all hInline graphic.

  • For every haplotype hInline graphicn and 1≤ℓ≤k, there exists ‘exactly’ one (Inline graphic, ℓs, ℓe)∈Inline graphic with hInline graphic and ℓs≤ℓ≤ℓe.

For a given locus ℓ, a configuration partition Inline graphic induces a partition of the haplotypes Inline graphicn, denoted by Inline graphicℓ, and Property 2 applies to each Inline graphicInline graphicℓ. In the next sections, we present new sets of recursions and dynamic programming algorithms valid for an arbitrary partition Inline graphic. The computational complexity of these algorithms will depend on Inline graphic through two functions, namely Ψ(Inline graphic) and Ω(Inline graphic), defined as follows: For locus ℓ, define Ψ(Inline graphic)=|Inline graphicℓ|, the number of blocks in Inline graphicℓ and define ω(Inline graphic) to be the total number of haplotypes in blocks of the configuration partition ‘ending’ at locus ℓ. Then,

graphic file with name bts314um2.jpg

In some cases, we are primarily concerned with polymorphic loci, and so we define Ψp(Inline graphic) to be the analog of Ψ(Inline graphic) summed over ‘only’ polymorphic loci.

Finally, we define the trivial partition CT for haplotype configuration n as the partition containing a single block ({h}, 1, k) for each hInline graphicn. Note that Ψ(CT)=k · nu and Ω(CT)=nu. See Figure 2 for an illustration of both CT and two non-trivial configuration partitions.

Fig. 2.

Fig. 2.

Illustration of three alternative configuration partitions. Each row represents a haplotype, with white and black circles representing the allele at each of eight polymorphic loci. The line style of each sub-haplotype indicates the block to which it belongs. (a) The trivial configuration partition CT; Ψp(CT)=40 and Ω(CT)=5. (b) A non-trivial configuration partition, C; Ψp(C)=24 and Ω(C)=12. (c) The configuration partition Cs found by the algorithm described in Section 2.6 for s=3; Ψp(Cs)=24 and Ω(Cs)=15

2.5.1 A general configuration partition

Let Inline graphic be a configuration partition of n. Begin by defining

graphic file with name bts314m8.jpg (8)

Now suppose (Inline graphic, ℓs, ℓe)∈Inline graphic. Applying Definition (8) and Property 2 to equation (6), then for ℓs≤ℓ≤ℓe,

graphic file with name bts314m9.jpg (9)

where nInline graphic=∑hInline graphic nh. Similarly, by induction and making use of equations (6) and (9), it is possible to show that, for ℓs≤ℓ≤ℓe and hInline graphic,

graphic file with name bts314m10.jpg (10)

where T(d, Inline graphic)=∏ℓ′=ℓs ξℓ′(α[ℓ′]|d, Inline graphicyℓ′−1(d), and solves the recursion,

graphic file with name bts314m11.jpg (11)

for ℓs≤ℓ≤ℓe, with base case Ts−1(d, Inline graphic)=1.

For each block (Inline graphic, ℓs, ℓe)∈Inline graphic, we take advantage of equations (9) and (11) to directly compute Q(d, Inline graphic) and T(d, Inline graphic) at every locus ℓs≤ℓ≤ℓe. At the end of each block, when ℓ=ℓe, the finer-grain values F(d, h) are computed for each hInline graphic using equation (10), and subsequently used to compute initial values for blocks beginning at locus ℓ+1. The associated dynamic program to compute the CSP Inline graphic is given in Algorithm 2.

Within Algorithm 2, the time complexity of Line 7 is O(m); of Line 8 is O(Inline graphic)) and of Lines 9 and 10 is O(Inline graphic)). Thus, the time complexity of Lines 6–11, and the dynamic program, is O(km2+m(Ψ(Inline graphic)+Ω(Inline graphic))). Observe that Algorithm 1 is a special case of Algorithm 2 for Inline graphic=CT. Thus, if it is possible to obtain a configuration partition Inline graphic for n such that Ψ(Inline graphic)+Ω(Inline graphic) is substantially less than Ψ(CT)+Ω(CT)=knu+nu, our new algorithm may be considerably faster than Algorithm 1; constructing such a configuration partition is the subject of Section 2.6.

graphic file with name bts314i24.jpg

graphic file with name bts314i25.jpg

2.5.2 The absence of polymorphism

In many reasonable evolutionary scenarios, a great many loci will not be polymorphic. Accommodating such loci in the analysis is important and can be done efficiently making use of Property 2. In particular, for a non-polymorphic locus ℓ, Property 2 applies to the trivial set Inline graphic0=Inline graphicn, and therefore the emission distribution can be written ξ(·|d, Inline graphic0)=ξ(·|d) and moreover, Q(d, Inline graphic0)=Q(d).

Suppose consecutive loci ℓs*,…, ℓe* are not polymorphic. Rewriting equations (9) and (10) for block (Inline graphic0, ℓs*, ℓe*) yields, for ℓs*≤ℓ≤ℓe*,

graphic file with name bts314m12.jpg (12)

and for ℓs*≤ℓ≤ℓe* and hInline graphic0=Inline graphicn,

graphic file with name bts314m13.jpg (13)

where Inline graphic and solves the recursion

graphic file with name bts314m14.jpg (14)

for ℓs*≤ℓ≤ℓe*, with base case Ts*−1(d)=1.

Now, let Inline graphic be a configuration partition with (Inline graphic, ℓs, ℓe)∈Inline graphic. Suppose that there is a stretch of non-polymorphic loci ℓs*,…, ℓe* and that ℓs≤ℓs*≤ℓe*≤ℓe. Applying Definition (8) to equation (13), yields, for ℓs*≤ℓ≤ℓe*,

graphic file with name bts314m15.jpg (15)

Similarly, considering the definition of T(d, Inline graphic) along with equation (14),

graphic file with name bts314m16.jpg (16)

Algorithm 2 can be modified to accommodate such stretches of non-polymorphic loci as a special case, making use of equations (12) and (14) to directly compute the values of Q(d) and T(d) at each non-polymorphic locus ℓ. If we then assume (without loss of generality) that each (Inline graphic, ℓs, ℓe)∈Inline graphic has ℓe at a polymorphic locus, then at the final non-polymorphic locus, for which ℓ=ℓe*, equations (15) and (16) may be used to compute Q(d, Inline graphic) and T(d, Inline graphic) for each Inline graphicInline graphicℓ. This modification is detailed in Algorithm 3.

Within Algorithm 3, the time complexity of Lines 5 and 8 is O(1), of Line 7 is O(m(Inline graphic)+ω(Inline graphic)) and of Line 10 is O(m). As a result, the time complexity of Lines 2–12, and of the dynamic program, is O(km2+mp(Inline graphic)+Ω(Inline graphic))). Relative to Algorithm, less computation needs to be done for non-polymorphic loci; thus, in the typical case of many non-polymorphic loci, this dynamic program will have a decreased running time. For Inline graphic=CT, the time complexity is O(km2+kpmnu).

2.5.3 An optimization for non-polymorphic loci

The key recursions (12) and (14) for non-polymorphic loci can be written in matrix form as Inline graphic=Inline graphic (Inline graphic+Inline graphic) Inline graphicℓ−1 and Inline graphic=Inline graphicInline graphicInline graphicℓ−1, where Inline graphic and Inline graphic are m-dimensional column vectors having dth entry Q(d) and T(d), respectively; Inline graphic and Inline graphic are (m×m)-dimensional diagonal matrices having (d×d)th entry ξ(α[ℓ]|d) and yℓ−1(d), respectively; and Inline graphic is an (m×m)-dimensional matrix having (d, d′)th entry zℓ−1(d′,d).

Now, suppose that the mutational model is symmetric and the mutation rate constant for all loci. Then, Inline graphic=Inline graphic does not depend on ℓ, for all non-polymorphic loci ℓ. Similarly, if the recombination rate between each pair of loci is constant, then Inline graphic=Inline graphic and Inline graphic=Inline graphic do not depend on ℓ. With these assumptions, for ℓs*≤ℓ≤ℓe*,

graphic file with name bts314m17.jpg (17)

and the values of (Inline graphic(Inline graphic+Inline graphic))r and (Inline graphicInline graphic)r can be pre-computed (either by eigenvalue decomposition or repeated multiplication) for a relevant range of r-values. Using this technique for explicitly computing only the necessary values of Q(d) and T(d), stretches of non-polymorphic loci can be ‘analytically’ skipped.

The modified dynamic program associated with this optimization is given in Algorithm 4. The time complexity of Line 4 is O(m) and of Line 6 is O(m(Inline graphic)+ω(Inline graphic)). Thus, the time complexity for the dynamic program is O(kpm2+mp(Inline graphic)+Ω(Inline graphic))).

This refinement once again reduces the computation required for non-polymorphic loci, and so we might expect substantial improvements in performance over Algorithms 2 and 3. For the choice Inline graphic=CT, the time complexity is O(kpm(m+nu)). Note that the assumptions necessary for Algorithm 4, namely a symmetric mutation model and uniform mutation and recombination rates, can be relaxed, but at the expense of additional pre-computation. For example, given non-uniform, but locally similar, recombination rates, pre-computation might be performed for each of several rates; each stretch of non-polymorphic loci could then use the pre-computed values associated with the closest recombination rate.

graphic file with name bts314i27.jpg

2.6 A fast algorithm for configuration partitions

In Section 2.5, we assumed a configuration partition Inline graphic to be specified and showed that, for Algorithms 2–4, the time complexity depends on Inline graphic through the functions Ψ(Inline graphic) (or Ψp(Inline graphic)) and Ω(Inline graphic) and more particularly their sum. It is intuitively clear that a configuration partition minimizing Ω will maximize Ψ (as in CT), and vice versa, and in general these quantities are inversely related; minimizing a convex combination of these quantities is therefore difficult. In this section, we propose a fast and simple parametrized algorithm for constructing reasonably good configuration partitions.

Given a configuration n, the algorithm proceeds sequentially over the loci: initially, let ℓs=1. Given ℓs, find the largest polymorphic locus ℓe such that ℓs≤ℓek, and the number of unique sub-haplotypes between loci ℓs and ℓe is at most some threshold parameter s. Then, for each unique sub-haplotype x between ℓs and ℓe, group all hInline graphicn such that h[ℓs : ℓe]=x into the same block Inline graphic and add (Inline graphic, ℓs, ℓe) to the configuration partition. Set ℓs=ℓe+1 and repeat until locus k is reached. An example configuration partition resulting from this algorithm is shown in Figure 2c.

Applying this procedure to configuration n with threshold parameter s results in a configuration partition which we denote Cs. Observe that for s=|Inline graphicn|, we obtain Cs=CT, which minimizes Ω. On the other hand, for s=2 (in the bi-allelic case), the algorithm produces a configuration partition that minimizes Ψp. Intermediate values of s produce the intermediate results that are of interest. In order to gauge the effect of different combinations of Ψp and Ω on the running time, the CSP Inline graphic was computed using the configuration partition Cs for each of several values of s, and the times recorded; the results are plotted in Figure 3. As our intuition suggested, the running time depends substantially on the choice of Inline graphic and, in accordance with the asymptotic time complexity results depends linearly on both Ψp and Ω. Furthermore, as anticipated, the values of Ψp and Ω are inversely related.

Fig. 3.

Fig. 3.

Empirically observed running time of Algorithm 4 used to compute Inline graphic, for a particular configuration n and an arbitrary α∈Inline graphicn. Several values of s∈(2,…, 500) were used, and each circle corresponds to a particular value of s. The curve of circles demonstrates the trade-off between small Ψp (small s-values) and small Ω (large s-values). As predicted by the asymptotic time complexity results, running time appears to depend linearly on both Ψp- and Ω-values, and fitting a linear model indicates the constant associated with Ψp is ~1.5 times greater than the constant associated with Ω. The configuration n was generated using coalescent simulation for 500 individuals, each having 105 bi-allelic loci, using population-scaled mutation rate θ=0.005 per locus and population-scaled recombination rate ρ=0.001 between each pair of adjacent loci, and resulting in kp=1724 polymorphic loci and nu=324 unique haplotypes

By fitting a linear model to the data, we can deduce the constants associated with Ψp and Ω, which the asymptotic results cannot provide. Although these constants will depend on the implementation and hardware, their ratio should be relatively stable, and therefore informative for choosing an optimal trade-off between Ψp and Ω. We find that the constant associated with Ψp is ~1.5 times that associated with Ω, suggesting that running time is minimized for a choice of Inline graphic that minimizes 1.5 · Ψp(Inline graphic)+Ω(Inline graphic). Furthermore, making use of the above algorithm, we define

graphic file with name bts314um3.jpg

and C*=Cs*. In practice the value s* is found using binary search and determining C* is very fast. This definition will be used frequently in Section 4, as C* (and the analogous result for Algorithm 1, using Ψ in place of Ψp) provides a good, though not necessarily optimal, choice for Inline graphic.

3 RESULTS

We have presented three optimized algorithms for computing the conditional sampling probability (CSP) Inline graphic. Briefly, Algorithms 2–4 rely on a partition Inline graphic of the configuration n. We have characterized optimal such partitions and proposed a simple and fast method for constructing good partitions Inline graphic=C* (cf, Section 2.6). For the sake of comparison, we also consider the trivial partition Inline graphic=CT. Relative to Algorithm 2, Algorithms 3 and 4 represent successive improvements in efficiency for non-polymorphic loci. In this section, we provide an empirical analysis of these algorithms and demonstrate that our optimizations yield a substantial speedup.

The optimized algorithms, along with their asymptotic time complexities, are summarized in Table 1. Intuitively, for a fixed number of haplotypes n, and assuming coarse homogeneity across the genome, the runtimes of each of these algorithms should be linear in the number of loci. We are interested in determining the constants associated with this linear behaviour for each algorithm. Note, however, that for the cases when Inline graphic=CT, the time complexities do not depend on n directly, but rather the number of unique haplotypes nu. For a particular value of n, the quantity nu will increase with the number of loci under consideration until nu=n; only at this point do the runtimes become linear in the number of loci. A similar argument can be made for a more general configuration partition Inline graphic. In order to attain and analyse the linear behaviour for the modestly sized configurations that are considered, we formally interpret even non-unique haplotypes to be unique, thereby forcing nu=n.

Table 1.

A summary of the proposed algorithms along with their asymptotic time complexities

Inline graphic=CT General Inline graphic
Algorithm 2 O(km(m+nu)) O(km2+m(Ψ(Inline graphic)+Ω(Inline graphic)))
Algorithm 3 O(km2+kpmnu) O(km2+mp(Inline graphic)+Ω(Inline graphic)))
Algorithm 4 O(kpm(m+nu)) O(kpm2+mp(Inline graphic)+Ω(Inline graphic)))

Recall that Algorithm 2 with Inline graphic=CT is equivalent to Algorithm 1.

We produce data using coalescent simulation: we assume a symmetric 2-allele model and with population-scaled mutation rate θ=0.005 per locus and population−scaled recombination rate ρ=0.001 between each pair of adjacent loci. For each of several values of n, we thus simulate an n-haplotype 2×105-locus configuration n. We compute the partitions CT and C* and subsequently record the running time of each algorithm in computing Inline graphic, for an arbitrary haplotype α∈Inline graphicn. Throughout, we use a time discretization consisting of m=16 intervals. The running times are plotted, on a logarithmic scale, as a function of n in Figure 4a and 4b, for Inline graphic=CT and Inline graphic=C*, respectively.

Fig. 4.

Fig. 4.

Log-scaled plots of the running time (in milliseconds) required to compute Inline graphic for n with 2×105 loci and |n|=n, as a function n, for each of Algorithms 2–4. Configurations were generated using coalescent simulation as described in the text and results obtained on a single core of a MacPro with dual quad-core 3.0 GHz Xeon CPUs. (a) Inline graphic=CT, the trivial configuration partition. (b) Inline graphic=C*, the configuration partition described in Section 2.6

From Figure 4a, for which Inline graphic=CT, it is clear that our refinements for non-polymorphic loci have practical benefits, as Algorithms 3 and 4 perform substantially better than Algorithm 2. The asymptotic results summarized in Table 1 suggest the running time of Algorithm 4 is a factor of k/kp faster than Algorithm 2. This factor is roughly reflected in the logarithmic plot of Figure 4a as a vertical shift, with deviations occurring because kp increases (slowly) with n. Similarly, as n increases, the asymptotic results indicate that computation is dominated by the O(kpnm) term for both Algorithms 3 and 4; this is reflected in Figure 4a by a near identity in running times for these algorithms for larger values of n.

Comparing Figure 4b to a, the benefits of taking Inline graphic=C* can be observed. For each algorithm, this optimization improves performance substantially, particularly as the number of haplotypes n increases. Given the results for Algorithm 3 in particular, it is clear that the key quantity Ψp(Inline graphic)+Ω(Inline graphic), taken from Table 1, increases more slowly with n for Inline graphic=C* than for Inline graphic=CT. Finally, as in the previous case, the asymptotic results for general Inline graphic indicate that computation is dominated by the O(mp(Inline graphic)+Ω(Inline graphic))) term for both Algorithms 3 and 4; the associated convergence of running times appears to be occurring in Figure 4b, though more slowly than in Figure 4a; thus, Algorithm 4 is a practically useful alternative to Algorithm 3, even for larger values of n.

Although general trends are clear from Figure 4, the logarithmic scale makes it difficult to appreciate the magnitude of the effects of the optimizations. As mentioned earlier, assuming rough homogeneity over the genome, the computation time increases linearly with the number of loci. In Table 2, we summarize the constant associated with this linear behaviour as the time required to process 1×105 loci, along with the speedup relative to the baseline, Algorithm 1. Observe that Algorithm 3, with Inline graphic=C*, which can be applied in complete generality, provides a speedup of 15×, 250× and 546× for conditional sample sizes n=100, 2000, and 5000, respectively; and in most cases, Algorithm 4 can be applied, which increases these speedups to 320×, 1300× and 1845×. Importantly, the speedup increases with the conditional sample size n.

Table 2.

A summary of several key statistics from Figure 4

Inline graphic=CT
n=100 n=2000 n=5000
Algorithm 2 45 (1.0×) 870 (1.0×) 2153 (1.0×)
Algorithm 3 3.5 (13×) 21 (41×) 54 (40×)
Algorithm 4 0.63 (71×) 18 (48×) 49 (44×)

Inline graphic=C*
n=100 n=2000 n=5000

Algorithm 2 3.8 (12×) 7.8 (110×) 10.3 (208×)
Algorithm 3 3.0 (15×) 3.5 (250×) 3.9 (546×)
Algorithm 4 0.14 (320×) 0.68 (1300×) 1.17 (1845×)

The table indicates the time (in seconds) required to compute the CSP Inline graphic for |n|=n, per 1×105 loci. The speed-up versus Algorithm 2 with Inline graphic=CT, equivalent to the commonly used Algorithm 1, is given in parentheses.

4 DISCUSSION

We have presented a number of optimized algorithms for computing the CSP Inline graphic. Our optimizations are based on two intuitive observations: first, the number of unique haplotypes in a genomic sample is dramatically reduced within relatively short regions and second, the large number of non-polymorphic loci in a genomic sample, though informative, do not distinguish between haplotypes. These observations are formalized and leveraged to refine the recursive equations for computing Inline graphic, yielding optimized, yet exact, algorithms.

We have described our optimization algorithms in the context of the HMM associated with the CSD Inline graphic proposed by Paul et al. (2011). It is natural to question whether similar optimizations are applicable to related CSDs, such as those proposed by Fearnhead and Donnelly (2001) and by Li and Stephens (2003). In Section 2, we described two sufficient conditions: Property 1, which stipulates that, upon recombination, a new hidden haplotype is chosen independently and uniformly at random and Property 2, which stipulates that the emission distribution depends only on the allele at the current locus of the hidden haplotype. The aforementioned CSDs do satisfy both of these properties; in particular, stronger forms of Property 1 hold for both CSDs, enabling additional optimizations. We have not empirically analysed the resulting optimized algorithms, but by considering the resulting asymptotic time complexities, analogous to those in Table 1, we anticipate that the speedups obtained will be qualitatively comparable to those observed for Inline graphic, though the corresponding magnitudes are difficult to estimate.

It is also interesting to consider CSDs for more complex demographic scenarios. A theoretically straightforward extension of Inline graphic to variable population size, for example, will continue to satisfy both Properties 1 and 2 and will therefore be amenable to very similar optimizations. On the other hand, extension to structured populations, populations that are divided into several demes between which there is limited migration, will not satisfy Property 1 as the new hidden haplotype chosen upon recombination depends on the deme in which recombination occurs. Nonetheless, a relaxed version of Property 1 will be satisfied along with Property 2, and we anticipate analogous optimizations will be possible. The outcome is similar if Inline graphic is extended to conditionally sampling diploid, rather than haploid, individuals.

Related optimizations may be possible in other contexts as well. For example, Li and Durbin (2011) make use of a population genetic HMM which satisfies conditions that correspond to Properties 1 and 2 and is used to analyse genomic data. It is interesting to note that, in order to make their method practicable, Li and Durbin (2011) consider non-overlapping 100 bp windows as their set of loci; using the optimization detailed in this article may render such compromises unnecessary. It is less clear whether our optimizations are applicable to other population genetic HMMs, such as those considered by Hobolth et al. (2007) and Dutheil et al. (2009); nonetheless, we hope that our work will foster progress in this area.

We conclude by recalling that a broad range of population genetic methods have been developed and will continue to be developed, based on the CSD. These methods are generally computationally intensive, and approximations are often made on the basis of efficiency and at the expense of accuracy; with the advent of inexpensive genomic sequencing, such computational problems will be compounded. We have introduced several optimizations for CSD computation that can potentially speed up this computation by several orders of magnitude without introducing additional approximations. We believe that these optimizations will enable analyses that were previously impracticable, particularly for large genomic datasets. We also hope that the optimizations will encourage more accurate methods, and in particular more accurate CSDs, to be developed and used for population genomic inference.

Supplementary Material

Supplementary Data

Acknowledgements

We thank Anand Bhaskar and Matthias Steinrücken for fruitful discussion on both the theoretical and practical aspects of this work.

Funding: NIH National Research Service Award Trainee appointment (T32-HG00047) to JSP and an NSF CAREER Grant (DBI-0846015) and a Packard Fellowship for Science and Engineering to YSS.

Conflict of Interest: none declared.

REFERENCES

  1. Browning B.L., Browning S.R. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Crawford D.C., et al. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 2004;36:700–706. doi: 10.1038/ng1376. [DOI] [PubMed] [Google Scholar]
  3. Davison D., et al. An approximate likelihood for genetic data under a model with recombination and population splitting. Theor. Popul. Biol. 2009;75:331–345. doi: 10.1016/j.tpb.2009.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. De Iorio M., Griffiths R.C. Importance sampling on coalescent histories. I. Adv. Appl. Prob. 2004a;36:417–433. [Google Scholar]
  5. De Iorio M., Griffiths R.C. Importance sampling on coalescent histories. II: Subdivided population models. Adv. Appl. Prob. 2004b;36:434–454. [Google Scholar]
  6. Delaneau O., et al. A linear complexity phasing method for thousands of genomes. Nat. Methods. 2012;9:179–181. doi: 10.1038/nmeth.1785. [DOI] [PubMed] [Google Scholar]
  7. Dutheil J.Y., et al. Ancestral population genomics: the coalescent hidden markov model approach. Genetics. 2009;183:259–274. doi: 10.1534/genetics.109.103010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Fearnhead P., Donnelly P. Estimating recombination rates from population genetic data. Genetics. 2001;159:1299–1318. doi: 10.1093/genetics/159.3.1299. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Fearnhead P., Donnelly P. Approximate likelihood methods for estimating local recombination rates. J. Royal Stat. Soc. B. 2002;64:657–680. [Google Scholar]
  10. Fearnhead P., Smith N.G. A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet. 2005;77:781–794. doi: 10.1086/497579. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Gay J., et al. Estimating meiotic gene conversion rates from population genetic data. Genetics. 2007;177:881–894. doi: 10.1534/genetics.107.078907. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Griffiths R.C., et al. Importance sampling and the two-locus model with subdivided population structure. Adv. Appl. Probab. 2008;40:473–500. doi: 10.1239/aap/1214950213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Hellenthal G., et al. Inferring human colonization history using a copying model. PLoS Genet. 2008;4:e1000078. doi: 10.1371/journal.pgen.1000078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Hobolth A., et al. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden markov model. PLoS Genet. 2007;3:e7. doi: 10.1371/journal.pgen.0030007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Howie B.N., et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hudson R.R. Two-locus sampling distributions and their application. Genetics. 2001;159:1805–1817. doi: 10.1093/genetics/159.4.1805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Lawson D., et al. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Li H., Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Li N., Stephens M. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Li Y., et al. Mach: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Marchini J., et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913. doi: 10.1038/ng2088. [DOI] [PubMed] [Google Scholar]
  22. Marjoram P., Wall J.D. Fast “coalescent” simulation. BMC Genet. 2006;7:16. doi: 10.1186/1471-2156-7-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. McVean G.A.T., et al. The fine-scale structure of recombination rate variation in the human genome. Science. 2004;304:581–584. doi: 10.1126/science.1092500. [DOI] [PubMed] [Google Scholar]
  24. McVean G.A., Cardin N.J. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Paul J.S., Song Y.S. A principled approach to deriving approximate conditional sampling distributions in population genetics models with recombination. Genetics. 2010;186:321–338. doi: 10.1534/genetics.110.117986. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Paul J.S., et al. An accurate sequentially markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187:1115–1128. doi: 10.1534/genetics.110.125534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Price A.L., et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519. doi: 10.1371/journal.pgen.1000519. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Scheet P., Stephens M. A fast and flexible method for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 2006;78:629–644. doi: 10.1086/502802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Stephens M., Donnelly P. Inference in molecular population genetics. J. R. Stat. Soc. Ser. B Stat. Methodol. 2000;62:605–655. [Google Scholar]
  30. Stephens M., Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 2005;76:449–462. doi: 10.1086/428594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Sundquist A., et al. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res. 2008;18:676–682. doi: 10.1101/gr.072850.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Wegmann D., et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nat. Genet. 2011;43:847–853. doi: 10.1038/ng.894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Yin J., et al. Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data. Bioinformatics. 2009;25:i231–i239. doi: 10.1093/bioinformatics/btp229. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES