Blockwise HMM computation for large-scale population genomic inference

Joshua S Paul; Yun S Song

doi:10.1093/bioinformatics/bts314

. 2012 May 28;28(15):2008–2015. doi: 10.1093/bioinformatics/bts314

Blockwise HMM computation for large-scale population genomic inference

Joshua S Paul ¹, Yun S Song ^1,2,^*

PMCID: PMC3400961 PMID: 22641715

Abstract

Motivation: A promising class of methods for large-scale population genomic inference use the conditional sampling distribution (CSD), which approximates the probability of sampling an individual with a particular DNA sequence, given that a collection of sequences from the population has already been observed. The CSD has a wide range of applications, including imputing missing sequence data, estimating recombination rates, inferring human colonization history and identifying tracts of distinct ancestry in admixed populations. Most well-used CSDs are based on hidden Markov models (HMMs). Although computationally efficient in principle, methods resulting from the common implementation of the relevant HMM techniques remain intractable for large genomic datasets.

Results: To address this issue, a set of algorithmic improvements for performing the exact HMM computation is introduced here, by exploiting the particular structure of the CSD and typical characteristics of genomic data. It is empirically demonstrated that these improvements result in a speedup of several orders of magnitude for large datasets and that the speedup continues to increase with the number of sequences. The optimized algorithms can be adopted in methods for various applications, including the ones mentioned above and make previously impracticable analyses possible.

Availability: Software available upon request.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Contact: [email protected]

1 INTRODUCTION

With the cost of genomic sequencing rapidly decreasing, there is a growing need for statistical methodologies that can efficiently accommodate genomic-scale data for many individuals while accounting for complex patterns of variation (e.g. linkage disequilibrium) caused by evolutionary processes such as mutation and recombination. The applicable underlying statistical model is the coalescent with recombination, which describes the distribution of genealogical histories for a collection of individuals. Known methods for inference under this model are generally intractable at the genomic scale, so practicable methods must realize a balance between computational efficiency and approximating key properties of the coalescent with recombination. A promising class of such methods use the conditional sampling distribution (CSD).

A CSD approximates the probability (under the coalescent with recombination) of sampling an individual with a particular DNA sequence, given that a collection of sequences from the population has already been observed. Intuitively, recombination partitions the newly sampled sequence into segments, each of which is a copy of the corresponding segment in a previously sampled sequence, with imperfections introduced by mutation. For computational efficiency, this construction is often cast as a hidden Markov model (HMM). The hidden state at a site indicates the previously sampled sequence being copied and the associated observed state the allele of the new sample.

Even within this framework there are alternatives, as it is possible to trade-off fidelity to the underlying coalescent process for computational efficiency. The CSD of Paul et al. (2011), for example, is the most accurate but is a constant factor slower than CSDs proposed by Fearnhead and Donnelly (2001) and by Li and Stephens (2003), with the latter being the fastest, but least accurate. CSDs for more complex models, incorporating gene conversion (Gay et al., 2007; Yin et al., 2009), diploidy (Marchini et al., 2007), demography (Davison et al., 2009; Hellenthal et al., 2008) and admixture (Price et al., 2009; Sundquist et al., 2008), have also been proposed.

Methods incorporating the CSD generally fall into one of several categories. Likelihoods can be approximated using CSD-based importance sampling (De Iorio and Griffiths, 2004a; De Iorio and Griffiths, 2004b; Fearnhead and Donnelly, 2001; Griffiths et al., 2008; Stephens and Donnelly, 2000) coupled with composite methods (Fearnhead and Donnelly, 2002; Hudson, 2001) or directly as a product of CSDs (Li and Stephens, 2003). In conjunction with expectation–maximization or Markov chain Monte Carlo, these methods have been used for estimation of fine-scale recombination rates (Crawford et al., 2004; Fearnhead and Smith, 2005; Li and Stephens, 2003; McVean et al., 2004), gene conversion parameters (Gay et al., 2007; Yin et al., 2009), population demography (Davison et al., 2009; Hellenthal et al., 2008) and population structure (Lawson et al., 2012). It is also possible to infer and use the hidden states in the HMM CSD computation. This has been used for admixture inference (Price et al., 2009; Sundquist et al., 2008; Wegmann et al., 2011), in which genomic segments corresponding to ancestral populations are identified and also within a pseudo-Gibbs sampling framework to phase genotype sequence data into haplotype sequence data and to impute missing data (Howie et al., 2009; Li et al., 2010; Marchini et al., 2007; Stephens and Scheet, 2005).

Nearly, all of these methods rely on iterative Monte Carlo or expectation–maximization techniques. As a result, they are computationally intensive, often requiring several hours, or, in some cases, days, to produce a result, even for modest non-genomic datasets (Howie et al., 2009); directly extending the methods to large genomic datasets is thus often impractical. Moreover, nearly all of the running time is expended on CSD computation, and so the choice of CSD is often made on the basis of efficiency and (arguably) at the expense of accuracy (Browning and Browning, 2007; Howie et al., 2009; Li and Stephens, 2003; Scheet and Stephens, 2006; Stephens and Scheet, 2005).

In this article, we help to overcome these limitations by proposing two related optimizations to the relevant HMM-based CSD computations. Consider sampling a large number of sequences from a population. If the sampled sequences are very long, it is likely that nearly all of them will be unique. However, for most relatively short regions, the number of unique subsequences will be reduced. This intuition forms the basis of the first optimization, which locally reduces the complexity of the CSD computation, thereby improving efficiency. The collection of locally unique subsequences on which this optimization depends are formalized as a partition Inline graphic of the sampled sequences; we characterize the optimal partition given the sampled sequences and provide a fast algorithm for approximating this optimum.

A second common feature of the sampled sequences is an abundance of non-polymorphic sites. These sites are informative—for example, a local over-abundance of non-polymorphic sites indicates a recent common ancestor, which in turn indicates a low propensity for recombination—and should be included in the analysis. Indeed, Li and Durbin (2011) used the physical distribution of polymorphic and non-polymorphic sites between a pair of sequences to infer past population sizes of humans. Using the fact that non-polymorphic sites do not differentiate the sequences, we show that it is possible to reduce the complexity of the CSD computation at non-polymorphic sites. We stress that our solution is different from simply ignoring non-polymorphic sites; we are proposing algorithmic improvements to incorporate non-polymorphic sites into the analysis in an efficient way.

In formally describing and evaluating our optimizations, we restrict attention to the most accurate HMM-based CSD, Inline graphic , proposed by Paul et al. (2011) and consider the problem of computing the conditional sampling probability (CSP), denoted , of a particular individual α given a collection n of previously sampled individuals. (Incidentally, in the case the size n of the previously observed sample n is 1, the HMM underlying Inline graphic is equivalent to the HMM used in the aforementioned work of Li and Durbin; we anticipate that provides one way of extending this line of work to many sequences). On simulated data, our algorithmic improvement leads to a speedup of about 550× for n=5000 previously sampled individuals; by making regularity assumptions on mutation and recombination rates, this speedup increases to about 1850×. Importantly, we show that the empirically observed speed-up increases with the number n of previously sampled individuals.

Although we describe our optimizations in the context of computing Inline graphic , they are more generally applicable. We provide two sufficient conditions for our optimizations and use them to show the applicability to other HMM-based CSDs, including those of Fearnhead and Donnelly (2001) and Li and Stephens (2003), as well as CSDs for more complex demographic models and population genetic HMMs. Also, in the Supplementary Material, we describe extending our algorithms to allow for efficient inference of hidden states, often termed posterior decoding.

We stress that the work presented here is fundamentally different from previous works on ‘approximating’ CSD-based population genetic inference. Howie et al. (2009) consider a fixed-size subset of the haplotype configuration n, chosen using a measure of ‘closeness’ to the sampled haplotype α, in order to reduce the state space of the HMM-based CSD and speed up computation. Similarly, Scheet and Stephens (2006) and Browning and Browning (2007) consider an HMM with reduced state space by compacting the configuration n into a substantially smaller haplotype model. More recently, Delaneau et al. (2012) have proposed an approximate HMM formulation relying on a partition of the sampled sequences similar to that proposed herein (Section 2.6). As described earlier, using such heuristics improves computational efficiency, but ultimately at the expense of accuracy. The purpose of this article is to provide highly optimized ‘and exact’ computation for a large class of approximate CSDs, rather than to introduce additional approximations to the underlying models.

2 METHODS

Herein, we describe the HMM formulation of Inline graphic , the algorithms that are currently being used to compute and the optimizations we are proposing to improve the running time.

We remark that the theoretical analysis of our algorithms is limited to asymptotic time (and space) complexity. As a measure of real-world performance, asymptotic analyses often leave much to be desired. Consider, for example, a sample in which 1 out of every 1000 sites is polymorphic. If we denote by k the total number of sites and k_p the number of polymorphic sites, then formally O(k)=O(k_p). Nevertheless, we would like to distinguish between an algorithm that operates on each of the k sites and an algorithm that operates only on the k_p polymorphic sites, as the the latter will be some 1000× faster; we thus write the complexities for the two algorithms as O(k) and O(k_p), respectively.

2.1 Notation

We consider haplotypes in the finite-locus finite-alleles setting. Throughout, we suppose that there are k loci, and that recombination may occur between any adjacent pair of loci (ℓ, ℓ+1) where 1≤ℓ<k.

The space of all such haplotypes is denoted by Inline graphic , and given a haplotype h∈, the allele at locus ℓ is denoted by h[ℓ] and the sub-haplotype for a range of loci ℓ≤ℓ′ is denoted by h[ℓ:ℓ′]. A sample configuration of haplotypes is specified by a vector n=(n_h)_h∈, with n_h being the number of haplotypes of type h in the sample. The set of ‘unique’ haplotypes associated with configuration n is denoted by Inline graphic _n={h∈:n_h>0}. Finally, the total number of haplotypes is denoted by |n|=n, the number of unique haplotypes by |_n |=n_u and the number of polymorphic loci, which generally depends on the sample n, by k_p.

2.2 A brief description of

Suppose that, conditioned on having already observed a haplotype configuration n, we wish to sample a new haplotype α. By generalizing the technique of Griffiths et al. (2008) based on the diffusion process, Paul and Song (2010) introduced the CSD Inline graphic intended to approximate key properties of the coalescent with recombination, the model under which inference is to be performed. The central idea of is to fix the unknown ancestry of n to be the ‘trunk genealogy’ ^*(n), in which lineages associated with the haplotypes do not mutate, recombine, or coalesce with one another, but rather extend infinitely into the past. Having fixed the ancestry of n, a conditional genealogy Inline graphic associated with haplotype α is sampled; within , lineages evolve backwards in time subject to mutation, recombination, coalescence and ‘absorption’ into one of the lineages of ^*(n). When every lineage of has been absorbed, the process terminates; the type of every lineage of Inline graphic is now determined and a sample for α is generated.

Although a recursion for computing the CSP Inline graphic is known, it is computationally intractable for all but the smallest datasets. To remedy this limitation, Paul et al. (2011) adopted a sequentially Markov framework McVean and Cardin (2005) for the conditional genealogy . The central idea is to consider the ‘marginal’ conditional genealogy s_ℓ associated with each locus ℓ, which is described, disregarding mutation events, by the pair s_ℓ=(t_ℓ, h_ℓ), where t_ℓ∈[0, ∞) is the absorption time and h_ℓ∈ Inline graphic _n is the absorption haplotype. The conditional genealogy can thus be represented as a sequence of marginal conditional genealogies {s_ℓ}. See Figure 1 for an illustration.

Fig. 1. — Illustration of a sequentially Markov framework for the conditional genealogy . The trunk genealogy ^*(n) is indicated. The three loci of each haplotype are each represented by a circle, with the shading indicating the allelic type at that locus. Time is represented vertically, with the present (time 0) at the bottom. The marginal genealogies at the first, second and third loci are shown as dotted, dashed and solid lines, respectively. Mutation event, along with resulting allele, is indicated by small arrow. Absorption events at each locus, and the corresponding absorption time (t^(a) and t^(b)) and haplotype (h^(a) and h^(b), respectively), are indicated by horizontal lines

In general, the random sequence of marginal conditional genealogies is not Markov, due to the potential for coalescence events within the conditional genealogy. Nonetheless, it is possible to ‘approximate’ this sequence as Markov by using a two-locus transition distribution. Mutation can then be realized at each locus independently as a Poisson process on the marginal conditional genealogy, thereby generating a sample for α. The resulting CSD is denoted by Inline graphic . Paul et al. (2011) found that the Markov approximation underlying has minimal effect on accuracy compared with , coinciding with findings in similar domains (McVean and Cardin, 2005; Marjoram and Wall, 2006).

Owing to its Markov structure, the CSD Inline graphic can be cast as an HMM wherein the ℓth hidden state is the marginal conditional genealogy s_ℓ=(t_ℓ, h_ℓ) and the ℓth observed state the conditionally sampled allele α[ℓ]. In order to use standard dynamic programming methodologies for inference, the state space of the HMM must be finite, and so absorption time is discretized by partitioning [0, ∞) into a (possibly large) finite number m of intervals I and considering an absorption interval, denoted by d_ℓ∈I, rather than an absorption time. The discretized hidden state at the ℓth locus is then s_ℓ=(d_ℓ, h_ℓ), and the initial, transition and emission distributions for the discretized Inline graphic HMM are denoted by ζ(·), Φ_ℓ(· | ·) and ξ_ℓ(· | ·), respectively. Interested readers should consult Paul and Song (2010) and Paul et al. (2011) for details.

2.3 Computation with

The CSP Inline graphic under the discrete-space HMM can be efficiently computed using the ‘forward algorithm’, a dynamic program associated with the HMM forward recursion:

(1)

where

(2)

with base case

(3)

Importantly, this recursion, and the associated dynamic program with time complexity O(k(n_um)²), is applicable to a general HMM. In the following, we examine the transition and emission distributions more carefully and obtain a series of refined recursions and the associated dynamic programs.

2.4 Improving efficiency through the transition distribution

Consider the description of Inline graphic given above and more rigorously defined in Paul et al. (2011). If a recombination does not occur between loci ℓ−1 and ℓ, then (d_ℓ−1, h_ℓ−1)=(d_ℓ, h_ℓ); moreover, if recombination does occur, the absorbing haplotype h_ℓ is independent of h_ℓ−1 and uniformly distributed. As a result, we have the following property (We remark that Property 1 is a sufficient, though not necessary, condition for the algorithmic optimizations described in this and subsequent sections. For example, as stated, the transition distribution imposes a uniform distribution on the absorbing haplotype in the case of a recombination; in fact, the algorithms can be generalized to accommodate an arbitrary distribution over haplotypes that does not depend on d′ or h′. In Section 4, we discuss the applicability of these optimizations to more general (and more specialized) classes of approximate CSDs.):

Property 1. —

The initial and transition probabilities ζ and ϕ, take the following functional form:

where x^(d), y_ℓ^(d′) and z_ℓ^(d′,d) are known analytic expressions.

Using Property 1 in conjunction with definitions

(4)

we can express equations (1)–(3) as

(5)

where

(6)

with base case

(7)

Using these recursions, the dynamic program in Algorithm 1 can be used to compute Inline graphic . Within the pseudocode description, the time complexity of Lines 6, 7 and 8 are O(m), O(n_u) and O(n_u), respectively. As a result, the time complexity of Lines 5–9, and for the algorithm as a whole, is O(km(m+n_u)). This represents a substantial improvement over the quadratic dependence on n_u in the naive forward algorithm for HMMs. This simple optimization has already been generally adopted (Fearnhead and Donnelly, 2001; Li and Stephens, 2003; Paul et al., 2011) and serves as a baseline for improvement.

graphic file with name bts314i21.jpg

2.5 Improving efficiency through the emission distribution

The state of the Inline graphic HMM at locus ℓ is a tuple (d_ℓ, h_ℓ). However, the emission probability of α[ℓ] is governed only by the time interval d_ℓ and the allele h_ℓ[ℓ]. As a result, the following property holds:

Property 2. —

Consider a subset ⊂_n such that there exists an allele a with h[ℓ]=a for all h∈. Then, for each absorption interval d∈I, the emission distribution ξ_ℓ(· | d, h) is identical for all h∈. We indicate this fact by writing ξ_ℓ(·|d, h)=ξ_ℓ(·|d, ) for all h∈.

With this in mind, define a ‘partition’ Inline graphic of the haplotype configuration n to be a collection of blocks of the form (, ℓ_s, ℓ_e), such that

For every (, ℓ_s, ℓ_e)∈, there exists a sub-haplotype x such that h[ℓ_s:ℓ_e]=x for all h∈.
For every haplotype h∈_n and 1≤ℓ≤k, there exists ‘exactly’ one (, ℓ_s, ℓ_e)∈ with h∈ and ℓ_s≤ℓ≤ℓ_e.

For a given locus ℓ, a configuration partition Inline graphic induces a partition of the haplotypes _n, denoted by ℓ, and Property 2 applies to each ∈ℓ. In the next sections, we present new sets of recursions and dynamic programming algorithms valid for an arbitrary partition . The computational complexity of these algorithms will depend on Inline graphic through two functions, namely Ψ() and Ω(), defined as follows: For locus ℓ, define Ψ_ℓ()=|ℓ|, the number of blocks in ℓ and define ω_ℓ() to be the total number of haplotypes in blocks of the configuration partition ‘ending’ at locus ℓ. Then,

In some cases, we are primarily concerned with polymorphic loci, and so we define Ψ_p( Inline graphic ) to be the analog of Ψ() summed over ‘only’ polymorphic loci.

Finally, we define the trivial partition C_T for haplotype configuration n as the partition containing a single block ({h}, 1, k) for each h∈ Inline graphic _n. Note that Ψ(C_T)=k · n_u and Ω(C_T)=n_u. See Figure 2 for an illustration of both C_T and two non-trivial configuration partitions.

Fig. 2. — Illustration of three alternative configuration partitions. Each row represents a haplotype, with white and black circles representing the allele at each of eight polymorphic loci. The line style of each sub-haplotype indicates the block to which it belongs. (a) The trivial configuration partition C_T; Ψ_p(C_T)=40 and Ω(C_T)=5. (b) A non-trivial configuration partition, C; Ψ_p(C)=24 and Ω(C)=12. (c) The configuration partition C_s found by the algorithm described in Section 2.6 for s=3; Ψ_p(C_s)=24 and Ω(C_s)=15

2.5.1 A general configuration partition

Let Inline graphic be a configuration partition of n. Begin by defining

(8)

Now suppose ( Inline graphic , ℓ_s, ℓ_e)∈. Applying Definition (8) and Property 2 to equation (6), then for ℓ_s≤ℓ≤ℓ_e,

(9)

where n Inline graphic =∑_h∈ n_h. Similarly, by induction and making use of equations (6) and (9), it is possible to show that, for ℓ_s≤ℓ≤ℓ_e and h∈,

(10)

where T_ℓ(d, Inline graphic )=∏_{ℓ′=ℓ_s}^ℓ ξ_ℓ′(α[ℓ′]|d, )·y_ℓ′−1^(d), and solves the recursion,

(11)

for ℓ_s≤ℓ≤ℓ_e, with base case T_{ℓ_s}−1(d, Inline graphic )=1.

For each block ( Inline graphic , ℓ_s, ℓ_e)∈, we take advantage of equations (9) and (11) to directly compute Q_ℓ(d, ) and T_ℓ(d, ) at every locus ℓ_s≤ℓ≤ℓ_e. At the end of each block, when ℓ=ℓ_e, the finer-grain values F_ℓ(d, h) are computed for each h∈ using equation (10), and subsequently used to compute initial values for blocks beginning at locus ℓ+1. The associated dynamic program to compute the CSP Inline graphic is given in Algorithm 2.

Within Algorithm 2, the time complexity of Line 7 is O(m); of Line 8 is O(Ψ_ℓ( Inline graphic )) and of Lines 9 and 10 is O(ω_ℓ()). Thus, the time complexity of Lines 6–11, and the dynamic program, is O(km²+m(Ψ()+Ω())). Observe that Algorithm 1 is a special case of Algorithm 2 for =C_T. Thus, if it is possible to obtain a configuration partition for n such that Ψ()+Ω( Inline graphic ) is substantially less than Ψ(C_T)+Ω(C_T)=kn_u+n_u, our new algorithm may be considerably faster than Algorithm 1; constructing such a configuration partition is the subject of Section 2.6.

graphic file with name bts314i24.jpg

graphic file with name bts314i25.jpg

2.5.2 The absence of polymorphism

In many reasonable evolutionary scenarios, a great many loci will not be polymorphic. Accommodating such loci in the analysis is important and can be done efficiently making use of Property 2. In particular, for a non-polymorphic locus ℓ, Property 2 applies to the trivial set Inline graphic ₀=_n, and therefore the emission distribution can be written ξ_ℓ(·|d, ₀)=ξ_ℓ(·|d) and moreover, Q_ℓ(d, ₀)=Q_ℓ(d).

Suppose consecutive loci ℓ_s^*,…, ℓ_e^* are not polymorphic. Rewriting equations (9) and (10) for block ( Inline graphic ₀, ℓ_s^*, ℓ_e^*) yields, for ℓ_s^*≤ℓ≤ℓ_e^*,

(12)

and for ℓ_s^*≤ℓ≤ℓ_e^* and h∈ Inline graphic ₀=_n,

(13)

where Inline graphic and solves the recursion

(14)

for ℓ_s^*≤ℓ≤ℓ_e^*, with base case T_{ℓ_s^*−1}(d)=1.

Now, let Inline graphic be a configuration partition with (, ℓ_s, ℓ_e)∈. Suppose that there is a stretch of non-polymorphic loci ℓ_s^*,…, ℓ_e^* and that ℓ_s≤ℓ_s^*≤ℓ_e^*≤ℓ_e. Applying Definition (8) to equation (13), yields, for ℓ_s^*≤ℓ≤ℓ_e^*,

(15)

Similarly, considering the definition of T_ℓ(d, Inline graphic ) along with equation (14),

(16)

Algorithm 2 can be modified to accommodate such stretches of non-polymorphic loci as a special case, making use of equations (12) and (14) to directly compute the values of Q_ℓ(d) and T_ℓ(d) at each non-polymorphic locus ℓ. If we then assume (without loss of generality) that each ( Inline graphic , ℓ_s, ℓ_e)∈ has ℓ_e at a polymorphic locus, then at the final non-polymorphic locus, for which ℓ=ℓ_e^*, equations (15) and (16) may be used to compute Q_ℓ(d, ) and T_ℓ(d, ) for each ∈ℓ. This modification is detailed in Algorithm 3.

Within Algorithm 3, the time complexity of Lines 5 and 8 is O(1), of Line 7 is O(m+Ψ_ℓ( Inline graphic )+ω_ℓ()) and of Line 10 is O(m). As a result, the time complexity of Lines 2–12, and of the dynamic program, is O(km²+m(Ψ_p()+Ω())). Relative to Algorithm, less computation needs to be done for non-polymorphic loci; thus, in the typical case of many non-polymorphic loci, this dynamic program will have a decreased running time. For Inline graphic =C_T, the time complexity is O(km²+k_pmn_u).

2.5.3 An optimization for non-polymorphic loci

The key recursions (12) and (14) for non-polymorphic loci can be written in matrix form as Inline graphic _ℓ=_ℓ (_ℓ+_ℓ) _ℓ−1 and _ℓ=_ℓ_ℓ_ℓ−1, where _ℓ and _ℓ are m-dimensional column vectors having dth entry Q_ℓ(d) and T_ℓ(d), respectively; _ℓ and _ℓ are (m×m)-dimensional diagonal matrices having (d×d)th entry ξ_ℓ(α[ℓ]|d) and y_ℓ−1^(d), respectively; and Inline graphic _ℓ is an (m×m)-dimensional matrix having (d, d′)th entry z_ℓ−1^(d′,d).

Now, suppose that the mutational model is symmetric and the mutation rate constant for all loci. Then, Inline graphic _ℓ= does not depend on ℓ, for all non-polymorphic loci ℓ. Similarly, if the recombination rate between each pair of loci is constant, then _ℓ= and _ℓ= do not depend on ℓ. With these assumptions, for ℓ_s^*≤ℓ≤ℓ_e^*,

(17)

and the values of ( Inline graphic (+))^r and ()^r can be pre-computed (either by eigenvalue decomposition or repeated multiplication) for a relevant range of r-values. Using this technique for explicitly computing only the necessary values of Q_ℓ(d) and T_ℓ(d), stretches of non-polymorphic loci can be ‘analytically’ skipped.

The modified dynamic program associated with this optimization is given in Algorithm 4. The time complexity of Line 4 is O(m) and of Line 6 is O(m+Ψ_ℓ( Inline graphic )+ω_ℓ()). Thus, the time complexity for the dynamic program is O(k_pm²+m(Ψ_p()+Ω())).

This refinement once again reduces the computation required for non-polymorphic loci, and so we might expect substantial improvements in performance over Algorithms 2 and 3. For the choice Inline graphic =C_T, the time complexity is O(k_pm(m+n_u)). Note that the assumptions necessary for Algorithm 4, namely a symmetric mutation model and uniform mutation and recombination rates, can be relaxed, but at the expense of additional pre-computation. For example, given non-uniform, but locally similar, recombination rates, pre-computation might be performed for each of several rates; each stretch of non-polymorphic loci could then use the pre-computed values associated with the closest recombination rate.

graphic file with name bts314i27.jpg

2.6 A fast algorithm for configuration partitions

In Section 2.5, we assumed a configuration partition Inline graphic to be specified and showed that, for Algorithms 2–4, the time complexity depends on through the functions Ψ() (or Ψ_p()) and Ω() and more particularly their sum. It is intuitively clear that a configuration partition minimizing Ω will maximize Ψ (as in C_T), and vice versa, and in general these quantities are inversely related; minimizing a convex combination of these quantities is therefore difficult. In this section, we propose a fast and simple parametrized algorithm for constructing reasonably good configuration partitions.

Given a configuration n, the algorithm proceeds sequentially over the loci: initially, let ℓ_s=1. Given ℓ_s, find the largest polymorphic locus ℓ_e such that ℓ_s≤ℓ_e≤k, and the number of unique sub-haplotypes between loci ℓ_s and ℓ_e is at most some threshold parameter s. Then, for each unique sub-haplotype x between ℓ_s and ℓ_e, group all h∈ Inline graphic _n such that h[ℓ_s : ℓ_e]=x into the same block and add (, ℓ_s, ℓ_e) to the configuration partition. Set ℓ_s=ℓ_e+1 and repeat until locus k is reached. An example configuration partition resulting from this algorithm is shown in Figure 2c.

Applying this procedure to configuration n with threshold parameter s results in a configuration partition which we denote C_s. Observe that for s=| Inline graphic _n|, we obtain C_s=C_T, which minimizes Ω. On the other hand, for s=2 (in the bi-allelic case), the algorithm produces a configuration partition that minimizes Ψ_p. Intermediate values of s produce the intermediate results that are of interest. In order to gauge the effect of different combinations of Ψ_p and Ω on the running time, the CSP Inline graphic was computed using the configuration partition C_s for each of several values of s, and the times recorded; the results are plotted in Figure 3. As our intuition suggested, the running time depends substantially on the choice of and, in accordance with the asymptotic time complexity results depends linearly on both Ψ_p and Ω. Furthermore, as anticipated, the values of Ψ_p and Ω are inversely related.

Fig. 3. — Empirically observed running time of Algorithm 4 used to compute , for a particular configuration n and an arbitrary α∈_n. Several values of s∈(2,…, 500) were used, and each circle corresponds to a particular value of s. The curve of circles demonstrates the trade-off between small Ψ_p (small s-values) and small Ω (large s-values). As predicted by the asymptotic time complexity results, running time appears to depend linearly on both Ψ_p- and Ω-values, and fitting a linear model indicates the constant associated with Ψ_p is ~1.5 times greater than the constant associated with Ω. The configuration n was generated using coalescent simulation for 500 individuals, each having 10⁵ bi-allelic loci, using population-scaled mutation rate θ=0.005 per locus and population-scaled recombination rate ρ=0.001 between each pair of adjacent loci, and resulting in k_p=1724 polymorphic loci and n_u=324 unique haplotypes

By fitting a linear model to the data, we can deduce the constants associated with Ψ_p and Ω, which the asymptotic results cannot provide. Although these constants will depend on the implementation and hardware, their ratio should be relatively stable, and therefore informative for choosing an optimal trade-off between Ψ_p and Ω. We find that the constant associated with Ψ_p is ~1.5 times that associated with Ω, suggesting that running time is minimized for a choice of Inline graphic that minimizes 1.5 · Ψ_p()+Ω(). Furthermore, making use of the above algorithm, we define

and C^*=C_s^*. In practice the value s^* is found using binary search and determining C^* is very fast. This definition will be used frequently in Section 4, as C^* (and the analogous result for Algorithm 1, using Ψ in place of Ψ_p) provides a good, though not necessarily optimal, choice for Inline graphic .

3 RESULTS

We have presented three optimized algorithms for computing the conditional sampling probability (CSP) Inline graphic . Briefly, Algorithms 2–4 rely on a partition of the configuration n. We have characterized optimal such partitions and proposed a simple and fast method for constructing good partitions =C^* (cf, Section 2.6). For the sake of comparison, we also consider the trivial partition =C_T. Relative to Algorithm 2, Algorithms 3 and 4 represent successive improvements in efficiency for non-polymorphic loci. In this section, we provide an empirical analysis of these algorithms and demonstrate that our optimizations yield a substantial speedup.

The optimized algorithms, along with their asymptotic time complexities, are summarized in Table 1. Intuitively, for a fixed number of haplotypes n, and assuming coarse homogeneity across the genome, the runtimes of each of these algorithms should be linear in the number of loci. We are interested in determining the constants associated with this linear behaviour for each algorithm. Note, however, that for the cases when Inline graphic =C_T, the time complexities do not depend on n directly, but rather the number of unique haplotypes n_u. For a particular value of n, the quantity n_u will increase with the number of loci under consideration until n_u=n; only at this point do the runtimes become linear in the number of loci. A similar argument can be made for a more general configuration partition Inline graphic . In order to attain and analyse the linear behaviour for the modestly sized configurations that are considered, we formally interpret even non-unique haplotypes to be unique, thereby forcing n_u=n.

Table 1.

A summary of the proposed algorithms along with their asymptotic time complexities

	=C_T	General
Algorithm 2	O(km(m+n_u))	O(km²+m(Ψ()+Ω()))
Algorithm 3	O(km²+k_pmn_u)	O(km²+m(Ψ_p()+Ω()))
Algorithm 4	O(k_pm(m+n_u))	O(k_pm²+m(Ψ_p()+Ω()))

Open in a new tab

Recall that Algorithm 2 with Inline graphic =C_T is equivalent to Algorithm 1.

We produce data using coalescent simulation: we assume a symmetric 2-allele model and with population-scaled mutation rate θ=0.005 per locus and population−scaled recombination rate ρ=0.001 between each pair of adjacent loci. For each of several values of n, we thus simulate an n-haplotype 2×10⁵-locus configuration n. We compute the partitions C_T and C^* and subsequently record the running time of each algorithm in computing Inline graphic , for an arbitrary haplotype α∈_n. Throughout, we use a time discretization consisting of m=16 intervals. The running times are plotted, on a logarithmic scale, as a function of n in Figure 4a and 4b, for =C_T and =C^*, respectively.

Fig. 4. — Log-scaled plots of the running time (in milliseconds) required to compute for n with 2×10⁵ loci and |n|=n, as a function n, for each of Algorithms 2–4. Configurations were generated using coalescent simulation as described in the text and results obtained on a single core of a MacPro with dual quad-core 3.0 GHz Xeon CPUs. (a) =C_T, the trivial configuration partition. (b) =C^*, the configuration partition described in Section 2.6

From Figure 4a, for which Inline graphic =C_T, it is clear that our refinements for non-polymorphic loci have practical benefits, as Algorithms 3 and 4 perform substantially better than Algorithm 2. The asymptotic results summarized in Table 1 suggest the running time of Algorithm 4 is a factor of k/k_p faster than Algorithm 2. This factor is roughly reflected in the logarithmic plot of Figure 4a as a vertical shift, with deviations occurring because k_p increases (slowly) with n. Similarly, as n increases, the asymptotic results indicate that computation is dominated by the O(k_pnm) term for both Algorithms 3 and 4; this is reflected in Figure 4a by a near identity in running times for these algorithms for larger values of n.

Comparing Figure 4b to a, the benefits of taking Inline graphic =C^* can be observed. For each algorithm, this optimization improves performance substantially, particularly as the number of haplotypes n increases. Given the results for Algorithm 3 in particular, it is clear that the key quantity Ψ_p()+Ω(), taken from Table 1, increases more slowly with n for Inline graphic =C^* than for =C_T. Finally, as in the previous case, the asymptotic results for general indicate that computation is dominated by the O(m(Ψ_p()+Ω())) term for both Algorithms 3 and 4; the associated convergence of running times appears to be occurring in Figure 4b, though more slowly than in Figure 4a; thus, Algorithm 4 is a practically useful alternative to Algorithm 3, even for larger values of n.

Although general trends are clear from Figure 4, the logarithmic scale makes it difficult to appreciate the magnitude of the effects of the optimizations. As mentioned earlier, assuming rough homogeneity over the genome, the computation time increases linearly with the number of loci. In Table 2, we summarize the constant associated with this linear behaviour as the time required to process 1×10⁵ loci, along with the speedup relative to the baseline, Algorithm 1. Observe that Algorithm 3, with Inline graphic =C^*, which can be applied in complete generality, provides a speedup of 15×, 250× and 546× for conditional sample sizes n=100, 2000, and 5000, respectively; and in most cases, Algorithm 4 can be applied, which increases these speedups to 320×, 1300× and 1845×. Importantly, the speedup increases with the conditional sample size n.

Table 2.

A summary of several key statistics from Figure 4

	=C_T
	n=100	n=2000	n=5000
Algorithm 2	45 (1.0×)	870 (1.0×)	2153 (1.0×)
Algorithm 3	3.5 (13×)	21 (41×)	54 (40×)
Algorithm 4	0.63 (71×)	18 (48×)	49 (44×)

	=C^*
	n=100	n=2000	n=5000

Algorithm 2	3.8 (12×)	7.8 (110×)	10.3 (208×)
Algorithm 3	3.0 (15×)	3.5 (250×)	3.9 (546×)
Algorithm 4	0.14 (320×)	0.68 (1300×)	1.17 (1845×)

Open in a new tab

The table indicates the time (in seconds) required to compute the CSP Inline graphic for |n|=n, per 1×10⁵ loci. The speed-up versus Algorithm 2 with =C_T, equivalent to the commonly used Algorithm 1, is given in parentheses.

4 DISCUSSION

We have presented a number of optimized algorithms for computing the CSP Inline graphic . Our optimizations are based on two intuitive observations: first, the number of unique haplotypes in a genomic sample is dramatically reduced within relatively short regions and second, the large number of non-polymorphic loci in a genomic sample, though informative, do not distinguish between haplotypes. These observations are formalized and leveraged to refine the recursive equations for computing Inline graphic , yielding optimized, yet exact, algorithms.

We have described our optimization algorithms in the context of the HMM associated with the CSD Inline graphic proposed by Paul et al. (2011). It is natural to question whether similar optimizations are applicable to related CSDs, such as those proposed by Fearnhead and Donnelly (2001) and by Li and Stephens (2003). In Section 2, we described two sufficient conditions: Property 1, which stipulates that, upon recombination, a new hidden haplotype is chosen independently and uniformly at random and Property 2, which stipulates that the emission distribution depends only on the allele at the current locus of the hidden haplotype. The aforementioned CSDs do satisfy both of these properties; in particular, stronger forms of Property 1 hold for both CSDs, enabling additional optimizations. We have not empirically analysed the resulting optimized algorithms, but by considering the resulting asymptotic time complexities, analogous to those in Table 1, we anticipate that the speedups obtained will be qualitatively comparable to those observed for Inline graphic , though the corresponding magnitudes are difficult to estimate.

It is also interesting to consider CSDs for more complex demographic scenarios. A theoretically straightforward extension of Inline graphic to variable population size, for example, will continue to satisfy both Properties 1 and 2 and will therefore be amenable to very similar optimizations. On the other hand, extension to structured populations, populations that are divided into several demes between which there is limited migration, will not satisfy Property 1 as the new hidden haplotype chosen upon recombination depends on the deme in which recombination occurs. Nonetheless, a relaxed version of Property 1 will be satisfied along with Property 2, and we anticipate analogous optimizations will be possible. The outcome is similar if Inline graphic is extended to conditionally sampling diploid, rather than haploid, individuals.

Related optimizations may be possible in other contexts as well. For example, Li and Durbin (2011) make use of a population genetic HMM which satisfies conditions that correspond to Properties 1 and 2 and is used to analyse genomic data. It is interesting to note that, in order to make their method practicable, Li and Durbin (2011) consider non-overlapping 100 bp windows as their set of loci; using the optimization detailed in this article may render such compromises unnecessary. It is less clear whether our optimizations are applicable to other population genetic HMMs, such as those considered by Hobolth et al. (2007) and Dutheil et al. (2009); nonetheless, we hope that our work will foster progress in this area.

We conclude by recalling that a broad range of population genetic methods have been developed and will continue to be developed, based on the CSD. These methods are generally computationally intensive, and approximations are often made on the basis of efficiency and at the expense of accuracy; with the advent of inexpensive genomic sequencing, such computational problems will be compounded. We have introduced several optimizations for CSD computation that can potentially speed up this computation by several orders of magnitude without introducing additional approximations. We believe that these optimizations will enable analyses that were previously impracticable, particularly for large genomic datasets. We also hope that the optimizations will encourage more accurate methods, and in particular more accurate CSDs, to be developed and used for population genomic inference.

Supplementary Material

Supplementary Data

supp_28_15_2008__index.html^{(772B, html)}

Acknowledgements

We thank Anand Bhaskar and Matthias Steinrücken for fruitful discussion on both the theoretical and practical aspects of this work.

Funding: NIH National Research Service Award Trainee appointment (T32-HG00047) to JSP and an NSF CAREER Grant (DBI-0846015) and a Packard Fellowship for Science and Engineering to YSS.

Conflict of Interest: none declared.

REFERENCES

Browning B.L., Browning S.R. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987. [DOI] [PMC free article] [PubMed] [Google Scholar]
Crawford D.C., et al. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 2004;36:700–706. doi: 10.1038/ng1376. [DOI] [PubMed] [Google Scholar]
Davison D., et al. An approximate likelihood for genetic data under a model with recombination and population splitting. Theor. Popul. Biol. 2009;75:331–345. doi: 10.1016/j.tpb.2009.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
De Iorio M., Griffiths R.C. Importance sampling on coalescent histories. I. Adv. Appl. Prob. 2004a;36:417–433. [Google Scholar]
De Iorio M., Griffiths R.C. Importance sampling on coalescent histories. II: Subdivided population models. Adv. Appl. Prob. 2004b;36:434–454. [Google Scholar]
Delaneau O., et al. A linear complexity phasing method for thousands of genomes. Nat. Methods. 2012;9:179–181. doi: 10.1038/nmeth.1785. [DOI] [PubMed] [Google Scholar]
Dutheil J.Y., et al. Ancestral population genomics: the coalescent hidden markov model approach. Genetics. 2009;183:259–274. doi: 10.1534/genetics.109.103010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fearnhead P., Donnelly P. Estimating recombination rates from population genetic data. Genetics. 2001;159:1299–1318. doi: 10.1093/genetics/159.3.1299. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fearnhead P., Donnelly P. Approximate likelihood methods for estimating local recombination rates. J. Royal Stat. Soc. B. 2002;64:657–680. [Google Scholar]
Fearnhead P., Smith N.G. A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet. 2005;77:781–794. doi: 10.1086/497579. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gay J., et al. Estimating meiotic gene conversion rates from population genetic data. Genetics. 2007;177:881–894. doi: 10.1534/genetics.107.078907. [DOI] [PMC free article] [PubMed] [Google Scholar]
Griffiths R.C., et al. Importance sampling and the two-locus model with subdivided population structure. Adv. Appl. Probab. 2008;40:473–500. doi: 10.1239/aap/1214950213. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hellenthal G., et al. Inferring human colonization history using a copying model. PLoS Genet. 2008;4:e1000078. doi: 10.1371/journal.pgen.1000078. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hobolth A., et al. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden markov model. PLoS Genet. 2007;3:e7. doi: 10.1371/journal.pgen.0030007. [DOI] [PMC free article] [PubMed] [Google Scholar]
Howie B.N., et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hudson R.R. Two-locus sampling distributions and their application. Genetics. 2001;159:1805–1817. doi: 10.1093/genetics/159.4.1805. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lawson D., et al. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H., Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li N., Stephens M. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li Y., et al. Mach: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marchini J., et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913. doi: 10.1038/ng2088. [DOI] [PubMed] [Google Scholar]
Marjoram P., Wall J.D. Fast “coalescent” simulation. BMC Genet. 2006;7:16. doi: 10.1186/1471-2156-7-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
McVean G.A.T., et al. The fine-scale structure of recombination rate variation in the human genome. Science. 2004;304:581–584. doi: 10.1126/science.1092500. [DOI] [PubMed] [Google Scholar]
McVean G.A., Cardin N.J. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673. [DOI] [PMC free article] [PubMed] [Google Scholar]
Paul J.S., Song Y.S. A principled approach to deriving approximate conditional sampling distributions in population genetics models with recombination. Genetics. 2010;186:321–338. doi: 10.1534/genetics.110.117986. [DOI] [PMC free article] [PubMed] [Google Scholar]
Paul J.S., et al. An accurate sequentially markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187:1115–1128. doi: 10.1534/genetics.110.125534. [DOI] [PMC free article] [PubMed] [Google Scholar]
Price A.L., et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519. doi: 10.1371/journal.pgen.1000519. [DOI] [PMC free article] [PubMed] [Google Scholar]
Scheet P., Stephens M. A fast and flexible method for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 2006;78:629–644. doi: 10.1086/502802. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stephens M., Donnelly P. Inference in molecular population genetics. J. R. Stat. Soc. Ser. B Stat. Methodol. 2000;62:605–655. [Google Scholar]
Stephens M., Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 2005;76:449–462. doi: 10.1086/428594. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sundquist A., et al. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res. 2008;18:676–682. doi: 10.1101/gr.072850.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wegmann D., et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nat. Genet. 2011;43:847–853. doi: 10.1038/ng.894. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yin J., et al. Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data. Bioinformatics. 2009;25:i231–i239. doi: 10.1093/bioinformatics/btp229. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

supp_28_15_2008__index.html^{(772B, html)}

supp_bts314_supplementary.pdf^{(129.4KB, pdf)}

[B1] Browning B.L., Browning S.R. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] Crawford D.C., et al. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 2004;36:700–706. doi: 10.1038/ng1376. [DOI] [PubMed] [Google Scholar]

[B3] Davison D., et al. An approximate likelihood for genetic data under a model with recombination and population splitting. Theor. Popul. Biol. 2009;75:331–345. doi: 10.1016/j.tpb.2009.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] De Iorio M., Griffiths R.C. Importance sampling on coalescent histories. I. Adv. Appl. Prob. 2004a;36:417–433. [Google Scholar]

[B5] De Iorio M., Griffiths R.C. Importance sampling on coalescent histories. II: Subdivided population models. Adv. Appl. Prob. 2004b;36:434–454. [Google Scholar]

[B6] Delaneau O., et al. A linear complexity phasing method for thousands of genomes. Nat. Methods. 2012;9:179–181. doi: 10.1038/nmeth.1785. [DOI] [PubMed] [Google Scholar]

[B7] Dutheil J.Y., et al. Ancestral population genomics: the coalescent hidden markov model approach. Genetics. 2009;183:259–274. doi: 10.1534/genetics.109.103010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] Fearnhead P., Donnelly P. Estimating recombination rates from population genetic data. Genetics. 2001;159:1299–1318. doi: 10.1093/genetics/159.3.1299. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] Fearnhead P., Donnelly P. Approximate likelihood methods for estimating local recombination rates. J. Royal Stat. Soc. B. 2002;64:657–680. [Google Scholar]

[B10] Fearnhead P., Smith N.G. A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet. 2005;77:781–794. doi: 10.1086/497579. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] Gay J., et al. Estimating meiotic gene conversion rates from population genetic data. Genetics. 2007;177:881–894. doi: 10.1534/genetics.107.078907. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] Griffiths R.C., et al. Importance sampling and the two-locus model with subdivided population structure. Adv. Appl. Probab. 2008;40:473–500. doi: 10.1239/aap/1214950213. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] Hellenthal G., et al. Inferring human colonization history using a copying model. PLoS Genet. 2008;4:e1000078. doi: 10.1371/journal.pgen.1000078. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B14] Hobolth A., et al. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden markov model. PLoS Genet. 2007;3:e7. doi: 10.1371/journal.pgen.0030007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] Howie B.N., et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] Hudson R.R. Two-locus sampling distributions and their application. Genetics. 2001;159:1805–1817. doi: 10.1093/genetics/159.4.1805. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] Lawson D., et al. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B18] Li H., Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] Li N., Stephens M. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B20] Li Y., et al. Mach: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. doi: 10.1002/gepi.20533. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] Marchini J., et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913. doi: 10.1038/ng2088. [DOI] [PubMed] [Google Scholar]

[B22] Marjoram P., Wall J.D. Fast “coalescent” simulation. BMC Genet. 2006;7:16. doi: 10.1186/1471-2156-7-16. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] McVean G.A.T., et al. The fine-scale structure of recombination rate variation in the human genome. Science. 2004;304:581–584. doi: 10.1126/science.1092500. [DOI] [PubMed] [Google Scholar]

[B24] McVean G.A., Cardin N.J. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] Paul J.S., Song Y.S. A principled approach to deriving approximate conditional sampling distributions in population genetics models with recombination. Genetics. 2010;186:321–338. doi: 10.1534/genetics.110.117986. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] Paul J.S., et al. An accurate sequentially markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187:1115–1128. doi: 10.1534/genetics.110.125534. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] Price A.L., et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519. doi: 10.1371/journal.pgen.1000519. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B28] Scheet P., Stephens M. A fast and flexible method for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 2006;78:629–644. doi: 10.1086/502802. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] Stephens M., Donnelly P. Inference in molecular population genetics. J. R. Stat. Soc. Ser. B Stat. Methodol. 2000;62:605–655. [Google Scholar]

[B30] Stephens M., Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 2005;76:449–462. doi: 10.1086/428594. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B31] Sundquist A., et al. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res. 2008;18:676–682. doi: 10.1101/gr.072850.107. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B32] Wegmann D., et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nat. Genet. 2011;43:847–853. doi: 10.1038/ng.894. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B33] Yin J., et al. Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data. Bioinformatics. 2009;25:i231–i239. doi: 10.1093/bioinformatics/btp229. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Blockwise HMM computation for large-scale population genomic inference

Joshua S Paul

Yun S Song

Abstract

1 INTRODUCTION

2 METHODS

2.1 Notation

2.2 A brief description of

Fig. 1.

2.3 Computation with

2.4 Improving efficiency through the transition distribution

Property 1. —

2.5 Improving efficiency through the emission distribution

Property 2. —

Fig. 2.

2.5.1 A general configuration partition

2.5.2 The absence of polymorphism

2.5.3 An optimization for non-polymorphic loci

2.6 A fast algorithm for configuration partitions

Fig. 3.

3 RESULTS

Table 1.

Fig. 4.

Table 2.

4 DISCUSSION

Supplementary Material

Acknowledgements

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Blockwise HMM computation for large-scale population genomic inference

Joshua S Paul

Yun S Song

Abstract

1 INTRODUCTION

2 METHODS

2.1 Notation

2.2 A brief description of

Fig. 1.

2.3 Computation with

2.4 Improving efficiency through the transition distribution

Property 1. —

2.5 Improving efficiency through the emission distribution

Property 2. —

Fig. 2.

2.5.1 A general configuration partition

2.5.2 The absence of polymorphism

2.5.3 An optimization for non-polymorphic loci

2.6 A fast algorithm for configuration partitions

Fig. 3.

3 RESULTS

Table 1.

Fig. 4.

Table 2.

4 DISCUSSION

Supplementary Material

Acknowledgements

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases