Am J Hum Genet. 2001 Apr;68(4):978-89.
doi: 10.1086/319501. Epub 2001 Mar 9.

A new statistical method for haplotype reconstruction from population data

M Stephens et al. Am J Hum Genet. 2001 Apr.

Abstract

Current routine genotyping methods typically do not provide haplotype information, which is essential for many analyses of fine-scale molecular-genetics data. Haplotypes can be obtained, at considerable cost, experimentally or (partially) through genotyping of additional family members. Alternatively, a statistical method can be used to infer phase and to reconstruct haplotypes. We present a new statistical method, applicable to genotype data at linked loci from a population sample, that improves substantially on current algorithms; often, error rates are reduced by > 50%, relative to its nearest competitor. Furthermore, our algorithm performs well in absolute terms, suggesting that reconstructing haplotypes experimentally or by genotyping additional family members may be an inefficient use of resources.

Figures

Figure 1
Illustration of how our method uses the fact that unresolved haplotypes tend to be similar to known haplotypes. Suppose that we have a list of haplotypes, as shown on the left side of the figure, that are known without error (e.g., from family data or because some individuals are homozygous). Then, intuitively, the most likely pair of haplotypes for ambiguous individual 1 consists of two haplotypes that have high population frequency, as shown. All methods considered here will correctly identify this as the most likely reconstruction. However, ambiguous individual 2 cannot possess any of the haplotypes in the known list, and the most plausible reconstruction for this individual consists of two haplotypes that are similar, but not identical, to two haplotypes that have high population frequency, as shown. Of the methods considered here, only our method uses this kind of information, leading to the improved performance we observed.
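The ambiguity described in this caption can be made concrete with a minimal sketch that lists every haplotype pair consistent with one unphased genotype. The genotype coding (0 and 2 for the two homozygous states, 1 for heterozygous, biallelic sites only) and the function name are assumptions made for this illustration; the sketch only enumerates candidate reconstructions and is not the authors' method for choosing among them.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    Assumed coding per site: 0 = homozygous for allele 0,
    1 = heterozygous, 2 = homozygous for allele 1.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are the same on both haplotypes.
    base = [1 if g == 2 else 0 for g in genotype]
    if not het_sites:
        hap = tuple(base)
        return [(hap, hap)]
    # Fix the phase at the first heterozygous site so that (a, b) and
    # (b, a) are not both counted; k heterozygous sites give 2**(k - 1) pairs.
    first, rest = het_sites[0], het_sites[1:]
    pairs = []
    for alleles in product((0, 1), repeat=len(rest)):
        hap_a, hap_b = list(base), list(base)
        hap_a[first], hap_b[first] = 0, 1
        for site, allele in zip(rest, alleles):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs


# An individual heterozygous at sites 0 and 2 has two candidate reconstructions.
example_pairs = compatible_haplotype_pairs((1, 0, 1, 2))
```

An individual with at most one heterozygous site has a single candidate pair, which is why such individuals can contribute haplotypes that are known without error.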
Figure 2
Comparison of accuracy of our method (solid line) versus EM (dotted line) and Clark’s method (dashed line) for short sequence data (∼5–30 segregating sites). Top row, mean error rate (defined in text) for haplotype reconstruction. Bottom row, mean discrepancy (defined in text) for estimation of haplotype frequencies. We simulated data sets of 2n haplotypes, randomly paired to form n genotypes, under an infinite-sites model, with θ = 4 and different assumptions about the local recombination rate R (R and θ are defined in the note to table 1), using a coalescent-based program kindly provided by R. R. Hudson. For each combination of parameters considered, we generated 100 independent data sets and discarded those data sets for which the total number of possible haplotypes was >10^5 (the limit of our implementation of the EM algorithm), which typically left >90 data sets on which to compare the methods. Each point thus represents an average over 90–100 simulated data sets. Horizontal lines above and below each point show approximate 95% confidence intervals for this average (±2 standard errors). The results for Clark’s algorithm for R = 40 are omitted, as we had difficulty getting the algorithm to consistently provide a unique haplotype reconstruction for these data.
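Two steps in this caption are simple enough to sketch: randomly pairing 2n simulated haplotypes into n unphased genotypes, and summarizing per-data-set results as a mean with an approximate 95% interval of ±2 standard errors. The function names and the allele-count genotype coding are assumptions made for this illustration. The haplotypes themselves would come from a coalescent simulator, which is not reproduced here.

```python
import math
import random
import statistics


def pair_haplotypes(haplotypes, rng):
    """Randomly pair 2n haplotypes (tuples of 0/1 alleles) into n genotypes.

    Each genotype is stored as per-site allele counts (0, 1, or 2),
    which discards the phase information, as in unphased genotype data.
    """
    if len(haplotypes) % 2 != 0:
        raise ValueError("need an even number of haplotypes")
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    genotypes = []
    for i in range(0, len(shuffled), 2):
        first, second = shuffled[i], shuffled[i + 1]
        genotypes.append(tuple(a + b for a, b in zip(first, second)))
    return genotypes


def mean_with_interval(values):
    """Return (mean, lower, upper), where the bounds are mean -/+ 2 standard errors."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - 2 * standard_error, mean + 2 * standard_error
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the pairing reproducible, and `mean_with_interval` would be applied to the list of error rates (one per simulated data set) behind each plotted point.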
Figure 3
Comparison of accuracy of our method (solid line) versus EM (dotted line) for microsatellite data. Top row, mean error rate (defined in text) for haplotype reconstruction. Bottom row, mean discrepancy (defined in text) for estimation of haplotype frequencies. We simulated data sets of 2n haplotypes, randomly paired to form n genotypes, for 10 equally spaced linked microsatellite loci, from a constant-sized population, under a symmetric stepwise mutation model, using a coalescent-based program kindly provided by P. N. Fearnhead. We assumed θ = 4N_e μ = 8 (where μ is the per-generation mutation rate per locus, assumed to be constant across loci) and various values for the scaled recombination rate between neighboring loci, R = 4N_e r, where N_e is the effective population size and r is the genetic distance, in Morgans, between loci. For example, for humans, assuming N_e = 10^4 and the genomewide average recombination rate, 1 cM = 1 Mb, the right-hand column would correspond to 10 kb between loci. For each combination of parameters considered, we generated 100 independent data sets. Each point thus represents an average over 100 simulated data sets. Horizontal lines above and below each point show approximate 95% confidence intervals for this average (±2 standard errors). We had difficulty getting Clark’s algorithm to consistently provide a unique haplotype reconstruction for these data.
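The conversion from physical distance to the scaled recombination rate in this caption is short arithmetic, sketched below. The function name and its parameters are assumptions made for this illustration, and genetic distance is taken to be proportional to physical distance, which is only reasonable over short intervals.

```python
def scaled_recombination_rate(effective_size, distance_bp, cm_per_mb=1.0):
    """Return R = 4 * N_e * r for two loci separated by distance_bp base pairs.

    r is the genetic distance in Morgans, obtained by assuming a constant
    recombination rate of cm_per_mb centimorgans per megabase.
    """
    distance_mb = distance_bp / 1_000_000
    r_morgans = distance_mb * cm_per_mb / 100  # 100 cM per Morgan
    return 4 * effective_size * r_morgans


# Human example from the caption: N_e = 10^4, 1 cM per Mb, 10 kb between loci.
human_example = scaled_recombination_rate(10_000, 10_000)
```

With these values, 10 kb corresponds to r = 10^-4 Morgans, so R = 4 × 10^4 × 10^-4, which is approximately 4 up to floating-point rounding.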
Figure 4
Graph of error rates using our method (solid line) and the EM algorithm (dotted line) for each of the 100 simulated microsatellite data sets with n=50 and R=4. The EM algorithm gives a smaller error rate than our method for only 3 of the 100 data sets.

References

Electronic-Database Information

    1. Fearnhead PN, Donnelly P. Estimating recombination rates from population genetic data. Available from http://www.stats.ox.ac.uk/~fhead
    2. Oxford Mathematical Genetics Group Web site, http://www.stats.ox.ac.uk/mathgen/software.html (for software implementing the authors' general method)

References

    1. Clark AG (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111–122
    2. Donnelly P (1986) Partition structures, Polya Urns, the Ewens sampling formula, and the age of alleles. Theor Popul Biol 30:271–288
    3. Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921–927
    4. Excoffier L, Slatkin M (1998) Incorporating genotypes of relatives into a test of linkage disequilibrium. Am J Hum Genet 62:171–180
    5. Fallin D, Schork NJ (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 67:947–959
