A new statistical method for haplotype reconstruction from population data

doi:10.1086/319501

Am J Hum Genet. 2001 Apr;68(4):978-89.

doi: 10.1086/319501. Epub 2001 Mar 9.

M Stephens¹, N J Smith, P Donnelly

PMID: 11254454
PMCID: PMC1275651
DOI: 10.1086/319501

M Stephens et al. Am J Hum Genet. 2001 Apr.

Affiliation

¹ Department of Statistics, University of Oxford.

Abstract

Current routine genotyping methods typically do not provide haplotype information, which is essential for many analyses of fine-scale molecular-genetics data. Haplotypes can be obtained, at considerable cost, experimentally or (partially) through genotyping of additional family members. Alternatively, a statistical method can be used to infer phase and to reconstruct haplotypes. We present a new statistical method, applicable to genotype data at linked loci from a population sample, that improves substantially on current algorithms; often, error rates are reduced by > 50%, relative to its nearest competitor. Furthermore, our algorithm performs well in absolute terms, suggesting that reconstructing haplotypes experimentally or by genotyping additional family members may be an inefficient use of resources.
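
To make the phase ambiguity concrete, the following minimal Python sketch enumerates the haplotype pairs compatible with one unphased genotype at biallelic sites. The 0/1/2 genotype coding and the function name `compatible_haplotype_pairs` are assumptions made for illustration. This is not the authors' reconstruction method, only the search space from which any such method must choose.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1, or 2, the number of copies of allele "1"
    carried at each biallelic site (an assumed coding, for illustration).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes: 0 -> allele 0, 2 -> allele 1.
    # Heterozygous sites get a placeholder 0 that is overwritten below.
    base = [g // 2 for g in genotype]

    if not het_sites:
        hap = tuple(base)
        return [(hap, hap)]

    # Pin the first heterozygous site so that (a, b) and (b, a) are not both listed.
    first, rest = het_sites[0], het_sites[1:]
    pairs = []
    for choice in product((0, 1), repeat=len(rest)):
        hap_a, hap_b = list(base), list(base)
        hap_a[first], hap_b[first] = 0, 1
        for site, allele in zip(rest, choice):
            hap_a[site], hap_b[site] = allele, 1 - allele
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs


if __name__ == "__main__":
    for pair in compatible_haplotype_pairs([1, 1, 0, 2, 1]):
        print(pair)
```

An individual heterozygous at k sites has 2^(k-1) compatible pairs, so the ambiguity grows quickly with the number of heterozygous sites.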

Figures

**Figure 1**
Illustration of how our method uses the fact that unresolved haplotypes tend to be similar to known haplotypes. Suppose that we have a list of haplotypes, as shown on the left side of the figure, that are known without error (e.g., from family data or because some individuals are homozygous). Then, intuitively, the most likely pair of haplotypes for ambiguous individual 1 consists of two haplotypes that have high population frequency, as shown. All methods considered here will correctly identify this as the most likely reconstruction. However, ambiguous individual 2 cannot possess any of the haplotypes in the known list, and the most plausible reconstruction for this individual consists of two haplotypes that are *similar, but not identical, to* two haplotypes that have high population frequency, as shown. Of the methods considered here, only our method uses this kind of information, leading to the improved performance we observed.

**Figure 2**
Comparison of accuracy of our method (*solid line*) versus EM (*dotted line*) and Clark’s method (*dashed line*) for short sequence data (∼5–30 segregating sites). *Top row,* mean error rate (defined in text) for haplotype reconstruction. *Bottom row,* mean discrepancy (defined in text) for estimation of haplotype frequencies. We simulated data sets of 2n haplotypes, randomly paired to form n genotypes, under an infinite-sites model, with θ=4 and different assumptions about the local recombination rate R (R and θ are defined in the note to table 1), using a coalescent-based program kindly provided by R. R. Hudson. For each combination of parameters considered, we generated 100 independent data sets and discarded those data sets for which the total number of possible haplotypes was >10⁵ (the limit of our implementation of the EM algorithm), which typically left >90 data sets on which to compare the methods. Each point thus represents an average over 90–100 simulated data sets. Horizontal lines above and below each point show approximate 95% confidence intervals for this average (±2 standard errors). The results for Clark’s algorithm for R=40 are omitted, as we had difficulty getting the algorithm to consistently provide a unique haplotype reconstruction for these data.
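
The interval construction described in this caption (an average over simulated data sets, ±2 standard errors) can be sketched as follows. The input values are made-up placeholders, not results from the paper, and `mean_with_interval` is a hypothetical helper name.

```python
import math
import statistics


def mean_with_interval(values):
    """Return (mean, lower, upper) using mean +/- 2 standard errors."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - 2 * standard_error, mean + 2 * standard_error


if __name__ == "__main__":
    # Placeholder per-data-set error rates, for illustration only.
    placeholder_error_rates = [0.10, 0.20, 0.30]
    mean, lower, upper = mean_with_interval(placeholder_error_rates)
    print(f"mean={mean:.3f}, approx. 95% interval=({lower:.3f}, {upper:.3f})")
```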

**Figure 3**
Comparison of accuracy of our method (*solid line*) versus EM (*dotted line*) for microsatellite data. *Top row,* mean error rate (defined in text) for haplotype reconstruction. *Bottom row,* mean discrepancy (defined in text) for estimation of haplotype frequencies. We simulated data sets of 2n haplotypes, randomly paired to form n genotypes, for 10 equally spaced linked microsatellite loci, from a constant-sized population, under a symmetric stepwise mutation model, using a coalescent-based program kindly provided by P. N. Fearnhead. We assumed θ=4N_eμ=8 (where μ is the per-generation mutation rate per locus, assumed to be constant across loci) and various values for the scaled recombination rate between neighboring loci, R=4N_er, where N_e is the effective population size and r is the genetic distance, in Morgans, between loci. For example, for humans, assuming N_e=10⁴, and the genomewide average recombination rate, 1 cM = 1 Mb, the right-hand column would correspond to 10 kb between loci. For each combination of parameters considered, we generated 100 independent data sets. Each point thus represents an average over 100 simulated data sets. Horizontal lines above and below each point show approximate 95% confidence intervals for this average (±2 standard errors). We had difficulty getting Clark’s algorithm to consistently provide a unique haplotype reconstruction for these data.
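
The worked example in this caption (N_e=10⁴, 1 cM = 1 Mb, 10 kb between loci) follows from R=4N_er once the physical distance is converted to Morgans. A minimal sketch of that arithmetic, where the function and parameter names are illustrative assumptions:

```python
def scaled_recombination_rate(effective_size, distance_bp, cm_per_mb=1.0):
    """Compute R = 4 * N_e * r, with r the genetic distance in Morgans."""
    distance_mb = distance_bp / 1_000_000
    # 100 centimorgans make one Morgan.
    distance_morgans = distance_mb * cm_per_mb / 100
    return 4 * effective_size * distance_morgans


if __name__ == "__main__":
    # Caption's example: N_e = 10^4, 1 cM per Mb, loci 10 kb apart.
    R = scaled_recombination_rate(effective_size=10_000, distance_bp=10_000)
    print(f"R = {R:.2f}")
```

Under these assumptions, 10 kb corresponds to 10⁻⁴ Morgans, so R = 4 × 10⁴ × 10⁻⁴ = 4.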

**Figure 4**
Graph of error rates using our method (*solid line*) and the EM algorithm (*dotted line*) for each of the 100 simulated microsatellite data sets with n=50 and R=4. The EM algorithm gives a smaller error rate than our method for only 3 of the 100 data sets.

Comment in

Comparisons of two methods for haplotype reconstruction and haplotype frequency estimation from population data.
Zhang S, Pakstis AJ, Kidd KK, Zhao H. Am J Hum Genet. 2001 Oct;69(4):906-14. doi: 10.1086/323622. PMID: 11536083. Free PMC article. No abstract available.

Cited by

High diversity and no significant selection signal of human ADH1B gene in Tibet.
Lu Y, Kang L, Hu K, Wang C, Sun X, Chen F, Kidd JR, Kidd KK, Li H. Investig Genet. 2012 Nov 23;3(1):23. doi: 10.1186/2041-2223-3-23. PMID: 23176670. Free PMC article.
Beringian sub-refugia revealed in blackfish (Dallia): implications for understanding the effects of Pleistocene glaciations on Beringian taxa and other Arctic aquatic fauna.
Campbell MA, Takebayashi N, López JA. BMC Evol Biol. 2015 Jul 19;15:144. doi: 10.1186/s12862-015-0413-2. PMID: 26187279. Free PMC article.
Putting pleiotropy and selection into context defines a new paradigm for interpreting genetic data.
Predazzi IM, Rokas A, Deinard A, Schnetz-Boutaud N, Williams ND, Bush WS, Tacconelli A, Friedrich K, Fazio S, Novelli G, Haines JL, Sirugo G, Williams SM. Circ Cardiovasc Genet. 2013 Jun;6(3):299-307. doi: 10.1161/CIRCGENETICS.113.000126. Epub 2013 Apr 24. PMID: 23616601. Free PMC article.
Disparate Modes of Evolution Shaped Modern Prion (PRNP) and Prion-Related Doppel (PRND) Variation in Domestic Cattle.
Brunelle BW, O'Grady AM, Nicholson EM, Seabury CM. PLoS One. 2016 May 25;11(5):e0155924. doi: 10.1371/journal.pone.0155924. eCollection 2016. PMID: 27224046. Free PMC article.
Importance of HBsAg recognition by HLA molecules as revealed by responsiveness to different hepatitis B vaccines.
Nishida N, Sugiyama M, Ohashi J, Kawai Y, Khor SS, Nishina S, Yamasaki K, Yazaki H, Okudera K, Tamori A, Eguchi Y, Sakai A, Kakisaka K, Sawai H, Tsuchiura T, Ishikawa M, Hino K, Sumazaki R, Takikawa Y, Kanda T, Yokosuka O, Yatsuhashi H, Tokunaga K, Mizokami M. Sci Rep. 2021 Mar 2;11(1):3703. doi: 10.1038/s41598-021-82986-8. PMID: 33654122. Free PMC article.

References

Electronic-Database Information

1. Fearnhead PN, Donnelly P. Estimating recombination rates from population genetic data. Available from http://www.stats.ox.ac.uk/~fhead
2. Oxford Mathematical Genetics Group Web site, http://www.stats.ox.ac.uk/mathgen/software.html (for software implementing the authors' general method)

References

1. Clark AG (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111–122
2. Donnelly P (1986) Partition structures, Polya Urns, the Ewens sampling formula, and the age of alleles. Theor Popul Biol 30:271–288
3. Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921–927
4. ——— (1998) Incorporating genotypes of relatives into a test of linkage disequilibrium. Am J Hum Genet 62:171–180
5. Fallin D, Schork NJ (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 67:947–959

Grants and funding

WT_/Wellcome Trust/United Kingdom
