Dindel: accurate indel calls from short-read data

doi:10.1101/gr.112326.110

Genome Res. 2011 Jun;21(6):961-73.

doi: 10.1101/gr.112326.110. Epub 2010 Oct 27.

Cornelis A Albers¹, Gerton Lunter, Daniel G MacArthur, Gilean McVean, Willem H Ouwehand, Richard Durbin

PMID: 20980555
PMCID: PMC3106329
DOI: 10.1101/gr.112326.110

Cornelis A Albers et al. Genome Res. 2011 Jun.

Affiliation

¹ Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, United Kingdom.

Abstract

Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.
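The Bayesian step described in the abstract can be illustrated with a toy genotype posterior. This is a minimal sketch under stated assumptions, not Dindel's implementation: the function name `genotype_posteriors`, the haplotype labels, the per-read likelihoods, and the prior values are all hypothetical, and the sketch ignores the mapping-error and homopolymer error terms that the abstract says the real model includes. It assumes each read is equally likely to come from either haplotype of a diploid genotype.

```python
import math

def genotype_posteriors(read_likelihoods, prior):
    """Posterior over diploid genotypes given per-read haplotype likelihoods.

    read_likelihoods: list of dicts mapping haplotype -> P(read | haplotype)
    prior: dict mapping (hap1, hap2) -> prior probability of that genotype
    """
    log_post = {}
    for (h1, h2), p in prior.items():
        total = math.log(p)
        for lik in read_likelihoods:
            # A read comes from either haplotype with probability 1/2.
            total += math.log(0.5 * lik[h1] + 0.5 * lik[h2])
        log_post[(h1, h2)] = total
    # Normalise in log space to avoid underflow with many reads.
    peak = max(log_post.values())
    norm = sum(math.exp(v - peak) for v in log_post.values())
    return {g: math.exp(v - peak) / norm for g, v in log_post.items()}

# Hypothetical data: three reads fit the indel haplotype, three fit the reference.
reads = [{"ref": 0.01, "indel": 0.9}] * 3 + [{"ref": 0.9, "indel": 0.01}] * 3
prior = {("ref", "ref"): 0.998, ("ref", "indel"): 0.0015, ("indel", "indel"): 0.0005}
posterior = genotype_posteriors(reads, prior)
best = max(posterior, key=posterior.get)
print(best)
```

With evenly split read support, the heterozygous genotype receives most of the posterior mass in this example even though its prior is small.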

Figures

**Figure 1.**
Outline of the Dindel algorithm.

**Figure 2.**
Procedure for generation of candidate haplotypes. We first consider the empirical distribution of bases determined from the initial alignments of reads to the reference and infer a heuristic haplotype block model to preserve sequences that always occur together in one read. We then choose n block-haplotypes with the highest empirical frequency, and generate candidate haplotypes by considering all combinations of these n block-haplotypes. The number of candidate haplotypes obtained this way is thus 2ⁿ. It is possible that multiple subhaplotypes from the same block are chosen. In the second step, all candidate variants (most importantly, the candidate indels) are added to these n candidate haplotypes, resulting in a set of, at most, k · 2ⁿ candidate haplotypes, where k is the number of candidate variants tested.
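The counting in this caption can be checked with a small enumeration. The sketch below is only an illustration of the 2ⁿ and k · 2ⁿ bounds, assuming each of the n chosen block-haplotypes is simply present or absent in a combination; the function name `enumerate_candidates` and the block and variant labels are hypothetical and do not reflect Dindel's internal data structures.

```python
from itertools import product

def enumerate_candidates(block_haplotypes, candidate_variants):
    """Pair every combination of block-haplotypes with every candidate variant."""
    n = len(block_haplotypes)
    # Step 1: all 2**n present/absent combinations of the n block-haplotypes.
    base_sets = [
        frozenset(b for b, keep in zip(block_haplotypes, mask) if keep)
        for mask in product([False, True], repeat=n)
    ]
    # Step 2: add each of the k candidate variants to each combination.
    return [(base, variant) for base in base_sets for variant in candidate_variants]

blocks = ["blockA", "blockB", "blockC"]   # n = 3 (hypothetical labels)
variants = ["del_AT@105", "ins_G@110"]    # k = 2 (hypothetical labels)
candidates = enumerate_candidates(blocks, variants)
print(len(candidates))  # k * 2**n = 2 * 8 = 16
```

The exponential growth in n is why only the few block-haplotypes with the highest empirical frequency are retained.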

**Figure 3.**
Indel allele frequency spectra by homopolymer context, with indels called directly from mapped reads using a parsimony approach. For short homopolymers the expected 1/f distribution appears. The distribution for long homopolymers is not predicted by population genetics, even for high mutation rates, but is consistent with a high error rate in this sequence context.

**Figure 4.**
Accuracy of detection of indel sites of Dindel, SAMtools, and VarScan on simulated data. (A) Sensitivity and false discovery rates for reads simulated with a constant sequencing error indel rate of 0.005% per base at coverages of 4×, 20×, and 40×. Dindel was run with one candidate haplotype (“h = 1”) and eight candidate haplotypes (“h = 8”). The crosses indicate performance at the 99% confidence level (quality score of 20) of a non-reference indel variant being present. True-positives here are defined as indel calls that result in the same alternative haplotype sequence as that of the simulated indel. (B) Performance on data simulated with indels that were called from high-coverage real data of HapMap individual NA19240 and a realistic sequencing error indel model. Under this model, reads were simulated with increased sequencing error indel rates in long homopolymers, with rates estimated from the low-coverage data set of the 1000 Genomes pilot project. SAMtools was run with a constant sequencing error indel rate of 0.01% and of 1% (“e = 1%”).
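The correspondence between a 99% confidence level and a quality score of 20 follows if the score is Phred-scaled, that is Q = -10 · log10(1 - p), which is an assumption here consistent with the numbers in the caption. A minimal sketch of the conversion, with hypothetical function names:

```python
import math

def phred_from_confidence(p):
    """Phred-scaled quality for a call with posterior confidence p (0 <= p < 1)."""
    return -10.0 * math.log10(1.0 - p)

def confidence_from_phred(q):
    """Posterior confidence implied by a Phred-scaled quality q."""
    return 1.0 - 10.0 ** (-q / 10.0)

print(round(phred_from_confidence(0.99), 6))   # 20.0
print(round(confidence_from_phred(20), 6))     # 0.99
```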

**Figure 5.**
Power as a function of non-reference allele frequency in a simulated pooled analysis. The Dindel Bayesian EM algorithm was used to detect indels in a pool of 60 individuals with simulated read-depth of 4×. Due to increased sequencing error indel rate in long homopolymers, power decreases as a function of homopolymer length. Calls were made using a 99% confidence threshold on the posterior probability of a non-reference indel variant being present.

**Figure 6.**
Distribution of indel lengths in autosomal protein-coding regions called from 30 × 35 bp paired-end Illumina GA reads for NA18507. The fraction of indel calls resulting in a frameshift was, respectively, 57%, 65%, and 68% for Dindel, VarScan, and SAMtools.
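An indel in a coding sequence shifts the reading frame when its length is not a multiple of three, so a frameshift fraction like those quoted above can be computed directly from the called indel lengths. The sketch below assumes signed lengths (negative for deletions) and uses made-up values, not data from the paper; `frameshift_fraction` is a hypothetical helper name.

```python
def frameshift_fraction(indel_lengths):
    """Fraction of indels whose length is not a multiple of 3."""
    if not indel_lengths:
        raise ValueError("need at least one indel length")
    # The sign encodes insertion vs deletion; only the magnitude affects the frame.
    shifts = sum(1 for length in indel_lengths if abs(length) % 3 != 0)
    return shifts / len(indel_lengths)

# Hypothetical signed lengths in bp: negative = deletion, positive = insertion.
lengths = [-1, 3, -6, 2, 1, -3, 4]
print(frameshift_fraction(lengths))
```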

**Figure 7.**
Discovery rate of SeattleSNP indels from the 1000 Genomes pilot 1 samples with Illumina data (170 individuals sequenced at 2.99× on average). The horizontal axis represents the indel allele count in the SeattleSNP data set, the vertical axis the corresponding discovery rate in 1000 Genomes pilot 1 data.

Cited by

Compounds that select against the tetracycline-resistance efflux pump.
Stone LK, Baym M, Lieberman TD, Chait R, Clardy J, Kishony R. Nat Chem Biol. 2016 Nov;12(11):902-904. doi: 10.1038/nchembio.2176. Epub 2016 Sep 19. PMID: 27642863
Characterizing linkage disequilibrium and evaluating imputation power of human genomic insertion-deletion polymorphisms.
Lu JT, Wang Y, Gibbs RA, Yu F. Genome Biol. 2012 Feb 29;13(2):R15. doi: 10.1186/gb-2012-13-2-r15. PMID: 22377349
Variation in genes related to cochlear biology is strongly associated with adult-onset deafness in border collies.
Yokoyama JS, Lam ET, Ruhe AL, Erdman CA, Robertson KR, Webb AA, Williams DC, Chang ML, Hytönen MK, Lohi H, Hamilton SP, Neff MW. PLoS Genet. 2012 Sep;8(9):e1002898. doi: 10.1371/journal.pgen.1002898. Epub 2012 Sep 13. PMID: 23028339
Cis-regulatory somatic mutations and gene-expression alteration in B-cell lymphomas.
Mathelier A, Lefebvre C, Zhang AW, Arenillas DJ, Ding J, Wasserman WW, Shah SP. Genome Biol. 2015 Apr 23;16(1):84. doi: 10.1186/s13059-015-0648-7. PMID: 25903198
Detection of the Omicron SARS-CoV-2 Lineage and Its BA.1 Variant with Multiplex RT-qPCR.
Yolshin ND, Komissarov AB, Varchenko KV, Musaeva TD, Fadeev AV, Lioznov DA. Int J Mol Sci. 2022 Dec 18;23(24):16153. doi: 10.3390/ijms232416153. PMID: 36555794
