SVIM: structural variant identification using mapped long reads

doi:10.1093/bioinformatics/btz041

Bioinformatics. 2019 Sep 1;35(17):2907-2915.

David Heller¹, Martin Vingron¹

PMID: 30668829
PMCID: PMC6735718
DOI: 10.1093/bioinformatics/btz041

David Heller et al. Bioinformatics. 2019.

Affiliation

¹ Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.

Abstract

Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than single-nucleotide polymorphisms or small insertions and deletions. Additionally, they have a great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities.

Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from Pacific Biosciences and Nanopore sequencing machines.

Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Schematic overview of different SV classes. SVs can be categorized into deletions, cut&paste insertions, tandem and interspersed duplications, inversions and novel element insertions. Each SV class is depicted in an individual genome (lower line) when compared to the reference genome (upper line). The region being rearranged is marked in red

**Fig. 2.**
The SVIM workflow. (1) Signatures for SVs are collected from the input read alignments. SVIM collects them from within alignments (intra-alignment signatures) and between alignments (inter-alignment signatures). (2) Collected signatures are clustered based on their genomic position and span. (3) Signature clusters from different parts of the genome are combined to distinguish five different classes of SVs: deletions, interspersed duplications, novel insertions, inversions and tandem duplications
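
Step (2) of the workflow, clustering signatures by genomic position and span, can be illustrated with a minimal sketch. This is not SVIM's implementation: the `Signature` type, the greedy single-pass grouping, and the thresholds `max_pos_dist` and `max_span_ratio` are all assumptions chosen for illustration only.

```python
from typing import List, NamedTuple


class Signature(NamedTuple):
    """Hypothetical SV signature: a type plus a reference interval."""
    sv_type: str
    contig: str
    start: int
    end: int

    @property
    def span(self) -> int:
        return self.end - self.start

    @property
    def midpoint(self) -> float:
        return (self.start + self.end) / 2


def similar(a: Signature, b: Signature,
            max_pos_dist: int = 100, max_span_ratio: float = 0.3) -> bool:
    """Two signatures are similar if type, contig, position and span agree.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if (a.sv_type, a.contig) != (b.sv_type, b.contig):
        return False
    span_diff = abs(a.span - b.span) / max(a.span, b.span, 1)
    pos_diff = abs(a.midpoint - b.midpoint)
    return pos_diff <= max_pos_dist and span_diff <= max_span_ratio


def cluster_signatures(signatures: List[Signature]) -> List[List[Signature]]:
    """Greedy clustering: walk the sorted signatures and extend the last
    cluster whenever the new signature resembles that cluster's first member."""
    clusters: List[List[Signature]] = []
    ordered = sorted(signatures, key=lambda s: (s.sv_type, s.contig, s.midpoint))
    for sig in ordered:
        if clusters and similar(clusters[-1][0], sig):
            clusters[-1].append(sig)
        else:
            clusters.append([sig])
    return clusters
```

A greedy pass keeps the sketch short; a real caller would likely need a more robust clustering method, since the result here depends on which signature happens to open each cluster.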

**Fig. 3.**
Read signatures for an interspersed duplication and a novel element insertion. A genomic segment (yellow arrow) has been copied from locus 1 to locus 2a in an individual genome. Additionally, a novel genomic segment (gray arrow) has been inserted in locus 2b. Two reads are generated from the individual (top) and mapped to the reference genome (bottom). The first read (blue-yellow) consists of three segments. They are mapped individually to the reference genome. The two blue segments are mapped to locus 2a exhibiting an insertion signature. The yellow segment is mapped to locus 1 indicating the origin of the insertion. The second read (orange-gray) exhibits a similar insertion signature at locus 2b but as the inserted gray segment is unmapped its origin cannot be determined
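
The distinction drawn in this caption can be written down as a small decision rule. The sketch below is a simplification under stated assumptions: `InsertionSignature` and its `origin` field are hypothetical names, and the rule only separates the two cases shown in the figure. It does not handle tandem duplications or cut&paste insertions.

```python
from typing import NamedTuple, Optional, Tuple


class InsertionSignature(NamedTuple):
    """Hypothetical insertion signature at a destination locus."""
    contig: str
    position: int
    length: int
    # Reference interval (contig, start, end) where the inserted segment
    # maps, or None if the inserted segment is unmapped.
    origin: Optional[Tuple[str, int, int]]


def classify_insertion(sig: InsertionSignature) -> str:
    """Label an insertion by whether its inserted segment has a known origin."""
    if sig.origin is None:
        # Inserted sequence maps nowhere, so its origin cannot be determined.
        return "novel element insertion"
    # Inserted sequence maps to another locus, which gives the origin.
    return "interspersed duplication"
```

In the figure's terms, the blue-yellow read would carry an origin at locus 1, while the orange-gray read would carry `origin=None`.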

**Fig. 4.**
Comparison of SV detection performance on a 6× coverage homozygous simulated dataset. SVIM consistently yielded better recall (x-axis) and precision (y-axis) than the other tools for the recovery of INSs and tandem duplications. For the recovery of deletions and inversions, *Sniffles* reached the same recall as SVIM. The different points for each tool represent multiple settings of the tools’ most important parameters (see Section 2.5). *PBHoney-Spots* only detects deletions and INSs and *PBHoney-Tails* does not detect duplications. Recall and precision were calculated using a required reciprocal overlap of 50% between variant calls and the original simulated variants
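
The reciprocal-overlap criterion used for recall and precision can be sketched as follows. This is an illustrative reading of the criterion, not the evaluation code behind the figure: intervals are assumed to be half-open `(start, end)` pairs on the same contig and of the same SV type, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), assumed end > start


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two fractions of a and b covered by their overlap."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def precision_recall(calls: List[Interval], truth: List[Interval],
                     min_ro: float = 0.5) -> Tuple[float, float]:
    """Precision: share of calls matching some true variant.
    Recall: share of true variants matched by some call."""
    matched_calls = sum(
        any(reciprocal_overlap(c, t) >= min_ro for t in truth) for c in calls
    )
    matched_truth = sum(
        any(reciprocal_overlap(t, c) >= min_ro for c in calls) for t in truth
    )
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall
```

Taking the minimum of the two fractions means a short call inside a long variant does not count as a match, even though the call itself is fully covered.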

**Fig. 5.**
Comparison of recall on a 53× coverage public PacBio dataset and a 6× coverage subset with 2676 high-confidence deletion and 68 insertion calls. For each tool and different thresholds, the number of SV calls with score above the threshold (log-scale) is plotted against the recall. The upper and lower panels show performance on the full dataset and a randomly sampled 6× coverage subset of the data, respectively. SVIM reached the same recall with fewer calls than other tools. The vertical dotted lines denote the average number of deletions and insertions to expect in an individual as recently reported using a *de-novo* assembly approach (Chaisson *et al.*, 2018). Recall was calculated using a required reciprocal overlap of 50% (deletion calls) and 1% (insertion calls), respectively, between variant calls and the gold standard variants

**Fig. 6.**
Comparison of recall from NA12878 reads aligned to an altered reference genome. For each tool and different thresholds, the number of SV calls with score above the threshold (log-scale) is plotted against the recall. The upper and lower panels show performance on the full dataset and a randomly sampled 6× coverage subset of the data, respectively. In all six panels, SVIM outperformed all the other tools and reached substantially higher recall for similar numbers of calls. The improvement was most prominent for insertions. Recall was calculated using a required reciprocal overlap of 50% between variant calls and the original implanted variants

**Fig. 7.**
Venn diagram of three SV callsets for NA12878: SVIM calls on a 53× coverage PacBio dataset, SVIM calls on a 26× coverage Nanopore dataset and high-confidence calls from Parikh *et al.* (2016). Callsets were produced by merging SVIM calls with a score $\geq 40$ for deletions, interspersed duplications and novel element insertions. Subsequently, the diagram was generated using *pybedtools* (Dale *et al.*, 2011) and *matplotlib_venn*

Cited by

Fundamental Patterns of Structural Evolution Revealed by Chromosome-Length Genomes of Cactophilic Drosophila.
Benowitz KM, Allan CW, Jaworski CC, Sanderson MJ, Diaz F, Chen X, Matzkin LM. Genome Biol Evol. 2024 Sep 3;16(9):evae191. doi: 10.1093/gbe/evae191. PMID: 39228294.
Cas9 targeted nanopore sequencing with enhanced variant calling improves CYP2D6-CYP2D7 hybrid allele genotyping.
Rubben K, Tilleman L, Deserranno K, Tytgat O, Deforce D, Van Nieuwerburgh F. PLoS Genet. 2022 Sep 23;18(9):e1010176. doi: 10.1371/journal.pgen.1010176. eCollection 2022 Sep. PMID: 36149915.
FindCSV: a long-read based method for detecting complex structural variations.
Zheng Y, Shang X. BMC Bioinformatics. 2024 Sep 28;25(1):315. doi: 10.1186/s12859-024-05937-w. PMID: 39342151.
HapKled: a haplotype-aware structural variant calling approach for Oxford nanopore sequencing data.
Zhang Z, Liu Y, Li X, Liu Y, Wang Y, Jiang T. Front Genet. 2024 Jul 9;15:1435087. doi: 10.3389/fgene.2024.1435087. eCollection 2024. PMID: 39045321.
Whole-genome long-read sequencing downsampling and its effect on variant-calling precision and recall.
Harvey WT, Ebert P, Ebler J, Audano PA, Munson KM, Hoekzema K, Porubsky D, Beck CR, Marschall T, Garimella K, Eichler EE. Genome Res. 2023 Dec 27;33(12):2029-2040. doi: 10.1101/gr.278070.123. PMID: 38190646.
