Bioinformatics. 2019 Sep 1;35(17):2907-2915.
doi: 10.1093/bioinformatics/btz041.

SVIM: structural variant identification using mapped long reads

David Heller et al. Bioinformatics. 2019.

Abstract

Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than single-nucleotide polymorphisms or small insertions and deletions. Additionally, they have a great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities.

Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from Pacific Biosciences and Nanopore sequencing machines.
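The first two of the three components named above, collection and clustering of signatures, can be illustrated with a minimal sketch. This is not SVIM's implementation: it only scans the CIGAR string of a single alignment for long insertions and deletions (one kind of signature found within an alignment) and then groups signatures greedily. The names `Signature`, `collect_signatures` and `cluster_signatures`, the 50 bp size cutoff (taken from the definition of structural variants above) and the clustering thresholds are assumptions made for illustration. The combination step is not sketched.

```python
import re
from typing import List, NamedTuple

MIN_SV_SIZE = 50  # assumption: only events larger than 50 bp count as signatures


class Signature(NamedTuple):
    kind: str   # "DEL" or "INS"
    start: int  # 0-based reference position of the event
    span: int   # length of the event in bp


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def collect_signatures(ref_start: int, cigar: str) -> List[Signature]:
    """Scan one alignment's CIGAR string for long insertions and deletions."""
    signatures = []
    pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "D" and length > MIN_SV_SIZE:
            signatures.append(Signature("DEL", pos, length))
        elif op == "I" and length > MIN_SV_SIZE:
            signatures.append(Signature("INS", pos, length))
        if op in "MDN=X":  # operations that consume reference bases
            pos += length
    return signatures


def cluster_signatures(signatures: List[Signature],
                       max_distance: int = 100,
                       min_span_ratio: float = 0.7) -> List[List[Signature]]:
    """Greedily group same-type signatures with nearby positions and similar spans.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    clusters: List[List[Signature]] = []
    for sig in sorted(signatures, key=lambda s: (s.kind, s.start)):
        last = clusters[-1][-1] if clusters else None
        if (last is not None
                and last.kind == sig.kind
                and abs(sig.start - last.start) <= max_distance
                and min(sig.span, last.span) / max(sig.span, last.span) >= min_span_ratio):
            clusters[-1].append(sig)
        else:
            clusters.append([sig])
    return clusters


# Three hypothetical alignments given as (reference start, CIGAR string).
alignments = [(1000, "500M120D300M"), (1100, "410M115D200M"), (900, "600M60I100M")]
all_signatures = [s for start, cigar in alignments for s in collect_signatures(start, cigar)]
for cluster in cluster_signatures(all_signatures):
    print(cluster[0].kind, [(s.start, s.span) for s in cluster])
```

In this toy input, the two deletions seen in different reads fall into one cluster because they start within 100 bp of each other and have similar lengths, while the insertion forms its own cluster.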

Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Schematic overview of different SV classes. SVs can be categorized into deletions, cut&paste insertions, tandem and interspersed duplications, inversions and novel element insertions. Each SV class is depicted in an individual genome (lower line) when compared to the reference genome (upper line). The region being rearranged is marked in red
Fig. 2.
The SVIM workflow. (1) Signatures for SVs are collected from the input read alignments. SVIM collects them from within alignments (intra-alignment signatures) and between alignments (inter-alignment signatures). (2) Collected signatures are clustered based on their genomic position and span. (3) Signature clusters from different parts of the genome are combined to distinguish five different classes of SVs: deletions, interspersed duplications, novel insertions, inversions and tandem duplications
Fig. 3.
Read signatures for an interspersed duplication and a novel element insertion. A genomic segment (yellow arrow) has been copied from locus 1 to locus 2a in an individual genome. Additionally, a novel genomic segment (gray arrow) has been inserted in locus 2b. Two reads are generated from the individual (top) and mapped to the reference genome (bottom). The first read (blue-yellow) consists of three segments. They are mapped individually to the reference genome. The two blue segments are mapped to locus 2a exhibiting an insertion signature. The yellow segment is mapped to locus 1 indicating the origin of the insertion. The second read (orange-gray) exhibits a similar insertion signature at locus 2b but as the inserted gray segment is unmapped its origin cannot be determined
Fig. 4.
Comparison of SV detection performance on a 6× coverage homozygous simulated dataset. SVIM consistently yielded better recall (x-axis) and precision (y-axis) than the other tools for the recovery of insertions and tandem duplications. For the recovery of deletions and inversions, Sniffles reached the same recall as SVIM. The different points for each tool represent multiple settings of the tools’ most important parameters (see Section 2.5). PBHoney-Spots only detects deletions and insertions and PBHoney-Tails does not detect duplications. Recall and precision were calculated using a required reciprocal overlap of 50% between variant calls and the original simulated variants
Fig. 5.
Comparison of recall on a 53× coverage public PacBio dataset and a 6× coverage subset with 2676 high-confidence deletion and 68 insertion calls. For each tool and different thresholds, the number of SV calls with score above the threshold (log-scale) is plotted against the recall. The upper and lower panels show performance on the full dataset and a randomly sampled 6× coverage subset of the data, respectively. SVIM reached the same recall with fewer calls than other tools. The vertical dotted lines denote the average number of deletions and insertions to expect in an individual as recently reported using a de-novo assembly approach (Chaisson et al., 2018). Recall was calculated using a required reciprocal overlap of 50% (deletion calls) and 1% (insertion calls), respectively, between variant calls and the gold standard variants
Fig. 6.
Comparison of recall from NA12878 reads aligned to an altered reference genome. For each tool and different thresholds, the number of SV calls with score above the threshold (log-scale) is plotted against the recall. The upper and lower panels show performance on the full dataset and a randomly sampled 6× coverage subset of the data, respectively. In all six panels, SVIM outperformed all the other tools and reached substantially higher recall for similar numbers of calls. The improvement was most prominent for insertions. Recall was calculated using a required reciprocal overlap of 50% between variant calls and the original implanted variants
Fig. 7.
Venn diagram of three SV callsets for NA12878: SVIM calls on a 53× coverage PacBio dataset, SVIM calls on a 26× coverage Nanopore dataset and high-confidence calls from Parikh et al. (2016). Callsets were produced by merging SVIM calls with a score threshold of 40 for deletions, interspersed duplications and novel element insertions. Subsequently, the diagram was generated using pybedtools (Dale et al., 2011) and matplotlib_venn
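The evaluation criterion named in the captions of Figs. 4 to 6, a required reciprocal overlap between variant calls and true variants, can be sketched as follows. This is an illustrative reading of that criterion, not the evaluation code used in the paper. The interval representation and the function names `reciprocal_overlap`, `precision_recall` and `recall_curve` are assumptions, as are the example coordinates and scores. The `min_overlap` parameter covers both the 50% and the 1% setting mentioned in the captions.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the overlap length as a fraction of the longer interval.

    A value >= x means the overlap covers at least a fraction x of both intervals.
    """
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[2] - a[1], b[2] - b[1])


def precision_recall(calls: List[Interval], truth: List[Interval],
                     min_overlap: float = 0.5) -> Tuple[float, float]:
    """Precision: fraction of calls matching a true variant. Recall: fraction of true variants recovered."""
    matched_calls = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls)
    recovered = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for c in calls) for t in truth)
    precision = matched_calls / len(calls) if calls else 0.0
    recall = recovered / len(truth) if truth else 0.0
    return precision, recall


def recall_curve(scored_calls: List[Tuple[Interval, float]], truth: List[Interval],
                 thresholds: List[float], min_overlap: float = 0.5) -> List[Tuple[int, float]]:
    """For each threshold, return (number of calls with score above it, recall)."""
    curve = []
    for threshold in thresholds:
        kept = [interval for interval, score in scored_calls if score > threshold]
        _, recall = precision_recall(kept, truth, min_overlap)
        curve.append((len(kept), recall))
    return curve


# Hypothetical true variants and scored calls.
truth = [("chr1", 1000, 2000), ("chr1", 5000, 5600)]
scored_calls = [(("chr1", 1100, 2100), 30.0),
                (("chr1", 5000, 5200), 10.0),
                (("chr2", 1000, 2000), 5.0)]
print(precision_recall([interval for interval, _ in scored_calls], truth))
print(recall_curve(scored_calls, truth, thresholds=[0.0, 20.0]))
```

Dividing the overlap by the longer of the two intervals gives the same decision as checking the overlap fraction of each interval separately, because the longer interval always has the smaller fraction.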
