Cell. 2019 Jan 24;176(3):663-675.e19.
doi: 10.1016/j.cell.2018.12.019. Epub 2019 Jan 17.

Characterizing the Major Structural Variant Alleles of the Human Genome

Peter A Audano et al. Cell. 2019.

Abstract

To provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for 15 human genomes. We sequence-resolved 99,604 insertions, deletions, and inversions, including 2,238 (1.6 Mbp) that are shared among all discovery genomes, with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms that the most common SVs in unique euchromatin are now sequence-resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes, with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci, improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity.

Keywords: gap closure; human reference genome; major allele; real-time (SMRT) sequencing; single-molecule; structural variation; whole-genome sequence and assembly.


Figures

Figure 1. SV Discovery in 15 Human Genomes
(A) Variants from each sample were merged using a nonredundant strategy starting with CHM1 and iteratively adding unique calls from additional samples. The growth rate of the nonredundant set declines as the number of samples increases. Variants shared among all samples are shown as red portions of each bar. (B) The number of variants in each discovery class is shown per sample. As expected, African samples (asterisks) contribute a higher proportion of singleton variants. (C) Discovery class frequency for each variant type: insertion (INS), deletion (DEL), and inversion (INV). Compared to deletions, a greater proportion of insertions are shared among all samples. (D) For SVs that were genotypable in all 440 population samples and in non-repetitive loci, a discovery frequency is shown with bars colored by genotyping support based on allele frequency (AF). Generally, AF supports the genotype frequency. No support: AF = 0, Single: One allele, Poly: 0.5 > AF > 0, Major: 1 > AF ≥ 0.5, Shared: AF = 1. (E) For each SV, the distance to the end of the chromosome arm was calculated and divided into 500 kbp bins. The number of calls within 5 Mbp of the chromosome end (dashed box) confirms a nonrandom distribution of SVs.
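The genotyping-support classes in panel D and the end-of-arm binning in panel E can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' pipeline: the function names, the use of allele counts as input, the single centromere coordinate used to split arms, and the rule that "Single" (exactly one observed allele) takes precedence over "Poly" are all assumptions made here.

```python
def support_class(alt_count: int, total_alleles: int) -> str:
    """Map an allele count to the genotyping-support classes of panel D."""
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("need total_alleles > 0 and 0 <= alt_count <= total_alleles")
    if alt_count == 0:
        return "No support"  # AF = 0
    if alt_count == total_alleles:
        return "Shared"      # AF = 1
    if alt_count == 1:
        return "Single"      # assumption: exactly one allele observed
    af = alt_count / total_alleles
    return "Major" if af >= 0.5 else "Poly"  # 1 > AF >= 0.5, else 0.5 > AF > 0


def distance_to_arm_end(pos: int, centromere: int, chrom_len: int) -> int:
    """Distance in bp from pos to the telomeric end of its chromosome arm.

    Assumption: the arm boundary is a single hypothetical centromere coordinate.
    """
    return pos if pos < centromere else chrom_len - pos


def end_bin(distance_bp: int, bin_size: int = 500_000) -> int:
    """Index of the 500 kbp bin that a distance falls into (bin 0 is at the end)."""
    return distance_bp // bin_size
```

With 440 diploid samples there are 880 alleles, so `support_class(440, 880)` falls in the "Major" class, and any call whose `end_bin` value is below 10 lies within the last 5 Mbp of its arm.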
Figure 2. VNTR Distribution and Double-Strand Break Correlation
(A) An ideogram showing the distribution of VNTRs (green; below chromosome) and STRs (blue; above chromosome). STRs (n = 16,619) and VNTRs (n = 55,585) were defined as tandem repeats within the SV sequence with tandem motif lengths of ≤ 6 bp and ≥ 7 bp, respectively. The tick marks on the axes for each chromosome indicate a value of 20 per 500 kbp bin. (B) A cumulative distribution of VNTRs shows that the most rapid saturation pattern for VNTRs belongs to chromosome 17p with approximately 85% of all VNTRs found in the last 5 Mbp. Windows of 500 kbp sliding from telomere ends to the centromere were used to cumulatively count STRs and VNTRs. The x axis is truncated at 20 Mbp. (C) Abundance of STRs and VNTRs is positively correlated with the distribution of double-strand breaks with the strongest correlation occurring for larger VNTRs (R = 0.48) compared to STRs (R = 0.27).
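The motif-length rule in panel A and the cumulative counting in panel B are simple enough to illustrate directly. The sketch below assumes each repeat is represented by its motif string and by its distance in bp from the telomere end; the function names and the toy distances are hypothetical and are not taken from the study.

```python
def classify_tandem_repeat(motif: str) -> str:
    """STR if the tandem motif is <= 6 bp, VNTR if it is >= 7 bp."""
    if not motif:
        raise ValueError("motif must be non-empty")
    return "STR" if len(motif) <= 6 else "VNTR"


def cumulative_fraction(distances, window=500_000, max_dist=20_000_000):
    """Fraction of repeats closer to the telomere than each window edge.

    Edges run from `window` to `max_dist` in steps of `window`, matching
    500 kbp windows and an x axis truncated at 20 Mbp.
    """
    if not distances:
        raise ValueError("distances must be non-empty")
    total = len(distances)
    edges = range(window, max_dist + 1, window)
    return [sum(d < edge for d in distances) / total for edge in edges]


# Toy example (made-up distances, in bp from the telomere end).
curve = cumulative_fraction([100_000, 600_000, 4_900_000, 12_000_000])
fraction_within_5_mbp = curve[9]  # the tenth edge is 5 Mbp
```

A chromosome arm that saturates quickly, as the caption describes for 17p, would show a curve that approaches its maximum within the first ten windows.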
Figure 3. GC Content Distribution
(A) The mean GC composition (dashed vertical lines colored by discovery class) is greater than the reference (black dashed vertical line), but the distribution is also skewed toward lower GC content. The null distribution over the reference was computed excluding the same regions used to filter SV calls. (B) Excluding repeats, the GC distribution follows the reference distribution more closely, but shared variants still exhibit a multi-modal distribution with peaks in GC-rich regions. Repeat content was annotated by RepeatMasker and Tandem Repeats Finder (TRF). In addition to the SV call filter, SD- and TRF-annotated loci in GRCh38 are excluded when calculating the reference distribution.
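For readers unfamiliar with the quantity plotted here, GC composition is the fraction of G and C bases in a sequence. A minimal sketch follows; skipping ambiguous bases such as N is an assumption of this example, not something the caption specifies, and the function name is hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / (gc + at)
```

Applying such a function to each SV sequence and to comparable windows of the reference would give the two distributions the panels compare; the windowing and filtering details are not given in the caption.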
Figure 4. Missing Genic and Regulatory Sequence
(A) A shared 1.6 kbp insertion in the 5’ UTR of UBE2QL1 is almost completely composed of simple repeat units (CACA) or low-complexity, GC-rich sequences. The breakpoints lie precisely at the start position of the 5’ UTR, and the missing sequence is largely conserved among chimpanzee and orangutan haplotypes. (B) A 458 bp insertion is detected in 50% of the discovery samples in the large 5.63 kbp 3’ UTR of APOOL. The insertion is composed of an AT-rich repeat array consisting of 30 bp units for a total of 24 tandem copies. Because of its AT-rich sequence composition, analysis with RNA-seq is inconclusive (“ind human” is a brain sample from a single anonymous individual). Comparison with nonhuman primates reveals that the repeat array is largely absent. (C) A 1.1 kbp shared insertion in the 3’ UTR of ADARB1 corresponds to a large VNTR composed primarily of GC-rich sequence. Each repeat unit is 42 bp with a variable number of copies present in CHM13, chimpanzee, and orangutan. We detect 31 tandem copies in CHM13 compared to only 7 in the GRCh38 reference assembly. (D) A 13.8 kbp inversion in intron 32 of DSCAM. The shared inversion is flanked by inverted, complete LINE-L1 repeat sequences. (E) A 480 bp shared insertion detected in the first exon of RRBP1 (transcript ENST00000246043.8) is associated with gaps in RefSeq and UCSC gene annotations (top). Mapping human IPS-derived PacBio Iso-Seq data to the GRCh38 reference assembly identifies discordant read alignments at the insertion site (Iso-Seq alignments, left). Analysis of the insertion and adjacent flanking sequence identifies a large VNTR (1,380 bp) composed of 30 bp repeat units. In our discovery set, the number of copies varies between 15 (450 bp) and 16 (480 bp). Translation of the newly assembled haplotype sequence from CHM13 (15 copies, 450 bp) shows that the insertion maintains the open reading frame and adds an additional 150 amino acids (Iso-Seq alignments, right).
For each panel: regions of shared or major allele structural variation are annotated and compared between GRCh38, alternate human reference assemblies (CHM1/CHM13), and nonhuman primates. Multiple sequence alignments were generated using MAFFT or visualized using Miropeats against sequenced large-insert clones. Additional functional annotations are shown using short-read Illumina RNA-seq data, PolyA-seq, and PacBio long-read Iso-Seq data.
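The RRBP1 numbers in panel E follow from simple arithmetic: copies times unit length gives the inserted length, and an insertion preserves the reading frame only when that length is a multiple of 3. The sketch below works through it; the helper names are hypothetical.

```python
def repeat_array_length(copies: int, unit_bp: int) -> int:
    """Total length in bp of a tandem array of `copies` units of `unit_bp`."""
    return copies * unit_bp


def added_amino_acids(inserted_bp: int):
    """Codons added by an in-frame coding insertion, or None if it shifts the frame."""
    if inserted_bp % 3 != 0:
        return None
    return inserted_bp // 3


chm13_bp = repeat_array_length(15, 30)   # 15 copies of a 30 bp unit
chm13_aa = added_amino_acids(chm13_bp)   # in-frame, so a codon count is returned
```

For the CHM13 haplotype this reproduces the 450 bp and 150 amino acids stated in the caption. By the same arithmetic the 16-copy allele (480 bp) would also be in frame, although the caption does not report a translation for it.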
Figure 5. Correcting Regulatory Elements and the FOXO6 Reading Frame
(A) A high-GC 1.2 kbp insertion immediately upstream of KDM6B was discovered in all but one sample (HG04217, Telugu). This variant is proximal to an AGP switch-point in GRCh38, and it was genotypable in 16% of Illumina samples with an allele frequency of 1.0 suggesting that observed variation among humans and nonhuman primates may be a technical artifact. (B) A high-GC 1.5 kbp insertion proximal to the FGFR1OP promoter appears to be present in nonhuman primates but has become variable in humans with a discovery frequency of 0.66 and genotype allele frequency of 0.76. (C) A 200 bp shared insertion (80% GC) in the final exon of FOXO6 is surrounded by low-complexity, GC-rich (> 70%) repeat sequences. Translation of the complete open reading frame (ORF) demonstrates a 67 amino acid deletion in the reference (ENST00000641094.1). (D) Using the gnomAD database, we identified loss-of-function (LoF) variation in FOXO6 (red points) and show their coding positions (x axis) and their allele count (y axis) with a dashed line representing the SV insertion. The LoF variants with the highest allele counts (6, 10, and 38) were no longer annotated as LoF when translated in the corrected reading frame. Two frameshift variants at the breakpoint of the insertion are a 32 bp and a 200 bp insertion with an allele count of 3 and 2, respectively, and the inserted sequence for both is > 99% identical with our SV call.


