Characterizing the Major Structural Variant Alleles of the Human Genome

Cell. 2019 Jan 24;176(3):663-675.e19.

doi: 10.1016/j.cell.2018.12.019. Epub 2019 Jan 17.

Peter A Audano¹, Arvis Sulovari¹, Tina A Graves-Lindsay², Stuart Cantsilieris¹, Melanie Sorensen¹, AnneMarie E Welch¹, Max L Dougherty¹, Bradley J Nelson¹, Ankeeta Shah³, Susan K Dutcher², Wesley C Warren², Vincent Magrini⁴, Sean D McGrath⁵, Yang I Li⁶, Richard K Wilson⁴, Evan E Eichler⁷

Affiliations

¹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA.
² McDonnell Genome Institute, Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108, USA.
³ Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL 60637, USA.
⁴ Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH 43205, USA; The Ohio State University College of Medicine, Columbus, OH 43210, USA.
⁵ Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH 43205, USA.
⁶ Section of Genetic Medicine, University of Chicago, Chicago, IL 60637, USA; Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
⁷ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA. Electronic address: [email protected].

PMID: 30661756
PMCID: PMC6438697
DOI: 10.1016/j.cell.2018.12.019

Peter A Audano et al. Cell. 2019.

Abstract

In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity.

Keywords: gap closure; human reference genome; major allele; single-molecule, real-time (SMRT) sequencing; structural variation; whole-genome sequence and assembly.

Figures

**Figure 1. SV Discovery in 15 Human Genomes**
(A) Variants from each sample were merged using a nonredundant strategy starting with CHM1 and iteratively adding unique calls from additional samples. The growth rate of the nonredundant set declines as the number of samples increases. Variants shared among all samples are shown as red portions of each bar. (B) The number of variants in each discovery class is shown per sample. As expected, African samples (asterisks) contribute a higher proportion of singleton variants. (C) Discovery class frequency for each variant type: insertion (INS), deletion (DEL), and inversion (INV). Compared to deletions, a greater proportion of insertions are shared among all samples. (D) For SVs that were genotypable in all 440 population samples and in non-repetitive loci, a discovery frequency is shown with bars colored by genotyping support based on allele frequency (AF). Generally, AF supports the genotype frequency. No support: AF = 0, Single: One allele, Poly: 0.5 > AF > 0, Major: 1 > AF ≥ 0.5, Shared: AF = 1. (E) For each SV, the distance to the end of the chromosome arm was calculated and divided into 500 kbp bins. The number of calls within 5 Mbp of the chromosome end (dashed box) confirms a nonrandom distribution of SVs.
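
The genotyping-support classes in (D) and the 500 kbp binning in (E) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the singleton check is assumed to take precedence over the allele-frequency thresholds, and "within 5 Mbp" is assumed to mean a distance strictly below 5,000,000 bp.

```python
from collections import Counter

BIN_SIZE = 500_000      # 500 kbp bins, as in panel (E)
END_WINDOW = 5_000_000  # 5 Mbp window at the chromosome end


def support_class(alt_alleles, total_alleles):
    """Map allele counts to the support classes named in panel (D).

    Assumption: a single observed allele is reported as "Single" before
    the allele-frequency thresholds are applied.
    """
    if total_alleles <= 0 or not 0 <= alt_alleles <= total_alleles:
        raise ValueError("allele counts out of range")
    if alt_alleles == 0:
        return "No support"           # AF = 0
    if alt_alleles == total_alleles:
        return "Shared"               # AF = 1
    if alt_alleles == 1:
        return "Single"               # one allele
    if 2 * alt_alleles >= total_alleles:
        return "Major"                # 1 > AF >= 0.5
    return "Poly"                     # 0.5 > AF > 0


def end_bin_counts(distances_bp):
    """Count SVs per 500 kbp bin of distance to the chromosome-arm end."""
    return Counter(d // BIN_SIZE for d in distances_bp if d >= 0)


def fraction_near_end(distances_bp):
    """Fraction of SVs lying less than 5 Mbp from the chromosome end."""
    if not distances_bp:
        raise ValueError("no distances given")
    near = sum(1 for d in distances_bp if 0 <= d < END_WINDOW)
    return near / len(distances_bp)
```

With 440 diploid samples there are 880 alleles, so `support_class(440, 880)` falls in the "Major" class under these assumptions.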

**Figure 2. VNTR Distribution and Double-Strand Break Correlation**
(A) An ideogram showing the distribution of VNTRs (green; below chromosome) and STRs (blue; above chromosome). STRs (n = 16,619) and VNTRs (n = 55,585) were defined as tandem repeats within the SV sequence with tandem motif lengths of ≤ 6 bp and ≥ 7 bp, respectively. The tick marks on the axes for each chromosome indicate a value of 20 per 500 kbp bin. (B) A cumulative distribution of VNTRs shows that the most rapid saturation pattern for VNTRs belongs to chromosome 17p with approximately 85% of all VNTRs found in the last 5 Mbp. Windows of 500 kbp sliding from telomere ends to the centromere were used to cumulatively count STRs and VNTRs. The x axis is truncated at 20 Mbp. (C) Abundance of STRs and VNTRs is positively correlated with the distribution of double-strand breaks with the strongest correlation occurring for larger VNTRs (R = 0.48) compared to STRs (R = 0.27).
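
The motif-length definition in (A) and the cumulative counting in (B) can be sketched as follows. This is an illustrative reading of the legend rather than the published code: the function names are hypothetical, distances are assumed to be measured in bp from the telomere end, and the 20 Mbp limit mirrors the truncated x axis.

```python
def repeat_class(motif_len):
    """Classify a tandem repeat by motif length: <= 6 bp STR, >= 7 bp VNTR."""
    if motif_len < 1:
        raise ValueError("motif length must be positive")
    return "STR" if motif_len <= 6 else "VNTR"


def cumulative_fraction(distances_bp, window=500_000, limit=20_000_000):
    """Cumulative fraction of repeats per window, moving away from the telomere.

    Element i is the fraction of all repeats whose distance is below
    (i + 1) * window. Repeats beyond `limit` still count in the total,
    so the curve need not reach 1.0.
    """
    total = len(distances_bp)
    if total == 0:
        return []
    n_bins = limit // window
    counts = [0] * n_bins
    for d in distances_bp:
        idx = d // window
        if 0 <= idx < n_bins:
            counts[idx] += 1
    curve, running = [], 0
    for c in counts:
        running += c
        curve.append(running / total)
    return curve
```

Under these assumptions, the value at index 9 of the returned curve is the fraction of repeats found in the last 5 Mbp.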

**Figure 3. GC Content Distribution**
(A) The mean GC composition (dashed vertical lines colored by discovery class) is greater than the reference (black dashed vertical line), but the distribution is also skewed toward lower GC content. The null distribution over the reference was computed excluding the same regions used to filter SV calls. (B) Excluding repeats, the GC distribution follows the reference distribution more closely, but shared variants still exhibit a multi-modal distribution with peaks in GC-rich regions. Repeat content was annotated by RepeatMasker and Tandem Repeats Finder (TRF). In addition to the SV call filter, SD- and TRF-annotated loci in GRCh38 are excluded when calculating the reference distribution.
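
As a rough illustration of the quantity plotted here, the sketch below computes GC fraction per SV sequence and averages it per discovery class. The helper names are hypothetical, ambiguous bases such as N are assumed to be ignored, and the mean is assumed to be taken per variant rather than per base; the legend does not specify either choice.

```python
def gc_fraction(seq):
    """GC fraction of a sequence, ignoring any base other than A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    if called == 0:
        raise ValueError("sequence has no called bases")
    return gc / called


def mean_gc_by_class(records):
    """Average per-variant GC fraction for each discovery class.

    `records` is an iterable of (discovery_class, sequence) pairs.
    """
    totals = {}
    for cls, seq in records:
        running_sum, n = totals.get(cls, (0.0, 0))
        totals[cls] = (running_sum + gc_fraction(seq), n + 1)
    return {cls: s / n for cls, (s, n) in totals.items()}
```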

**Figure 4. Missing Genic and Regulatory Sequence**
(A) A shared 1.6 kbp insertion in the 5’ UTR of *UBE2QL1* is almost completely composed of simple repeat units (CACA) or low-complexity, GC-rich sequences. The breakpoints lie precisely at the start position of the 5’ UTR, and the missing sequence is largely conserved among chimpanzee and orangutan haplotypes. (B) A 458 bp insertion is detected in 50% of the discovery samples in the large 5.63 kbp 3’ UTR of *APOOL*. The insertion is composed of an AT-rich repeat array consisting of 30 bp units for a total of 24 tandem copies. Because of its AT-rich sequence composition, analysis with RNA-seq is inconclusive (“ind human” is a brain sample from a single anonymous individual). Comparison with nonhuman primates reveals that the repeat array is largely absent. (C) A 1.1 kbp shared insertion in the 3’ UTR of *ADARB1* corresponds to a large VNTR composed primarily of GC-rich sequence. Each repeat unit is 42 bp with a variable number of copies present in CHM13, chimpanzee, and orangutan. We detect 31 tandem copies in CHM13 compared to only 7 in the GRCh38 reference assembly. (D) A 13.8 kbp inversion in intron 32 of *DSCAM*. The shared inversion is flanked by inverted, complete LINE-L1 repeat sequences. (E) A 480 bp shared insertion detected in the first exon of *RRBP1* (transcript ENST00000246043.8) is associated with gaps in RefSeq and UCSC gene annotations (top). Mapping human IPS-derived PacBio Iso-Seq data to the GRCh38 reference assembly identifies discordant read alignments at the insertion site (Iso-Seq alignments, left). Analysis of the insertion and adjacent flanking sequence identifies a large VNTR (1,380 bp) composed of 30 bp repeat units. In our discovery set, the number of copies varies between 15 (450 bp) and 16 (480 bp). Translation of the newly assembled haplotype sequence from CHM13 (15 copies, 450 bp) shows that the insertion maintains the open reading frame and adds an additional 150 amino acids (Iso-Seq alignments, right).
For each panel: regions of shared or major allele structural variation are annotated and compared between GRCh38, alternate human reference assemblies (CHM1/CHM13), and nonhuman primates. Multiple sequence alignments were generated using MAFFT or visualized using Miropeats against sequenced large-insert clones. Additional functional annotations are shown using short-read Illumina RNA-seq data, PolyA-seq, and PacBio long-read Iso-Seq data.
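
The arithmetic behind panel (E) can be checked with a short sketch: 15 copies of a 30 bp unit give 450 bp, which is a multiple of 3 and therefore adds 150 codons without shifting the reading frame. The function names are hypothetical and the check only considers insertion length, not sequence content.

```python
def vntr_length(unit_bp, copies):
    """Total length in bp of a tandem array of `copies` units."""
    if unit_bp < 1 or copies < 0:
        raise ValueError("unit length must be positive and copies non-negative")
    return unit_bp * copies


def added_codons(insert_bp):
    """Codons added by an in-frame insertion, or None if it shifts the frame."""
    if insert_bp < 0:
        raise ValueError("insertion length must be non-negative")
    if insert_bp % 3 != 0:
        return None
    return insert_bp // 3


# Values taken from the legend: 30 bp units, 15 or 16 copies.
chm13_insert = vntr_length(30, 15)   # 450 bp
longer_allele = vntr_length(30, 16)  # 480 bp
```

Here `added_codons(chm13_insert)` returns 150, matching the 150 additional amino acids reported for the CHM13 haplotype.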

**Figure 5. Correcting Regulatory Elements and the *FOXO6* Reading Frame**
(A) A high-GC 1.2 kbp insertion immediately upstream of *KDM6B* was discovered in all but one sample (HG04217, Telugu). This variant is proximal to an AGP switch-point in GRCh38, and it was genotypable in 16% of Illumina samples with an allele frequency of 1.0 suggesting that observed variation among humans and nonhuman primates may be a technical artifact. (B) A high-GC 1.5 kbp insertion proximal to the *FGFR1OP* promoter appears to be present in nonhuman primates but has become variable in humans with a discovery frequency of 0.66 and genotype allele frequency of 0.76. (C) A 200 bp shared insertion (80% GC) in the final exon of *FOXO6* is surrounded by low-complexity, GC-rich (> 70%) repeat sequences. Translation of the complete open reading frame (ORF) demonstrates a 67 amino acid deletion in the reference (ENST00000641094.1). (D) Using the gnomAD database, we identified loss-of-function (LoF) variation in *FOXO6* (red points) and show their coding positions (x axis) and their allele count (y axis) with a dashed line representing the SV insertion. The LoF variants with the highest allele counts (6, 10, and 38) were no longer annotated as LoF when translated in the corrected reading frame. Two frameshift variants at the breakpoint of the insertion are a 32 bp and a 200 bp insertion with an allele count of 3 and 2, respectively, and the inserted sequence for both is > 99% identical with our SV call.

Cited by

Quartet DNA reference materials and datasets for comprehensively evaluating germline variant calling performance.
Ren L, Duan X, Dong L, Zhang R, Yang J, Gao Y, Peng R, Hou W, Liu Y, Li J, Yu Y, Zhang N, Shang J, Liang F, Wang D, Chen H, Sun L, Hao L; Quartet Project Team; Scherer A, Nordlund J, Xiao W, Xu J, Tong W, Hu X, Jia P, Ye K, Li J, Jin L, Hong H, Wang J, Fan S, Fang X, Zheng Y, Shi L. Genome Biol. 2023 Nov 27;24(1):270. doi: 10.1186/s13059-023-03109-2. PMID: 38012772. Free PMC article.
Detection of breeding signatures in wheat using a linkage disequilibrium-corrected mapping approach.
Dadshani S, Mathew B, Ballvora A, Mason AS, Léon J. Sci Rep. 2021 Mar 9;11(1):5527. doi: 10.1038/s41598-021-85226-1. PMID: 33750919. Free PMC article.
The role of clustered protocadherins in neurodevelopment and neuropsychiatric diseases.
Flaherty E, Maniatis T. Curr Opin Genet Dev. 2020 Dec;65:144-150. doi: 10.1016/j.gde.2020.05.041. Epub 2020 Jul 14. PMID: 32679536. Free PMC article. Review.
Towards accurate and reliable resolution of structural variants for clinical diagnosis.
Liu Z, Roberts R, Mercer TR, Xu J, Sedlazeck FJ, Tong W. Genome Biol. 2022 Mar 3;23(1):68. doi: 10.1186/s13059-022-02636-8. PMID: 35241127. Free PMC article. Review.
Chimeric RNAs Discovered by RNA Sequencing and Their Roles in Cancer and Rare Genetic Diseases.
Sun Y, Li H. Genes (Basel). 2022 Apr 22;13(5):741. doi: 10.3390/genes13050741. PMID: 35627126. Free PMC article. Review.
