Nature. 2020 Sep;585(7823):79-84.
doi: 10.1038/s41586-020-2547-7. Epub 2020 Jul 14.

Telomere-to-telomere assembly of a complete human X chromosome

Karen H Miga et al. Nature. 2020 Sep.

Abstract

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. Here we present a human genome assembly that surpasses the continuity of GRCh38 [2], along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.

Conflict of interest statement

E.E.E. is on the scientific advisory board of DNAnexus. K.H.M., S.K. and W.T. have received travel funds to speak at symposia organized by Oxford Nanopore. W.T. has two patents licensed to Oxford Nanopore (US patent 8,748,091 and US patent 8,394,584). A.D.S., J.-M.B. and S.S. are employees of Arima Genomics. R.R. shares equity in NanoString Technologies and is the principal investigator on an NIH SBIR subcontract research agreement with TwinStrand Biosciences. All other authors have no competing interests to declare.

Figures

Fig. 1. CHM13 whole-genome assembly and validation.
a, Gapless contigs are illustrated as blue and orange bars next to the chromosome ideograms (highlighting contig breaks). Several chromosomes are broken only in centromeric regions. Large gaps between contigs (for example, middle of chr1) indicate sites of large heterochromatic blocks (arrays of human satellite 2 and 3 in yellow) or ribosomal DNA arrays with no GRCh38 sequence. Centromeric satellite arrays that are expected to be similar in sequence between non-homologous chromosomes are indicated: chr1, chr5 and chr19 (green); chr4 and chr9 (light blue); chr5 and chr19 (pink); chr13 and chr21 (red); and chr14 and chr22 (purple). b, The X chromosome was selected for manual assembly, and was initially broken at three locations: the centromere (artificially collapsed in the assembly), a large segmental duplication (DMRTC1B, 120 kb), and a second segmental duplication with a paralogue on chromosome 2 (134 kb). Gaps in the GRCh38 reference (black) and known segmental duplications (red; paralogous to Y, pink) are annotated. Repeats larger than 100 kb are named with the expected size (kb) (blue, tandem repeats; red, segmental duplications). c, Misassembly of the GAGE locus identified by the optical map (top), and corrected version (bottom) showing the final assembly of 19 (9.5 kb) full-length repeat units and two partial repeats. d, Quality of the GAGE locus before and after polishing using unique (single-copy) markers to place long reads. Dots indicate coverage depth (number of mapped sequencing reads overlapping each base) of the primary (black) and secondary (red) alleles recovered from mapped PacBio HiFi reads (Supplementary Note 4). Because the CHM13 genome is effectively haploid, regions of low coverage or increased secondary allele frequency indicate low-quality regions or potential repeat collapses. Marker-assisted polishing markedly improved allele uniformity across the entire GAGE locus.
Fig. 2. Validated structure of the 3.1-Mb CHM13 X-centromere array.
a, Top, the array, with approximately 2-kb repeat units labelled by vertical bands (grey is the canonical unit; coloured are structural variants). A single LINE/L1Hs insertion in the array is marked by an arrowhead. Bottom, a predicted restriction map for enzyme BglI, with dashed lines indicating regions outside of the DXZ1 array. A minimum tiling path was reconstructed for illustration purposes and was not the mechanism for initial assembly (Extended Data Fig. 5b). b, Experimental PFGE Southern blotting for a BglI digest in duplicate (band sizing indicated by triangles; BglI, 2.87 Mb ± 0.16), that matches the in silico predicted band patterns (a) for the CHM13 array (experimentally repeated six times with similar results). c, Array size estimates were provided using ddPCR (performed in triplicate; mean ± s.d.) optimized against PFGE Southern blots (HAP1, n = 6; T6012, n = 4; LT690, n = 7; CHM13, n = 13). d, Catalogue of 33 DXZ1 structural variants identified relative to the 2,057-bp canonical repeat unit (grey), along with the number of instances observed, frequency in the array, number of alpha satellite monomers and size. INS, insertion (that is, the 8.1-kb inserted LINE/L1Hs). e, Coverage depth of mapped (grey) and uniquely anchored (black) nanopore reads to the DXZ1 array. Marker-assisted polishing (bottom) improves coverage uniformity versus the unpolished (top) assembly. Single-copy, unique markers are shown as vertical green bands, with a decreased but non-zero density across the array. f, Distributions show the spacing between adjacent unique markers on chromosome X and DXZ1. On average, unique markers are found every 66 bases on chromosome X, but only every 2.3 kb in DXZ1, with the longest gap between any two adjacent markers being 42 kb.
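
The marker-spacing distribution described in panel f can be illustrated with a minimal Python sketch. The function names and the toy coordinates below are hypothetical and are not taken from the paper; the sketch only shows how gaps between adjacent unique markers, their mean and their maximum might be derived from a list of marker positions.

```python
def marker_spacing(positions):
    """Return the distances between adjacent marker positions.

    `positions` is any iterable of integer coordinates; it is sorted
    first so that gaps are always measured between neighbours.
    """
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def summarize(gaps):
    """Mean and maximum gap, the two statistics quoted in the caption."""
    return {"mean": sum(gaps) / len(gaps), "max": max(gaps)}


# Toy coordinates (illustrative only): dense markers with one long,
# marker-free stretch, mimicking a repeat array.
toy_positions = [100, 160, 230, 2500, 2570]
gaps = marker_spacing(toy_positions)
summary = summarize(gaps)
print(gaps, summary)
```

On real data the same calculation would be run separately for chromosome X as a whole and for the DXZ1 interval, which is how a contrast such as 66 bases versus 2.3 kb would show up.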
Fig. 3. Chromosome-wide analysis of CpG methylation.
Methylation estimates were calculated by smoothing methylation frequency data with a window size of 500 nucleotides. Coverage depth and high-quality methylation calls (|log-likelihood| > 2.5) for PAR1, DXZ1 and DXZ4 are shown as insets. Only reads with a confident unique anchor mapping and the presence of at least one high-quality methylation call were considered. a, Nanopore coverage and methylation calls for pseudoautosomal region 1 (PAR1) of chromosome X (1,563–2,600,000). Bottom Integrative Genomics Viewer (IGV) inset shows a region of hypomethylation within PAR1 (770,545–801,293) with unmethylated bases in blue and methylated bases in red. b, Methylation in the DXZ1 array, with bottom IGV inset showing an approximately 93-kb region of hypomethylation near the centromere of chromosome X (59,213,083–59,306,271). c, Vertical black dashed lines indicate the beginning and end coordinates of the DXZ4 array. Left IGV inset shows a methylated region of DXZ4 in chromosome X (113,870,751–113,901,499); right IGV inset shows a transition from a methylated to an unmethylated region of DXZ4 (114,015,971–114,077,699).
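
The filtering and smoothing steps named in this caption can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the 2.5 threshold and 500-nucleotide window come from the caption, but the input layout (position, log-likelihood ratio), the convention that a positive ratio means "methylated", and the use of a plain moving average as the smoother are all assumptions made here.

```python
from collections import defaultdict

LLR_THRESHOLD = 2.5  # |log-likelihood| cutoff quoted in the caption
WINDOW = 500         # smoothing window (nucleotides) quoted in the caption


def methylation_frequency(calls, threshold=LLR_THRESHOLD):
    """Per-site methylation frequency from high-confidence calls only.

    `calls` is an iterable of (position, log_lik_ratio) pairs.
    Assumption: a positive ratio supports methylation.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated, total]
    for pos, llr in calls:
        if abs(llr) <= threshold:
            continue  # drop low-confidence calls
        counts[pos][1] += 1
        if llr > 0:
            counts[pos][0] += 1
    return {pos: m / t for pos, (m, t) in counts.items()}


def smooth(freq, window=WINDOW):
    """Moving average over sites within window/2 of each site (assumed smoother)."""
    half = window // 2
    positions = sorted(freq)
    return {
        p: sum(freq[q] for q in positions if abs(q - p) <= half)
        / sum(1 for q in positions if abs(q - p) <= half)
        for p in positions
    }


# Toy calls (illustrative only); the call with ratio 1.0 is filtered out.
toy_calls = [(100, 5.0), (100, -3.0), (100, 1.0), (300, 4.0), (900, -6.0)]
freq = methylation_frequency(toy_calls)
smoothed = smooth(freq)
print(freq, smoothed)
```

The quadratic scan in `smooth` keeps the example short; a real chromosome-scale track would need a sliding-window implementation.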
Extended Data Fig. 1. Spectral karyotyping analysis of CHM13 confirmed a normal 46,XX karyotype.
a, Chromosomes and karyotype of CHM13 cell line at passage 10. Mitotic metaphase spreads were prepared from cells treated with colcemid and processed as detailed in Methods. Spectral karyotyping analysis demonstrated a normal 46,XX karyotype. A representative karyotype is shown from one of ten spreads analysed; all ten had similar results. Scale bar, 10 μm. b, CHM13 G-banding karyotype. A total of 20 CHM13 metaphase spreads were independently characterized and all showed a similar normal 46,XX female karyotype, as shown.
Extended Data Fig. 2. Inferred ancestry of CHM13.
a, b, Proportion of ancestry explained by each cluster as estimated by ADMIXTURE using K = 6 (a) or K = 9 (b) for 10 randomly sampled individuals from each population and CHM13. Analysis based on 1,964 unrelated individuals from the 1KG and SGDP. CHM13 is highlighted in red font along with a black bounding rectangle.
Extended Data Fig. 3. Results of using CHM13 as a reference when describing structural variation.
Assemblytics large insertion and deletion calls for four long-read assemblies with respect to CHM13 (in dark red or red) and GRCh38 (in black or grey). Using CHM13 as a reference yields balanced counts of insertions and deletions, whereas an excess of insertion calls is observed when using GRCh38, suggesting a probable deletion bias in GRCh38. SVs, structural variants.
Extended Data Fig. 4. Telomere length in the reads and the assembly.
The assembly telomere sizes are consistent with the larger sizes observed in the reads. The shorter peak in telomere length within the reads is probably an artefact of premature read ends rather than true telomere ends. ONT, Oxford Nanopore Technologies; PB, Pacific Biosciences.
Extended Data Fig. 5. Evaluation of the structure of the X-centromeric satellite array (DXZ1) assembly.
a, The satellite array on the X chromosome (DXZ1) is defined at the sequence level as a multi-megabase-sized array of alpha satellite DNA. The canonical repeat of the DXZ1 array is defined by 12 divergent monomers that are ordered to form a larger approximately 2-kb repeating unit, known as a ‘higher-order repeat’ (HOR) (shown in grey, with HOR in black and circles representing each of the twelve approximately 171-bp monomers). The HORs are tandemly arranged into a large, multi-megabase-sized satellite array (with previously published PFGE-Southern estimates suggesting a mean of 3 Mb) with a limited number of rearrangements in the HOR repeat structure (as indicated in yellow for a deletion to a 5-mer variant) and nucleotide differences between repeat copies. Our assembly strategy initially identified and annotated all uninterrupted head-to-tail tandem arrays of ‘canonical’ repeats and sites of structural variants in each nanopore read in our DXZ1 library (Methods). The spacing of canonical repeats to flanking structural variants informed the precise alignment between reads. Contigs were generated by taking the consensus of these uniquely placed ultra-long reads. b, The T2T-X CHM13 array was originally segmented into seven structural-variant-determined contigs. Ordering and overlap between the contigs was made using shared positions of Duplex-seq DXZ1 k-mers and low-coverage (that is, 1–2 reads) support of ultra-long data that confidently spanned contig ordering. Three regions (marked with an asterisk) were only determined by single-nucleotide-variant overlap. We improved the prediction of these overlaps by implementing an orthogonal method, centroFlye, which studies single variant positions in the DXZ1 nanopore reads to guide the final positioning of the overlap between the contigs (and confirm the existing overlap in the region closest to the p-arm).
c, Comparisons of DXZ1 higher-order repeat variant frequency between the nanopore ultra-long-read data and the PacBio HiFi long-read data were highly concordant. DXZ1 repeat unit variants were predicted in the HiFi dataset using Alpha-CENTAURI. The DXZ1 repeat units, shown as arrows, are composed of 12 smaller approximately 171-bp repeats (indicated as small circles within the arrow). In total, we identified 7,316 DXZ1-containing HiFi reads. We characterized a database of 38,184 (98.2%) full-length DXZ1 canonical 12-mer repeats and 691 HORs with variant repeat structure (1.8%). Changes from the canonical repeat unit are indicated with a dashed line; each structural variant is marked with a colour, and its position within the array assembly is indicated above (ordered p-arm to q-arm). The majority of reads were determined to contain purely DXZ1-alpha satellite (7,305/7,316, or 99.85%). Of the remaining reads, ten reads provided evidence for a transition from DXZ1 into the single L1Hs insertion in our assembly. We identified only a single read that we could not assign to our assembly owing to a 902-bp homopolymer ([G]n), which may represent a sequencing artefact. d, A minimum tiling path was reconstructed for illustration purposes (as shown in Fig. 2a) and was not the mechanism for initial assembly. e, DXZ1 read overlap assembly using structural variant overlap and positioning. Read IDs and length are provided from Xp to Xq: (1) ab9c12a7-08db-4524-8332-373129eaa4fb, 442,119 bp. (2) 063fca09-81fc-4c2d-81ad-16fb2bfee76f, 364,710 bp. (3) 3d0fa869-028f-45be-be41-b2487897bb25, 380,361 bp. (4) a5cf4e19-8eff-4035-8238-ae81963b854f, 362,052 bp. (5) c6f29ca1-d84d-4881-9042-dfb37bc9f111, 482,907 bp. (6) 1ccd919f-5726-4d79-8cfe-fe2b344070a1, 275,718 bp. (7) e39308c6-0c73-45d5-9b8d-7f764af858be, 351,045 bp. (8) 86ac29ba-5a93-4c08-aa18-c07829a5b696, 393,007 bp. (9) 64d464d1-f317-4dff-a259-de6097a5cd4c, 221,510 bp. (10) 08e000a1-69dd-40fb-9fd1-942f159ec6b7, 262,585 bp. (11) 1ef64f71-9477-4a5b-bf7e-a356785cc656, 421,096 bp. (12) a1e01c13-7ca1-4dc5-85b1-6b69ec2124f9, 371,129 bp.
Extended Data Fig. 6. DXZ1 array evaluation by PFGE Southern blotting.
Alpha satellite array sizes were estimated by PFGE and Southern blotting using established methods. In silico digest of the approximately 3.1-Mb DXZ1 array is predicted to produce three bands with a complete BglI digest: about 659 kb, about 2,153 kb and about 294 kb, which are concordant with the replicate PFGE Southern experiments shown for BglI (about 2.1, about 0.7 and about 0.3 Mb). In silico digest with BstEII provides evidence for six bands, of which three are less than approximately 200 kb and below the range of detection (as marked with grey band). The three remaining bands are once again concordant with observed PFGE-Southern replicates for BstEII (about 1.8, about 0.7 and about 0.3 Mb). HAP1 and DLD1 are included as internal controls. This experiment was repeated seven times with similar results.
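
An in silico digest of this kind can be sketched in a few lines of Python. The BglI recognition sequence used below (GCCNNNN^NGGC) is general enzyme knowledge rather than something stated in the caption, and the function name `digest`, the toy sequence and the treatment of the molecule as a linear fragment are illustrative assumptions. The last lines simply check that the three predicted BglI bands add up to roughly the 3.1-Mb array size.

```python
import re

# BglI recognises GCCNNNNNGGC and cuts the top strand after the 7th base
# (GCCNNNN^NGGC). A lookahead is used so overlapping sites are not missed.
BGLI_SITE = re.compile(r"(?=GCC[ACGTN]{5}GGC)")
BGLI_CUT_OFFSET = 7


def digest(sequence, site=BGLI_SITE, cut_offset=BGLI_CUT_OFFSET):
    """Return fragment lengths from a complete digest of a linear sequence."""
    cuts = [m.start() + cut_offset for m in site.finditer(sequence.upper())]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Toy sequence with two BglI sites (illustrative only).
toy = "A" * 10 + "GCCAAAAAGGC" + "T" * 20 + "GCCTTTTTGGC" + "A" * 5
fragments = digest(toy)
print(fragments)

# Band sizes quoted in the caption, in kb.
reported_bgli_kb = [659, 2153, 294]
total_kb = sum(reported_bgli_kb)
print(total_kb)
```

Two cut sites in a linear molecule give three fragments, which matches the three-band pattern described for BglI; the fragment sizes always sum to the input length, so the 3,106-kb total is a quick consistency check against the approximately 3.1-Mb array.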
Extended Data Fig. 7. Initial polishing decreased the assembly quality within the largest repeats.
a, b, The initial Canu assembly of the GAGE locus (a) was further corrupted owing to standard long-read polishing (arrow, nanopolish) (b). Black dots are coverage of the primary allele and red dots are coverage of the secondary allele (PacBio CLR data). The CHM13 genome is effectively haploid so one allele is expected. Regions of low coverage or increased secondary allele frequency indicate low-quality regions or potential repeat collapses. Owing to mismapping of reads during the polishing process, allele coverage becomes less uniform. A modified polishing process, using the unique k-mer strategy, corrects this effect. c–f, The left-side plots are assemblies before polishing. The right-side plots show the same regions after unique k-mer-assisted polishing (racon, 2 rounds nanopolish, 2 rounds arrow, 2 rounds 10X). The regions are GAGE locus (48.6–49 Mb) (c), 70.8–71.3 Mb (d), 138.6–139.7 Mb (e) and cenX (57–61 Mb) (f). g–j, Same loci as c–f but with PacBio HiFi rather than CLR mapped.
Extended Data Fig. 8. Marker-assisted mapping using unique (single-copy) sequences that are present on the CHM13 X chromosome improves polishing.
a, 21-mer distribution from the 10X Genomics reads. 21-mers were collected with Meryl and the plot was generated with GenomeScope 1.0 to visualize and confirm the haploid nature of CHM13 and genome size (len). k-mers with counts between 5 and 58 (inclusive) were used as unique markers when polishing the X chromosome. b, Coverage histograms of PacBio CLR (black), HiFi (blue), and ultra-long (green) reads across the complete X chromosome. Reads were filtered using the same unique-marker-based filtering as for polishing. c, Mapped nanopore reads show uniform coverage across the complete X chromosome. Reads were filtered using the same unique-marker-based filtering as for polishing. Marker density is shown below the read alignments. d, Strand-seq validation of the chromosome X assembly. Strand-seq sequences only single template strands from each homologous chromosome. Sequencing reads originating from such single-stranded DNA possess directionality, a feature that can be used to assess the long-range contiguity of individual homologues. On the basis of the inheritance of single-stranded DNA we distinguish three possible strand states: WW – both homologues inherited the Watson template strand, CC – both homologues inherited the Crick template strand and WC – one homologue inherited the Watson and the other the Crick template strand. By tracking changes in strand states along each chromosome we are able to pinpoint locations of recurrent strand state changes that are indicative of a genome misassembly. We analysed a total of 57 Strand-seq libraries and mapped 28 localized strand state changes. These strand state changes are randomly distributed along the chromosome X assembly and are therefore indicative of double-strand breaks that occurred during DNA replication rather than real genome misassembly. Such breaks are usually repaired by available sister chromatids and therefore often result in a change in strand directionality. Black asterisks show small localized strand state changes. Such events are caused either by noisy reads inherent to Strand-seq library preparation or by two double-strand breaks that occurred very close to each other. e, Because it is unlikely for a double-strand break to occur at exactly the same position in multiple single cells, a real genome misassembly is visible in Strand-seq data as a recurrent change in strand state at the same position in a given contig or scaffold. None of these signatures was observed in the CHM13 chromosome X assembly.
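
The unique-marker selection described in panel a (keep k-mers whose read count falls in an inclusive range) can be illustrated with a small Python sketch. The paper counted 21-mers with Meryl and kept counts from 5 to 58; the pure-Python counter, the function names and the toy reads and thresholds below are stand-ins chosen for readability, and reverse-complement (canonical) k-mer handling is deliberately left out.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, a simplification)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def select_markers(counts, low, high):
    """Keep k-mers whose count lies in [low, high], inclusive on both ends."""
    return {kmer for kmer, n in counts.items() if low <= n <= high}


# Toy example (illustrative only): k = 3 and a count range of 2..3
# play the role of k = 21 and 5..58 from the caption.
toy_reads = ["ACGTAC", "ACGTTT", "TTTTTT"]
counts = count_kmers(toy_reads, k=3)
markers = select_markers(counts, low=2, high=3)
print(sorted(markers))
```

The lower bound discards k-mers seen too rarely to be trusted (likely sequencing errors), while the upper bound discards k-mers seen too often to be single-copy, which is why "TTT" is excluded in the toy example.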
Extended Data Fig. 9. Hi-C read mapping to the chromosome X assembly.
The whole X is shown on the left, and the right is zoomed on the DXZ4 locus. The heat map shows clear boundaries around DXZ4, indicating two large superdomains separated by DXZ4.
Extended Data Fig. 10. Methylation estimates across centromeric satellite array assembly on chromosome 8 (D8Z2) (chr8: 43,281,085–45,333,062).
Methylation estimates were calculated by smoothing methylation frequency data with a window size of 500 nucleotides. Read coverage shown relies on our unique-anchor mapping and the presence of at least one high-quality methylation call on the read (|log-likelihood| > 2.5). Similar to our previous methylation analysis on the chromosome X centromeric satellite array (DXZ1), we observe an unmethylated region (about 75 kb) in the centromere of chromosome 8 (as shown: chr8: 44,830,000–44,900,000).

Comment in

  • A long read of the human genome.
    Wrighton KH. Nat Rev Genet. 2020 Oct;21(10):577. doi: 10.1038/s41576-020-0273-5. PMID: 32747762. No abstract available.

References

    1. Jain M, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 2018;36:338–345.
    2. Schneider VA, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–864.
    3. Ross MT, et al. The DNA sequence of the human X chromosome. Nature. 2005;434:325–337.
    4. Mefford HC, Eichler EE. Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 2009;19:196–204.
    5. Langley SA, Miga KH, Karpen GH, Langley CH. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. eLife. 2019;8:e42989.