Review
Nat Rev Genet. 2020 Oct;21(10):597-614.
doi: 10.1038/s41576-020-0236-x. Epub 2020 Jun 5.

Long-read human genome sequencing and its applications

Glennis A Logsdon et al. Nat Rev Genet. 2020 Oct.

Abstract

Over the past decade, long-read, single-molecule DNA sequencing technologies have emerged as powerful players in genomics. With the ability to generate reads tens to thousands of kilobases in length with an accuracy approaching that of short-read sequencing technologies, these platforms have proven their ability to resolve some of the most challenging regions of the human genome, detect previously inaccessible structural variants and generate some of the first telomere-to-telomere assemblies of whole chromosomes. Long-read sequencing technologies will soon permit the routine assembly of diploid genomes, which will revolutionize genomics by revealing the full spectrum of human genetic variation, resolving some of the missing heritability and leading to the discovery of novel mechanisms of disease.

Conflict of interest statement

COMPETING INTERESTS

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc.

Figures

Figure 1. Overview of short-read sequencing technologies.
a) In short-read sequencing by Illumina, DNA fragments (yellow and purple) are ligated to adapters (blue and aqua) that contain unique molecular identifiers as well as sequences complementary to the oligonucleotides that are attached to the surface of a flow cell. The modified DNA is loaded onto a flow cell, and the adapters from the modified DNA hybridize to the oligonucleotides that coat the surface of the flow cell. Once the fragments have attached, cluster generation begins, where thousands of copies of each fragment are generated through a process known as bridge amplification. In this process, the strand folds over, and the adapter on the end of the molecule hybridizes to another oligonucleotide in the flow cell. A polymerase incorporates nucleotides to build double-stranded bridges of the DNA molecules, which are subsequently denatured to leave single-stranded DNA fragments tethered to the flow cell. This process is repeated over and over, generating several million dense clusters of double-stranded DNA. After bridge amplification, the reverse DNA strands are cleaved and washed away, leaving only the forward strands. Then, sequencing by synthesis begins, in which fluorescently labeled deoxyribonucleotide triphosphates (dNTPs) are incorporated into the newly synthesized DNA strand at each cycle. After incorporation, a laser excites the fluorophore on the strand, which emits a characteristic fluorescence emission signal that corresponds to the base. b) In Hi-C sequencing, nuclear chromatin is crosslinked with formaldehyde, which covalently bonds protein-DNA complexes in close proximity to each other. Crosslinked chromatin is digested with a restriction enzyme or nuclease, and single-stranded DNA overhangs are filled in and repaired with biotin-linked nucleotides before religating the DNA. Chemical crosslinks are reversed, proteins degraded, and the purified DNA is nonspecifically sheared (for example, by sonication). 
Biotin-labeled DNA is pulled down with streptavidin-conjugated beads and paired-end sequenced to reveal the junctions between two DNA loci (light and dark blue). Because the contact frequency between pairs of loci falls off sharply with increasing genomic distance, the majority of sequenced junctions encompass two loci from the same chromosome. As a result, Hi-C data can be used to provide linkage information between pairs of loci tens of megabases apart on a single chromosome (as shown in the contact map).
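The contact map mentioned in panel b can be illustrated with a minimal sketch. It assumes each sequenced junction has already been reduced to a pair of coordinates on one chromosome; the function name `contact_map`, the input format, and the bin size are hypothetical choices for illustration, not part of any Hi-C toolkit.

```python
def contact_map(pairs, chrom_length, bin_size):
    """Count Hi-C junctions per pair of genomic bins on one chromosome.

    pairs        : iterable of (pos_a, pos_b), 0-based coordinates of the two
                   loci joined in one read pair (assumed input format)
    chrom_length : chromosome length in bases
    bin_size     : width of each bin in bases
    Returns a symmetric square matrix as a list of lists.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Contacts are unordered, so mirror off-diagonal counts.
            matrix[j][i] += 1
    return matrix


if __name__ == "__main__":
    toy_pairs = [(10, 20), (10, 150), (120, 260), (5, 290)]
    for row in contact_map(toy_pairs, chrom_length=300, bin_size=100):
        print(row)
```

In real data the counts near the diagonal dominate, which is the distance dependence the caption describes.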
Figure 2. Overview of long-read sequencing technologies.
a) In single-molecule, real-time (SMRT) sequencing by Pacific Biosciences (PacBio), DNA (yellow for forward strand, dark blue for reverse strand) is fragmented and ligated to hairpin adapters (light blue) to form a topologically circular molecule known as a SMRTbell. Once the SMRTbell is generated, it is bound by a DNA polymerase and loaded onto a SMRT Cell for sequencing. Each SMRT Cell can contain up to eight million zero-mode waveguides (ZMWs), which are chambers that hold picoliter volumes. Light penetrates the lower 20–30 nm of each well, reducing the detection volume of the well to only 20 zeptoliters (20 × 10⁻²¹ liters). As the DNA mixture floods the ZMWs, the SMRTbell template and polymerase become immobilized on the bottom of the chamber. Fluorescently labeled dNTPs are added to begin the sequencing reaction. As the polymerase begins to synthesize the new strand of DNA, a fluorescent dNTP is briefly held in the detection volume, and a light pulse from the bottom of the well excites the fluorophore. Unincorporated dNTPs are not typically excited by this light but, in rare cases, can become excited if they diffuse into the excitation volume, thereby contributing to noise and error in PacBio sequencing. The light emitted from the excited fluorophore is detected by a camera, which records the wavelength and relative position of the incorporated base in the nascent strand. The phosphate-linked fluorophore is then cleaved from the nucleotide as part of the natural incorporation of the base into the new strand of DNA and released into the buffer, preventing fluorescent interference during the subsequent light pulse. The DNA sequence is determined by the changing fluorescence emissions that are recorded within each ZMW, with a different color corresponding to each DNA base (for example, green, T; yellow, C; red, G; blue, A).
b) In Oxford Nanopore Technologies (ONT) sequencing, arbitrarily long DNA molecules (orange for forward strand, purple for reverse strand) are tagged with sequencing adapters (light blue) preloaded with a motor protein on one or both ends. The DNA is combined with tethering proteins and loaded onto the flow cell for sequencing. The flow cell contains thousands of protein nanopores embedded in a synthetic membrane, and the tethering proteins bring the DNA molecules toward these nanopores. Then, the sequencing adapter inserts into the opening of the nanopore, and the motor protein begins to unwind the double-stranded DNA. An electric current is applied, which, in concert with the motor protein, drives the negatively charged DNA through the pore at a rate of about 450 bases per second. As the DNA moves through the pore, it causes characteristic disruptions to the current, known as a ‘squiggle’. Each change in current within the pore corresponds to a particular k-mer (i.e., a string of DNA bases of length k), which is used to identify the DNA sequence.
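To make the k-mer idea in panel b concrete, the sketch below lists the overlapping k-mers that would occupy a pore in turn as a strand passes through, and estimates translocation time from the quoted rate of about 450 bases per second. The function names and the choice of k = 5 in the example are illustrative assumptions; the caption does not state the value of k, and this is not a basecaller.

```python
def kmers(sequence, k):
    """Return the overlapping k-mers of a sequence, in order along the strand."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def translocation_seconds(read_length, bases_per_second=450.0):
    """Approximate time for a read to pass through a pore at a constant rate.

    The default rate is the approximate figure quoted in the caption; a
    constant rate is a simplifying assumption.
    """
    return read_length / bases_per_second


if __name__ == "__main__":
    print(kmers("GATTACA", 5))
    print(f"{translocation_seconds(100_000):.0f} s for a 100 kb read")
```

Each base contributes to k consecutive k-mers, which is why a single miscalled current level can affect several neighboring base calls.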
Figure 3. PacBio and ONT long-read data types.
a) The PacBio platform can generate continuous long reads (CLR) or high-fidelity (HiFi) reads. CLR data is generated by sequencing a SMRTbell template containing a >30 kb DNA insert (yellow for forward strand, dark blue for reverse strand). Because of the large insert size, the polymerase often only completes a single pass through one strand of the template. A base is incorrectly called in about 1 out of every 10 bases, resulting in an error rate of 8–15% in the CLR. HiFi reads are generated by circular consensus sequencing (CCS) of a SMRTbell template containing a 10–30 kb DNA insert. The smaller insert size allows the polymerase to make several passes around the SMRTbell template. A consensus sequence is produced from the subreads, resulting in an error rate of ≤1% in the HiFi read. b) The ONT platform can generate long or ultra-long reads. To generate long and ultra-long ONT reads, high-molecular-weight (HMW) DNA is first extracted from cells or tissue. This extraction is commonly performed using either a commercially available DNA extraction kit, such as Qiagen’s Puregene kit or Genomic-tip 500/G kit, or via traditional methods, such as a phenol-chloroform extraction followed by either an ethanol or isopropanol precipitation. Kit-extracted DNA most often generates long (10–100 kb) reads, whereas high-molecular-weight DNA extracted by phenol-chloroform generates ultra-long (>100 kb) reads. c) Read length distributions and base accuracies of PacBio and ONT long-read data types differ. Shown are plots of the read length and accuracy distributions for: PacBio HG002 CLR data generated on the Sequel II platform; PacBio CHM13 HiFi data generated on the Sequel II platform; ONT CHM13 long-read data generated on the PromethION; and ONT ultra-long reads generated on the MinION and GridION. Read accuracy was estimated by aligning raw reads from each data type to GRCh38 and counting alignment differences as errors in the reads. 
Links to the publicly available datasets, a description of the methods used, and the code required to reproduce the analysis are provided in a Supplementary Note. A similar analysis was also performed in which raw reads were aligned to the T2T CHM13 assembly, and differences in alignment between the reads and the highly curated ChrX were counted to estimate read accuracy. PacBio HiFi reads have a visibly higher read accuracy distribution when aligned to the CHM13 T2T assembly than to GRCh38 because the high accuracy of the HiFi reads (>99%) is sufficient to detect differences between the two genome assemblies, which are interpreted as base errors. The other long-read data types are not accurate enough to detect differences between the two genome assemblies. Consequently, the accuracy distributions for these other data types are similar (Supplementary Figure 1a and Supplementary Note). d) Homopolymer accuracy differs between PacBio and ONT long-read data types. Shown is a plot of the homopolymer accuracy for the PacBio CLR, PacBio HiFi, ONT long, and ONT ultra-long datasets used in panel c. Homopolymer error was estimated by aligning raw reads from each data type to GRCh38 and comparing the observed homopolymer length in the reads to the homopolymer length in the reference. A similar analysis was performed where raw reads were aligned to the T2T CHM13 assembly, and homopolymer error was estimated based on the comparison between the observed homopolymer length in the reads and the true homopolymer length in the highly curated ChrX assembly. In both cases, homopolymers ≥5 bases were assessed for accuracy (Supplementary Figure 1b and Supplementary Note).
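The two estimates described in panels c and d can be sketched roughly as follows. This is not the authors' code, which is in their Supplementary Note. It assumes an alignment is summarized by a SAM-style CIGAR string plus an NM edit distance, and it defines accuracy as one minus edit distance over alignment columns, which is one common convention among several. The function names are hypothetical.

```python
import re
from itertools import groupby

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_accuracy(cigar, edit_distance):
    """Estimate per-read accuracy from one alignment.

    cigar         : SAM-style CIGAR string, e.g. "90M2I5M3D"
    edit_distance : mismatches plus inserted and deleted bases (SAM NM tag)
    Accuracy is taken here as 1 - NM / alignment columns (an assumed
    definition); clipped bases and skipped regions are ignored.
    """
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XID")
    if columns == 0:
        raise ValueError("CIGAR has no aligned columns")
    return 1.0 - edit_distance / columns


def homopolymer_runs(sequence, min_length=5):
    """Return (base, start, length) for every run of at least min_length bases."""
    runs, start = [], 0
    for base, group in groupby(sequence):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((base, start, length))
        start += length
    return runs


if __name__ == "__main__":
    print(read_accuracy("90M2I5M3D", edit_distance=7))
    print(homopolymer_runs("ACAAAAATTTTTTG"))
```

Comparing the run lengths found in a read with those at the matching reference positions gives the kind of homopolymer length comparison the caption describes, with the threshold of 5 bases taken from the text.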
Figure 4. Long-read data improves genome assembly.
a) The plot shows the number of contigs and the contig N50 for 18 unphased human genome assemblies listed in Table 2. Genomes assembled from long-read data (PacBio or ONT) have fewer contigs and higher contig N50s compared to those assembled from short-read data (Illumina). Combining long-read data types (PacBio + ONT) produces a genome assembly with even fewer contigs and a higher contig N50, surpassing that of the reference genome (GRCh38, hg38) in contiguity. b) The schematic illustrates a genome assembly phasing approach known as Strand-seq. In this approach, the template strand [i.e., the Watson (W, orange) or Crick (C, teal) strand] is sequenced via short-read sequencing to generate template-specific short reads. When the W and C template strands are inherited from either parent, these template-specific reads can be assigned to either parental homologue based on the direction they map to a genome assembly. For example, here, we show Strand-seq reads aligned to chromosome 2 and binned in 200 kb genomic stretches (orange and teal bars). Strand-seq reads containing a haplotype-specific SNP are able to partition long reads into haplotype 1 (H1, empty circles) or haplotype 2 (H2, filled circles). Haplotype-partitioned long reads permit the detection of structural variation, such as the deletion in H1 shown here, and can be assembled to generate haplotigs that span the region, thereby generating a phased genome assembly. c) Chromosome ideograms are shown that compare the 2001 Human Genome Project assembly (hg1) and the 2019 T2T assembly (CHM13 rel3 assembly). hg1 had >145,000 gaps and nearly 150,000 contigs, whereas the rel3 assembly has <1000 gaps and <1000 contigs (see Table 2 for additional statistics). Contigs are represented by alternating black and gray blocks, absent sequences are represented by white blocks, and centromeres are represented by red blocks.
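Contig N50, the contiguity statistic plotted in panel a, is the length of the contig at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the function name `n50` and the toy contig lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    length of all contigs.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("at least one contig length is required")


if __name__ == "__main__":
    fragmented = [80, 70, 50, 40, 30, 20]   # many short contigs
    contiguous = [200, 90]                  # same total, fewer contigs
    print(n50(fragmented), n50(contiguous))
```

The two toy assemblies have the same total length, yet the one with fewer, longer contigs has the higher N50, which is the pattern the caption reports for long-read assemblies.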
Figure 5. Long-read data provides insights into the biological relevance of structural variation and human evolution and diversity.
a) The NOTCH2NLA, B, and C genes are located within chromosome 1q21.1, a segmental duplication-rich region of the genome partially assembled by PacBio CLR sequencing of BAC clones. The region was originally incorrectly assembled in the human reference genome. Deletions and duplications mediated by the segmental duplication-rich region can cause thrombocytopenia-absent radius (TAR) syndrome as well as distal 1q21.1 deletion/duplication syndrome. High-quality sequencing of the region allowed the breakpoints of these disease-causing rearrangements to be better defined and improved the annotation of human-specific NOTCH2NL duplicate genes. Subsequent long-read PacBio CLR and ONT sequencing of patients affected with neuronal intranuclear inclusion disease (NIID) and leukoencephalopathy identified a GGC repeat expansion in exon 1 of NOTCH2NLC (exons are in red; untranslated regions (UTRs) are in gray). Expansion of the repeat is associated with the production of anti-sense transcripts whose role is uncertain but may interfere with the expression and regulation of the gene family. Figure adapted from Ref. . SDs, segmental duplications; SVs, structural variants. b) Heatmap of differentially expressed genes located near structural variants in chimpanzee and human. Differences in macaque, chimpanzee, and human brain expression are shown for genes where a human-specific structural variant maps within 50 kbp of a transcription start and end. Structural changes, such as a deletion of an enhancer region as shown here, can cause changes in gene expression fundamental to brain development.
Figure 6. Long-read platforms can be used to sequence RNA and detect nucleic acid modifications.
a) Long-read RNA sequencing can be used for full-length isoform discovery. A newly resolved sequence on chromosome 10 of the CHM13 genome (dark blue) revealed a previously undiscovered gene, GPRIN2B (light blue). Using PacBio Iso-Seq, full-length transcripts were identified that completely span GPRIN2B, validating the new gene model. Adapted from Ref. . b) The assembly of the entire X chromosome centromere reveals that the majority of the α-satellite repeat region is heavily methylated, except for a ~93 kb hypomethylated region, which was discovered via ONT long-read sequencing of native DNA molecules and subsequent analysis with the methylation detection tool, Nanopolish. Adapted from Ref. .

References

    1. van Dijk EL, Jaszczyszyn Y, Naquin D & Thermes C The third revolution in sequencing technology. Trends Genet. 34, 666–681 (2018).
    2. Shendure J et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017).
    3. Sudmant PH et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
    4. Sudmant PH et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015).
    5. Ng SB et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 42, 790–793 (2010).