Complex genetic variation in nearly complete human genomes

Logsdon, Glennis A.; Ebert, Peter; Audano, Peter A.; Loftus, Mark; Porubsky, David; Ebler, Jana; Yilmaz, Feyza; Hallast, Pille; Prodanov, Timofey; Yoo, DongAhn; Paisie, Carolyn A.; Harvey, William T.; Zhao, Xuefang; Martino, Gianni V.; Henglin, Mir; Munson, Katherine M.; Rabbani, Keon; Chin, Chen-Shan; Gu, Bida; Ashraf, Hufsah; Scholz, Stephan; Austine-Orimoloye, Olanrewaju; Balachandran, Parithi; Bonder, Marc Jan; Cheng, Haoyu; Chong, Zechen; Crabtree, Jonathan; Gerstein, Mark; Guethlein, Lisbeth A.; Hasenfeld, Patrick; Hickey, Glenn; Hoekzema, Kendra; Hunt, Sarah E.; Jensen, Matthew; Jiang, Yunzhe; Koren, Sergey; Kwon, Youngjun; Li, Chong; Li, Heng; Li, Jiaqi; Norman, Paul J.; Oshima, Keisuke K.; Paten, Benedict; Phillippy, Adam M.; Pollock, Nicholas R.; Rausch, Tobias; Rautiainen, Mikko; Song, Yuwei; Söylev, Arda; Sulovari, Arvis; Surapaneni, Likhitha; Tsapalou, Vasiliki; Zhou, Weichen; Zhou, Ying; Zhu, Qihui; Zody, Michael C.; Mills, Ryan E.; Devine, Scott E.; Shi, Xinghua; Talkowski, Michael E.; Chaisson, Mark J. P.; Dilthey, Alexander T.; Konkel, Miriam K.; Korbel, Jan O.; Lee, Charles; Beck, Christine R.; Eichler, Evan E.; Marschall, Tobias

doi:10.1038/s41586-025-09140-6

Article
Open access
Published: 23 July 2025

Nature (2025)

Abstract

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (median continuity of 130 Mb), closing 92% of all previous assembly gaps^1,2 and reaching telomere-to-telomere status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8 and AMY1/AMY2, and fully resolve 1,852 complex structural variants. In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite higher-order repeat array length and characterize the pattern of mobile element insertions into α-satellite higher-order repeat arrays. Although most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference¹ significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference³ to a median quality value of 45. Using this approach, 26,115 structural variants per individual are detected, substantially increasing the number of structural variants now amenable to downstream disease association studies.

Main

Long-read sequencing (LRS) technologies were critical to the completion of the first human genome⁴. LRS technologies significantly increase the sensitivity to detect structural variants (SVs), defined as variants 50 bp in length or longer, and coupling LRS data with Hi-C⁵, single-cell template strand sequencing (Strand-seq)⁶ or trio data⁷ provided the necessary short-range and long-range phasing data to assemble both haplotypes. The high sequence quality and contiguity of such diploid genome assemblies have made the first draft human pangenome reference possible¹.

Despite these advances, gaps remain, especially at genetically complex loci². For example, in our previous assembly of 32 human genomes as part of the Human Genome Structural Variation Consortium (HGSVC)⁸, we found that most centromeres and more than half of the large, highly identical segmental duplications (SDs) were incomplete, resulting in missing protein-coding genes². Closing these gaps in the first complete human genome⁴ required combining the complementary strengths of PacBio high-fidelity (HiFi) reads (approximately 18 kb in length and high base-level accuracy) and ultra-long Oxford Nanopore Technologies (ONT) reads (more than 100 kb in length but with lower base-level accuracy). Computational tools such as Verkko⁹ and hifiasm (ultra-long)¹⁰ have automated this process. Here we present new resources and results from the HGSVC (Supplementary Fig. 1), targeting a diverse set of 65 humans predominantly from the 1000 Genomes Project (1kGP) cohort¹¹ with the goal of producing a genetically diverse sampling of nearly gapless chromosomes, including the centromeres and complex SDs.

Production of 130 haplotype assemblies

Data production

We selected 65 human lymphoblastoid cell lines representing individuals spanning five continental groups and 28 population groups for sequencing (Fig. 1a and Supplementary Table 1). We generated approximately 47-fold coverage of PacBio HiFi and approximately 56-fold coverage of ONT (approximately 36-fold ultra-long) long reads on average per individual (Extended Data Fig. 1a,b and Supplementary Table 2; see Methods). In addition, we performed Strand-seq (Supplementary Table 2), Bionano Genomics optical mapping (Supplementary Table 3), Hi-C sequencing (Supplementary Tables 4 and 5), isoform sequencing (Iso-Seq; Supplementary Table 6) and RNA sequencing (RNA-seq; Supplementary Table 7).

Assembly

We generated haplotype-resolved assemblies from all 65 diploid individuals using Verkko⁹ (Fig. 1a and Supplementary Tables 1 and 2; see Methods). The phasing signal was produced with Graphasing¹², leveraging Strand-seq to globally phase assembly graphs at a quality on par with trio-based workflows¹² (Methods). This approach enabled us to cover all 26 populations from the 1kGP by including individuals that are not part of a family trio. The resulting set of 130 haploid assemblies is highly contiguous (median area under the Nx curve (auN) of 137 Mb; Fig. 1b and Supplementary Table 8) and accurate at the base-pair level (median quality value between 54 and 57; Fig. 1c and Supplementary Table 9; see Methods). We estimated the assemblies to be 99% complete (median) for known single-copy genes (Extended Data Fig. 1c and Supplementary Table 10) and to close 92% of previously reported gaps in PacBio HiFi-only assemblies² (Supplementary Figs. 2 and 3 and Supplementary Table 11; see Methods).
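The contiguity metric quoted above, auN, is the length-weighted mean contig length and can be computed directly from a list of contig lengths. The following is a minimal sketch under that standard definition; the function names `aun` and `nx` are illustrative and are not taken from the assembly or evaluation tools used here, and the toy lengths in the usage note are invented.

```python
def aun(contig_lengths):
    """Area under the Nx curve: sum(L_i^2) / sum(L_i).

    Equivalent to the expected length of the contig that contains
    a base chosen uniformly at random from the assembly.
    """
    total = sum(contig_lengths)
    if total <= 0:
        raise ValueError("assembly contains no bases")
    return sum(length * length for length in contig_lengths) / total


def nx(contig_lengths, x=50):
    """Nx: length of the contig at which the cumulative length of
    contigs, sorted from longest to shortest, first reaches x% of
    the total assembly length."""
    total = sum(contig_lengths)
    if total <= 0:
        raise ValueError("assembly contains no bases")
    threshold = total * x / 100
    cumulative = 0
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative >= threshold:
            return length
    raise ValueError("x must be between 0 and 100")
```

For a toy assembly with contigs of 100, 50 and 50 bases, `aun` gives 75.0 and `nx(..., 50)` gives 100. Unlike N50, auN changes smoothly as contigs are joined or broken, which makes it a convenient single summary when comparing many assemblies.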

We integrated a range of quality control annotations for each assembly using established tools such as Flagger, NucFreq, Merqury and Inspector (Supplementary Tables 12 and 13 and Figs. 4 and 5) to compute robust error estimates for each assembled base (Supplementary Tables 14–17; see Methods). We estimated that 99.6% of the phased sequence (median) has been assembled correctly (Supplementary Table 18). For the three family trios in our dataset (SH032, Y117 and PR05 (ref. ¹¹)), we assessed the parental support for the respective haplotypes in the child’s assembly via assembly-to-assembly alignments and found that a median of 99.9% of all sequence assembled in contigs of more than 100 kb is supported by one parent assembly (Supplementary Table 19; see Methods). In total, Verkko assembled 602 chromosomes as a single gapless contig from telomere to telomere (T2T; median of 10 per genome) and an additional 559 as a single scaffold (median of 8 per genome), that is, in a connected sequence containing one or more N-gaps (Fig. 1d,e, Supplementary Table 20 and Supplementary Fig. 6; see Methods).

Certain regions, such as centromeres or the Yq12 region, remained challenging to assemble and evaluate. We therefore complemented our assembly efforts by running hifiasm (ultra-long)¹⁰ on the same dataset (Supplementary Tables 21–23 and Supplementary Figs. 7 and 8; see Methods), but restricted the use of the resulting assemblies to extending our analysis set for centromeres and the Yq12 region after manual curation of the relevant sequences.

Variant calling

From our phased assemblies, we identified 188,500 SVs, 6.3 million indels and 23.9 million single-nucleotide variants (SNVs) against the T2T-CHM13v.2.0 (T2T-CHM13) reference (Fig. 1f). Against GRCh38-NoALT (GRCh38), we identified 176,531 SVs, 6.2 million indels and 23.5 million SNVs (Supplementary Table 24; see Data availability). Callsets for both references were led by PAV (v.2.4.0.1)⁸ with orthogonal support from 10 other independent callers with sensitivity for SVs, indels and SNVs (Supplementary Table 25; see Methods). We found higher support for PAV calls across all callers (99.7%) than other methods (99.7% to 67.9%; Extended Data Fig. 1d and Supplementary Fig. 9), with one exception for SVIM-asm, when run using the alignment parameters for PAV (99.70% SVIM-asm versus 99.66% PAV; Supplementary Table 26). With our current assemblies and this approach, we increased the size of the SV callset by 59% and reduced false discovery by 55% on average compared with previous callsets⁸ (Supplementary Tables 27 and 28 and Supplementary Methods). With one additional individual, we estimated that our callset would increase by 842 SV insertions and deletions with a 1.86× enrichment for an African versus a non-African individual (1,117 versus 599; Supplementary Methods).

Per assembled haplotype, we identified 7,772 SV insertions (12,903 per genome) and 7,745 SV deletions (12,505 per genome) on average in the T2T-CHM13 reference (Fig. 1g). As expected, GRCh38 SVs are unbalanced^8,13 with 11,275 SV insertions per haplotype (17,458 per genome) and 6,972 SV deletions per haplotype (10,868 per genome) on average (Extended Data Fig. 1e and Supplementary Tables 29 and 30), with excess insertions occurring in high-allele-frequency variants, which can be largely explained by reference errors¹⁴. As expected, a distinct peak for fixed SVs (100% allele frequency) is apparent for GRCh38 SV insertions composed of variants in GRCh38 with no representation in T2T-CHM13 (Extended Data Fig. 1f).

An improved genomic resource

Mobile element insertions

Mobile element insertions (MEIs)¹⁵ constitute 8.2% of all SVs (relative to T2T-CHM13). We identified 12,919 putative MEIs from the 130 haplotype assemblies (Supplementary Table 31 and Supplementary Fig. 10; see Methods; for the GRCh38 union callset, see Supplementary Table 32 and Supplementary Fig. 11). Comparison with an orthogonal MEI callset showed a high concordance of 92.1% (Supplementary Tables 33 and 34; see Methods). Of note, we found 559 full-length L1 insertions (L1HS and L1PA2), with 96.1% possessing at least one intact open reading frame (ORF) and 82.3% harbouring two intact ORFs. Therefore, the vast majority of full-length L1 MEIs appear to retain the potential to retrotranspose. Compared with our previous study⁸ (n = 9,453 MEIs; 7,738 for Alu, 1,175 for L1 and 540 for SINE-VNTR-Alu (SVA)), the total number of MEIs increased by 36.65% primarily due to an increase in individuals of African descent (Supplementary Fig. 10d). Finally, we screened the PAV deletion callset and identified 2,450 polymorphic MEIs present in T2T-CHM13 (Supplementary Tables 35 and 36 and Supplementary Fig. 12).

Inversions

Identifying inversions is challenging due to the frequent location of their boundaries in long, highly identical repeat sequences. We identified 276 T2T-CHM13-based and 298 GRCh38-based inversions in the main callset and performed quality control by re-genotyping these calls using ArbiGent on Strand-seq data¹⁶ (Supplementary Tables 37 and 38 and Supplementary Methods) as well as manual inspection (Supplementary Table 37, Supplementary Figs. 13 and 14 and Supplementary Methods). Of note, we found 21 novel inversions in the PAV callset, of which 18 were detected among 24 new individuals added in the current study. These include a large (1.8 Mb) inversion at chromosome 5q35 that overlaps with the Sotos syndrome critical region¹⁷.

Segmental duplications

SDs are defined independently for each haplotype as segments occurring more than once with more than 1 kb in length and more than 90% identity. Owing to their propensity to undergo non-allelic homologous recombination, they are enriched tenfold for copy number variation and are the source of some of the most complex forms of genic structural polymorphism in the human genome^18,19. Overall, we found an average of 168.1 Mb (s.d. of 9.2 Mb) of SDs per human genome and observed an improved representation of interchromosomal SDs (Supplementary Figs. 15 and 16) when compared with the Human Pangenome Reference Consortium (HPRC) release¹. Using T2T-CHM13 as a gauge of completeness (193.7 Mb), we estimated that 25.6 Mb of SDs still remain unresolved per haploid genome (Extended Data Fig. 2a). Most of these unresolved SDs (21.2 Mb) correspond to the acrocentric short arms of chromosomes 13, 14, 15, 21 and 22 (refs. ^4,20). We found that 80–90% of SDs are accurately assembled depending on the genome (Supplementary Figs. 17 and 18; see Methods).
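The SD definition above amounts to two thresholds applied to a within-haplotype alignment between two copies of a sequence. The sketch below illustrates only that thresholding step; the `PairwiseHit` record and the function name are hypothetical, and it assumes that candidate paralogous alignments have already been produced by some upstream tool, which is not shown and is not the detection pipeline used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseHit:
    """Hypothetical record for one alignment between two copies of a
    sequence on the same haplotype assembly."""
    length_bp: int    # aligned length in base pairs
    identity: float   # fraction of matching bases, between 0 and 1


def is_segmental_duplication(hit, min_length_bp=1_000, min_identity=0.90):
    """Apply the stated criteria: more than 1 kb in length and more
    than 90% sequence identity (both thresholds are exclusive)."""
    return hit.length_bp > min_length_bp and hit.identity > min_identity


def total_sd_bases(hits):
    """Sum aligned lengths of qualifying hits. This is a simplification:
    overlapping hits are counted more than once, so a real annotation
    would merge intervals before summing."""
    return sum(h.length_bp for h in hits if is_segmental_duplication(h))
```

Because both thresholds are exclusive, a hit of exactly 1,000 bp or exactly 90% identity is not counted under this reading of the definition.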

When analysing SDs outside of acrocentric regions and where the copy number was supported by fastCN (Supplementary Fig. 19; see Methods), we classified at least 92.8 Mb of the SDs as shared among most humans (present in at least 90% of individuals) and 61.0 Mb as variable across the human population (Extended Data Fig. 2b). In addition, we identified 33 Mb of the SD sequence present in a single copy or not annotated as SDs in T2T-CHM13 (Extended Data Fig. 2c,d). The majority of these (23.6 Mb, including 2.4 Mb of X chromosome SDs) are novel when compared with a recent analysis of 170 human genomes²¹ and completely or partially overlap with 167 protein-coding genes (Supplementary Fig. 20). Of note, 31 loci (0.4 Mb) are shared among most humans but not classified as duplicated in the T2T-CHM13 human genome, suggesting that this unique status in the reference represents the minor allele in the human population, a cell line artefact or, less likely, an error in the assembly. Examining genomes by continental group, both the absolute SD content²¹ (Supplementary Figs. 21 and 22) and the number of new SDs added per genome are highest for African individuals (3.97 Mb per individual) when compared with genomes of non-African individuals (2.88 Mb per individual).

Genomes with African ancestry have, on average, 468 additional paralogous genes (n = 21,595 total genes) when compared with genomes of non-African individuals (n = 21,127 total genes; Methods). We identified a total of 727 multi-copy genes that have SDs spanning at least 90% of the gene body, with a large proportion corresponding to shared (n = 335 or 46.1%) and variable (n = 292 or 40.2%) SDs (Supplementary Table 39). Comparing the copy numbers to the HPRC assemblies¹, we discovered a similar distribution of genes (Supplementary Fig. 23). Among copy number polymorphic genes, we identified 16 gene families in which the distribution significantly differs between the HPRC and our data (Supplementary Fig. 23; adjusted P < 0.05, two-sided Welch’s t-test); however, the contiguity for copy number variant genes was considerably greater in our assemblies versus HPRC; 5.88% of duplicated genes in our assemblies are within 200 kb of a contig break or unknown base (‘N’) compared with 13.95% of duplicated genes in HPRC assemblies (Supplementary Fig. 24).

Y chromosome variation

The Y chromosome remains among the most challenging of human chromosomes to fully assemble due to its highly repetitive sequence composition²⁰ (Fig. 2a). Our resource provides highly contiguous Y assemblies for 30 male individuals. Seven of these (23%) assembled without breaks across the male-specific Y region, that is, excluding the pseudoautosomal regions (six assembled as T2T scaffolds and one has a break in pseudoautosomal region 1; Supplementary Figs. 25 and 26). Of these seven, four are novel fully assembled human Y chromosomes representing E1b1a, R2a and R1b1a Y lineages prevalent in populations of African, Asian and European descent^22,23,24 (Supplementary Fig. 27).

**Fig. 2: An improved genomic resource for challenging loci.**

Our assemblies enable the investigation of the largest heterochromatic region in the human genome, Yq12, mostly composed of highly similar (but size variable) alternating arrays of DYZ1 (HSat3A6, approximately 3.5-kb unit size) and DYZ2 (HSat1B, approximately 2.4-kb unit size) repeats (Fig. 2a). The Yq12 regions across 16 individuals (9 novel and 7 previously published) range from 17.85 to 37.39 Mb (mean of 27.25 Mb, median of 25.62 Mb), with high levels of variation in the number (34–86 arrays; mean of 60, median of 58) and length of DYZ1 (24.4 kb to 3.59 Mb; mean of 525.7 kb, median of 455.0 kb) and DYZ2 (11.2 kb to 2.20 Mb; mean of 358.0 kb, median of 273.3 kb) repeat arrays^23,24 (Supplementary Table 40 and Supplementary Fig. 28). Investigating the dynamics of Yq12 remains challenging²⁵; however, using the duplication and deletion patterns of four unique Alu insertions, we can examine this genomic region over time (Fig. 2a and Supplementary Fig. 28). For example, in NA19239, the presence of two unique AluY retrotransposon insertions allows clear visualization of a tandem duplication in the region.

Functional effects of SVs

To identify SVs disrupting protein-coding genes under selective constraint²⁶, we intersected all 176,531 GRCh38-based SVs with coding exons from GENCODE v.45. We found 1,535 SVs, including 938 deletions, 80 inversions, 504 insertions and 13 MEIs, that disrupt 985 unique genes (Supplementary Table 41). A mean of 368 genes per genome have an SV breakpoint altering the coding sequence. On average, only 11.7 genes (3.2%) were disrupted by a singleton variant unique to that individual, whereas 96.8% of genes were disrupted by polymorphic SVs, and 27.8% were disrupted by major-allele SVs (more than 50% allele frequency). Of the 985 genes affected by SVs, only 37 were predicted to be intolerant to loss of function in humans (loss-of-function observed/expected upper bound fraction (LOEUF) < 0.35)²⁷. Polymorphic SVs altered 16 constrained genes, suggesting that the SVs did not result in loss of function. Indeed, we found that tandem repeat unit variants in coding sequences of four constrained genes were in frame (MUC5B, ACAN, FMN2 and ARMCX4). Deletion of one or more 59-bp VNTR units overlapping the last 8 bp of MUC5B exon 37 left coding sequences and splice sites intact.

To assess isoform differences and detect imprinted genes, we generated long-read Iso-Seq data for 12 of the 65 individuals (EBV-transformed lymphoblastoid B cell lines) and aligned these to donor-matched haplotype assemblies (Fig. 2b, Extended Data Fig. 3a and Supplementary Methods). Using our SV callset (Methods), we identified 136 structurally variable protein-coding gene sequences (Supplementary Table 42 and Supplementary Methods). Of these 136 genes, 58% (n = 79) contained a common SV (allele frequency > 0.05; Extended Data Fig. 3b). One example, ZNF718, creates nine unique isoforms (Fig. 2c) due to a common (allele frequency = 0.55) 6,142-bp polymorphic deletion that removes exons 2 and 3 from the canonical transcript as well as the 3′ part of an exon annotated as an alternate first exon (Extended Data Fig. 3b). Across the 14 wild-type ZNF718 haplotypes, we found three known isoforms and four previously unreported isoforms (Methods). In contrast to other protein-coding genes with a single SV (Extended Data Fig. 3c), we found greater transcript diversity among the variant haplotypes of ZNF718 than wild-type haplotypes. We also searched for SVs affecting nearby gene expression (RNA-seq) and identified 122 unique SVs proximal (less than 50 kb) to 98 differentially expressed genes across the 12 individuals, representing an enrichment compared with randomly permuted SVs (Extended Data Fig. 3d; empirical P = 0.001; Supplementary Table 43 and Supplementary Fig. 29; see Methods). Genome-wide, SVs were depleted across protein-coding genes and regulatory regions in the genome, as expected²⁸ (Extended Data Fig. 3e,f and Supplementary Fig. 30). By intersecting these 122 SVs with Hi-C data from the same individuals, we found that 29 of the SVs (associated with 24 genes) correspond to contact density changes in chromatin conformation regions (Extended Data Fig. 3g, Supplementary Table 44 and Supplementary Methods). Finally, we identified 3,818 SVs in high linkage disequilibrium with single-nucleotide polymorphism (SNP) loci from genome-wide association studies (GWAS) of human disease (Extended Data Fig. 3h and Supplementary Table 45; see Methods).
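The enrichment of SVs near differentially expressed genes is reported as an empirical P value from permuted SV positions. A generic permutation test of this kind can be sketched as follows. This is an illustration only: it assumes a single chromosome, treats each SV as a point position, and redraws positions uniformly at random, whereas the actual permutation scheme is described in the Methods and is not reproduced here. All names and the default of 999 permutations are assumptions.

```python
import random


def count_proximal(sv_positions, gene_intervals, window=50_000):
    """Count SVs lying within `window` bp of any (start, end) gene interval."""
    return sum(
        any(start - window <= pos <= end + window for start, end in gene_intervals)
        for pos in sv_positions
    )


def empirical_p(sv_positions, gene_intervals, chrom_length,
                n_perm=999, window=50_000, seed=0):
    """One-sided empirical P value for enrichment of SVs near genes.

    SV positions are redrawn uniformly along the chromosome n_perm times.
    The +1 in numerator and denominator counts the observed data as one
    permutation, so the smallest attainable value is 1 / (n_perm + 1).
    """
    rng = random.Random(seed)
    observed = count_proximal(sv_positions, gene_intervals, window)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled = [rng.randrange(chrom_length) for _ in sv_positions]
        if count_proximal(shuffled, gene_intervals, window) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_perm + 1)
```

With 999 permutations, the smallest P value this procedure can return is 0.001, so a reported value at that floor means that no permuted set was at least as enriched as the observed one.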

Genotyping and integrated reference panel

Genome-wide genotyping with PanGenie

Pangenome references have enabled genome inference, a process leveraging haplotype structures to genotype all variation encoded within a pangenome in a new individual from short-read whole-genome sequencing data³. We therefore constructed a pangenome graph containing all 65 genomes assembled here as well as 42 HPRC genome assemblies¹ with Minigraph-Cactus and detected variants by identifying graph bubbles relative to T2T-CHM13 (Methods). We used PanGenie to genotype bubbles across all 3,202 individuals from the 1kGP cohort based on Illumina data²⁹ and decomposed the 30,490,169 bubbles into 28,343,728 SNPs, 10,421,787 indels and 547,663 SV alleles¹ (Supplementary Fig. 31; see Methods). Leave-one-out experiments confirmed high genotype concordance of up to approximately 94% for biallelic SVs (Supplementary Figs. 32–34), and filtering the genotypes^1,8 resulted in a set of reliably genotypable variants comprising 25,695,951 SNPs, 5,774,201 indels and 478,587 SV alleles (Supplementary Table 46, Supplementary Figs. 35 and 36 and Supplementary Methods). We note that this set of SV alleles is larger than our main PAV callset (188,500 SVs) because it includes the HPRC genome assemblies and at the same time retains all SV alleles at multi-allelic sites (Supplementary Fig. 37 and Supplementary Methods).

We compared our genotyped set to other SV sets for the 1kGP cohort, including the HPRC PanGenie genotypes that we produced previously¹, as well as the 1kGP short-read high-coverage SV callset (1kGP-HC)²⁹ (Supplementary Figs. 38 and 39). On average, we found 26,115 SVs per genome, whereas this number was 18,462 for the HPRC genotypes and 9,596 for the 1kGP-HC SV calls. We specifically observed increases for rare variants (allele frequency < 1%; Fig. 2d). While the average number of rare SVs per genome was 87 for non-African individuals in the HPRC set and 169 in the 1kGP-HC set, we can now access on average 362 rare alleles. For African individuals, we detected 1,490 rare SVs per genome, whereas there were 382 previously for the HPRC and 477 for the 1kGP-HC set.

Personal genome reconstruction

Next, we asked to what extent our improved genotyping abilities allow us to reconstruct the full haplotypic sequences of genomes sequenced with short reads. To this end, we combined our filtered PanGenie genotypes with rare SNP and indel calls obtained from Illumina reads for all 3,202 1kGP individuals (Methods) and phased this combined set using SHAPEIT5 (Supplementary Fig. 31, step 3, and Supplementary Figs. 40 and 41; see Methods).

We produced consensus haplotype sequences for all 3,202 individuals (6,404 haplotypes) by implanting the phased variants into T2T-CHM13 (only chromosomes 1–22 and X chromosome) and compared with consensus haplotypes produced from the GRCh38-based phased 1kGP-HC panel²⁹. While the median k-mer-based quality value of the long-read assemblies was 53, we observed a median k-mer-based quality value of 45 for the consensus haplotypes computed from our short-read-based phased genotypes (Fig. 2e and Supplementary Fig. 42). To enable a fair comparison with the GRCh38-based 1kGP-HC consensus haplotypes, we additionally computed our k-mer-based quality value estimates restricted to regions shared between T2T-CHM13 and GRCh38 (‘CHM13-syntenic’). For these regions, we observed a median quality value of 48, whereas the quality value for the 1kGP-HC set was lower (median of 43; Fig. 2e and Supplementary Fig. 42). In addition, we observed higher k-mer completeness values (median of 97.4%) than for the 1kGP-HC-phased set (median of 97.1%; Extended Data Fig. 4a and Supplementary Fig. 42). Because k-mer-based quality value estimates do not fully capture structural sequence correctness, we additionally used PAV to compute variant-calling-based quality value estimates for each 1-Mb genomic window (Methods). As expected, this resulted in lower quality value estimates (median quality value for 1kGP-HC of 26.7; median quality value for PanGenie of 34.2), but confirms the gain of PanGenie over standard short-read pipelines (Supplementary Figs. 43–45). Of note, PanGenie enables an accurate genome reconstruction of quality value > 30 routinely (78% of all 1-Mb windows), whereas that is rarely achieved for the 1kGP-HC callset (24% of all 1-Mb windows).
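The quality values compared above are on the Phred scale, so each can be read as an expected per-base error rate. The following is a minimal sketch of that conversion; the function names are illustrative, and the sketch does not reproduce how the k-mer-based or variant-calling-based error estimates themselves are derived.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> expected errors per base.

    QV = -10 * log10(error_rate), so error_rate = 10 ** (-QV / 10).
    """
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Expected errors per base -> Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def bases_per_error(qv):
    """Average number of bases between errors at a given quality value."""
    return 1 / qv_to_error_rate(qv)
```

Under this reading, a quality value of 30 corresponds to about one error per 1 kb and a quality value of 45 to about one error per 32 kb, which is why the step from the 1kGP-HC consensus haplotypes to the PanGenie-based ones represents a several-fold reduction in expected errors rather than a marginal one.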

Targeted genotyping of complex loci

Although PanGenie performed well in this genome-wide setting, its use of k-mer information could make it difficult to genotype complex, repeat-rich loci with few unique k-mers. We therefore used the targeted method Locityper³⁰ to genotype the 1kGP cohort across 347 polymorphic targets covering 18.2 Mb and 494 protein-coding genes (Methods), including 268 challenging medically relevant genes³¹. For this challenging set of regions, the 1kGP-HC callset reaches a variant-based quality value of 30 for only 34.5% and a variant-based quality value of 40 for only 12.8% of predictions³⁰.

The performance of Locityper is constrained by the haplotypes available in the reference set. Therefore, we first evaluated haplotype availability by comparing sequences of the unrelated assembled haplotypes. Across all target loci, 51.5% of our assembled haplotypes were similar (variant-based quality value ≥ 40) to some other haplotype from the full reference panel described above, compared with only 39.6% of haplotypes when restricting to an HPRC-only reference panel¹ (Fig. 2f).

The increased haplotype availability translates into improved genotyping of polymorphic loci: in a leave-one-out experiment, 80.0% of haplotypes were predicted with a variant-based quality value ≥ 30, compared with 74.6% of haplotypes for the HPRC-only panel (Methods). These global improvements are mirrored by improvements of individual genes (Extended Data Fig. 4b), including HLA-DRB5, HLA-DPA1 and HLA-B (Extended Data Fig. 4c). Finally, we asked what performance could potentially be achieved for growing reference panels and therefore used the full reference panel, including samples to be genotyped. Here Locityper predicts haplotypes with an average quality value of 45.8, suggesting that sequence resolution of more reference haplotypes will aid future re-genotyping of challenging medically relevant genes, with applications to disease cohorts.

Major histocompatibility complex

Given the disease relevance and complexity of the 5-Mb MHC region^32,33,34 (Fig. 3a), we annotated 27–33 human leukocyte antigen (HLA) genes and 140–146 non-HLA genes or pseudogenes along with the associated repeat content of the 130 complete or near-complete MHC haplotypes (Supplementary Table 47). While 99.2% (357 of 360) of the HLA alleles agree with classical typing results³⁵ (Supplementary Tables 48 and 49), we resolved a total of 826 incomplete HLA allele annotations in the IPD-IMGT/HLA reference database³⁶ (Supplementary Table 50), including 112 sequences from the HLA-DRB loci, important for vaccine response and autoimmune disease^37,38. We detected 170 SVs absent from reported reference haplotypes^39,40 (Supplementary Table 51), including a deletion of HLA-DPA2 (HG03807, haplotype 1).

**Fig. 3: Structurally variable regions of the MHC locus.**

The observed MHC class II haplotypes reflect the established DR group system (Fig. 3b and Supplementary Table 52) and comprise representatives of DR5, DR8 and DR9, which have not previously been analysed in detail^39,40. In this system, the functional DRB3, DRB4 and DRB5 genes differentially associate across the DR groups, with DR1 and DR8 groups uniquely lacking either of them. Repeat element analyses (Supplementary Figs. 46–48; see Methods) suggest that DR8 arose from an intrachromosomal deletion mediated by 150 bp of sequence homology between HLA-DRB1 and HLA-DRB3 on the DR3/5/6 haplotype, as previously reported⁴¹ (Fig. 3c). DR1 is most likely derived by recombination between DR2 and DR4/7/9 (Fig. 3c and Supplementary Figs. 46 and 49). Finally, our catalogue of solitary HLA-DRB exon sequences⁴² includes refined copy number estimates (for example, three solitary HLA-DRB exon 1 sequences instead of one in the HLA-DRB9 region of DR1), as well as identification of a polymorphic, solitary exon 10 kb 3′ of HLA-DRB1 (Fig. 3b; see Methods).

Similarly, we characterized the RCCX (STK19 (R), C4 (C), CYP21 (C) and TNX (X)) multi-allelic cluster (Fig. 3d, Supplementary Table 53 and Supplementary Fig. 50), in which phasing and variant classification have been challenging due to extensive sequence homology⁴³. Tandem duplications (aka RCCX bi-modules) are the most abundant (74.6% or n = 97), with mono-modules and tri-modules comparable in frequency (13.1% (n = 17) and 12.3% (n = 16), respectively; Supplementary Fig. 50). Resolved haplotypes also facilitate the detection of interlocus gene conversion events critical for RCCX evolution⁴⁴, such as two haplotypes with a tri-modular RCCX with two functional CYP21A2 copies; one mono-modular and one bi-modular haplotype with no functional CYP21A2 genes; and one tri-modular haplotype with a unique configuration where C4B precedes C4A and carries two CYP21A2 copies, one of which is non-functional (Fig. 3d). We suggest that the latter haplotype was generated by introduction of a nonsense mutation and two gene conversion events, converting CYP21A1P into CYP21A2 and C4A into a C4B that now unusually encodes the Rodgers blood group epitope. We also identified seven novel C4 amino acid variants (Supplementary Figs. 51 and 52).

Next, we evaluated the performance of Locityper across 19 MHC protein-coding genes and 14 pseudogenes. Across all 33 loci, Locityper correctly predicted gene alleles in 81.0% of cases when restricting to a limited HPRC-only reference panel (45 individuals)¹. Inclusion of our assemblies (n = 107 individuals or 214 phased haplotypes) increased accuracy to 86.3% (leave-one-out experiment) and 97.1% (full panel leveraging all 214 phased haplotypes; Extended Data Fig. 4c), underscoring the value of accurate phased assemblies for the interpretation of short-read data.

Finally, we tested whether the established HLA class II DR group nomenclature could be recapitulated using unbiased, sequence-based analysis. Applying a pangenomic multiscale approach, PGR-TK⁴⁵ (Fig. 3e), to a subset of our genomes (n = 55) as well as T2T-CHM13 (ref. ⁴), we identified 63 conserved blocks greater than 6 kb. Multiscale hierarchical clustering of the haplotypes perfectly reconstituted the traditional DR group system in the region around HLA-DRB1 (Fig. 3e). However, we also observed additional diversified subgroups indicating the possibility for a more fine-grained future classification of HLA-DR haplotypes or utility in the context of GWAS, especially when coupled with the improved targeted genotyping ability (Extended Data Fig. 4c).

Complex structural polymorphisms

Long-read-assembled genomes significantly enhance the detection and characterization of complex structural variants (CSVs) defined here as a single event composed of simple SVs spanning more than one repair junction. Because CSV breakpoints are often located in repetitive sequences, including SDs and MEIs^46,47,48,49, we recently updated PAV⁸ to identify CSVs embedded in large complex repeats such as SDs (Methods). Using this method against the T2T-CHM13 reference genome, we found on average 72 CSVs per genome⁵⁰ (range of 51–91; Supplementary Table 54; see Data availability). Across all genomes, we identified 1,247 CSVs with 128 distinct complex reference signatures⁵⁰, consistent with known CSVs derived from diverse individuals⁵¹. We found that 27% of CSVs have locally duplicated sequences, and 38% have local inversions. Many of the complex structures that we identified are mediated by SDs, such as INVDUP-INV-DEL (174 CSVs and 92% SDs), DEL-INV-DEL (34 CSVs and 21% SDs) and INVDUP-INV-INVDUP (8 CSVs and 75% SDs) where DEL is a reference deletion, INV is an inverted sequence that is not duplicated and INVDUP is a duplicated inversion (one copy in each orientation)⁵⁰. As an example, we highlight two CSVs involving NOTCH2NL and NBPF, genes implicated in the expansion of the human brain during evolution⁸, as well as a core duplicon associated with genomic instability⁵². Although the full structures could not be resolved by previous optical mapping or sequencing experiments, we can distinguish three distinct haplotype structures, including a reference haplotype (13.7% allele frequency), a 930-kb CSV (DEL-INV-DEL) inverting NBPF8 and deleting NOTCH2NLR and NBPF26 (35.9% allele frequency; Fig. 4a), and a 513-kb CSV with a distal template switch replacing NBPF8 with NBPF9 (50.8% allele frequency; Supplementary Fig. 53).

**Fig. 4: Complex SVs in human populations.**

As a second example, the structurally complex region containing SMN1 and SMN2 gene copies is associated with spinal muscular atrophy, and successful ASO-mediated gene therapies involve SMN2 (refs. ^53,54). The genes are embedded in a large SD region (approximately 1.5 Mb) that has been almost impossible to fully sequence resolve despite the advances of the past two decades^1,2,8 (Supplementary Fig. 54). We successfully assembled, validated and characterized 101 haplotypes to fully resolve the structure and copy number of SMN1/2, SERF1A/B, NAIP and GTF2H2/C (Methods). We found that 48% (n = 48) of haplotypes carry exactly two copies of SMN1/2, SERF1A/B and GTF2H2/C, whereas NAIP is present mostly in a single copy. We highlight 11 human haplotypes showing increasing complexity (Fig. 4b–d). We specifically distinguished functional SMN1 and SMN2 copies based on our assemblies (Supplementary Fig. 55) and compared them with the short-read-based genotyping methods Parascopy and SMNCopyNumberCaller (Methods). For individuals with two fully assembled haplotypes (n = 31), predicted SMN1/2 copy numbers matched perfectly among the three methods (Supplementary Fig. 56). Our analysis shows that 98 haplotypes carry the ancestral SMN1 copy but three do not and are potentially disease-risk loci that may have arisen as a result of interlocus gene conversion (Fig. 4e and Supplementary Fig. 57).

Finally, we analysed the complex amylase locus spanning 212.5 kb on chromosome 1 (GRCh38; chr. 1: 103554220–103766732) and containing genes AMY2B, AMY2A, AMY1A, AMY1B and AMY1C⁵⁵ (Fig. 4f). From 65 sequence-resolved genomes, we identified 39 distinct amylase haplotypes, capturing approximately 83% of the haplotypes in the population (Supplementary Table 55 and Supplementary Figs. 58 and 59), 35 of which were supported by both Verkko and optical genome mapping de novo assemblies. The length of these amylase haplotypes ranges from 111 kb (H1^a.1 and H1^a.2) to 582 kb (H11.1; Fig. 4f), including those that are structurally identical to the GRCh38 (H3^r.1) and T2T-CHM13 (H7.3) assemblies. Among these, four are common: H1^a.1 (n = 14), H3^r.1 (n = 13), H3^r.2 (n = 19) and H3^r.4 (n = 22; constituting 57% of all genomes), whereas 23 are singletons. We identified nine haplotypes previously supported only by optical genome mapping data and fully sequence resolved the largest haplotype (H11.1; 11 AMY1 (8.8 kb) copies)^55,56,57 (Fig. 4f).

Centromeres

Human centromeres are among the most mutable genomic regions and are composed of tandemly repeating α-satellite DNA organized into higher-order repeats (HORs) spanning up to several megabases on each chromosome⁵⁸. It has been estimated that approximately 22% of centromeres vary by over 1.5-fold in length, and approximately 30% of them vary in their structure⁵⁹. To understand the genetic and epigenetic centromeric variation in these 65 individuals, we first assessed contiguity and accuracy using two assembly algorithms (Methods). We identified 822 Verkko centromeres and 777 hifiasm centromeres that were completely and accurately assembled. Only 28.3% were correctly assembled by both assemblers, with Verkko and hifiasm uniquely resolving a similar subset (37.7% and 34.1%, respectively). We combined these two datasets into a non-redundant set of 1,246 completely and accurately assembled centromeres (approximately 52 centromeres per chromosome and approximately 19.5 centromeres per genome, on average; Extended Data Fig. 5a and Supplementary Tables 56 and 57).
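
The counts reported above can be cross-checked with simple set arithmetic. The following Python sketch assumes that the non-redundant set is a plain union of the two assemblers' callsets (an assumption made for illustration; the actual merging procedure is described in Methods) and derives the implied overlap by inclusion-exclusion. All variable names are illustrative.

```python
# Counts reported in the text.
verkko_total = 822    # centromeres completely and accurately assembled by Verkko
hifiasm_total = 777   # the same, for hifiasm
union_total = 1246    # non-redundant combined set

# Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
# Assumption: the combined set is a simple union of the two callsets.
shared = verkko_total + hifiasm_total - union_total
verkko_only = verkko_total - shared
hifiasm_only = hifiasm_total - shared


def pct(count, total=union_total):
    """Percentage of the combined set, rounded to one decimal place."""
    return round(100 * count / total, 1)


print("shared:", shared, pct(shared))
print("Verkko only:", verkko_only, pct(verkko_only))
print("hifiasm only:", hifiasm_only, pct(hifiasm_only))
```

Under this assumption, the shared fraction reproduces the reported 28.3%, and the assembler-specific fractions fall within about 0.1 percentage points of the reported values, a difference that may reflect rounding or the way redundant calls were resolved.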

We first measured the variation in the length of the centromeric α-satellite HOR array (or arrays) on each chromosome. Although active centromeric α-satellite HOR arrays are, on average, 2.3 Mb in length, there is considerable variation, including outliers (Fig. 5a, Supplementary Table 57 and Supplementary Figs. 60 and 61). For example, the active α-satellite HOR arrays from chromosomes 3, 4, 10, 13–16, 21 and the Y chromosome are consistently smaller, whereas those on chromosomes 1, 11 and 18 are larger than average (Supplementary Fig. 61). Among the 1,246 centromeres, we identified 4,153 new α-satellite HOR variants and novel active α-satellite HOR array organizations (Fig. 5b and Supplementary Figs. 62 and 63). On chromosome 1, for example, we identified an insertion of monomeric α-satellite into the D1Z7 α-satellite HOR array, effectively splitting the α-satellite into two distinct HOR arrays (Fig. 5b). A similar bifurcation event also occurred on the centromeres of chromosomes 12 and 19, generating two α-satellite HOR arrays where there typically is only one (Fig. 5b,c). In addition, we found novel α-satellite HOR array organizations for chromosomes 6 and 10 that differ from the CHM1 and CHM13 arrays on those chromosomes⁵⁹ (Fig. 5b and Supplementary Fig. 62b,c). These array organizations, which are the most common in our dataset, are primarily composed of either 18-monomer α-satellite HORs (chromosome 6) or 6-monomer α-satellite HORs (chromosome 10).

**Fig. 5: Variation in the sequence, structure and methylation pattern among 1,246 human centromeres.**

To determine how variation in centromeric sequence and structure affects their epigenetic landscape, we assessed the CpG methylation pattern along each centromere using native ONT data. We found that all centromeres contain at least one region of hypomethylation (termed the ‘centromere dip region’ (CDR))^58,60, which is thought to mark the site of the kinetochore. However, in many cases, such as on chromosomes 6, 15 and 19, there were at least two CDRs more than 80 kb apart (Fig. 5b, Extended Data Fig. 5b–d and Supplementary Fig. 64). This suggests the presence of a ‘di-kinetochore’, which may form a dicentric chromosome on approximately 7% of chromosomes, but additional analyses that assess the location of the centromeric histone H3 variant, CENP-A, will need to be performed to confirm these putative kinetochore sites. We generated sequence identity heatmaps of each centromere and found that the CDR often resides within the most highly identical regions of the α-satellite HOR arrays (Fig. 5c and Extended Data Fig. 5d). Even when the α-satellite HOR array is split into two arrays, such as on chromosome 19, the CDR associates with the array containing some of the most highly identical α-satellite HORs (Extended Data Fig. 5d). This suggests that the kinetochore may track with actively homogenizing α-satellite HOR sequences in response to a co-evolution between centromeric DNA and proteins⁶¹.
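
The "at least two CDRs more than 80 kb apart" criterion can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function name `has_distant_cdrs`, the interval representation and the choice to measure distance as the gap between the end of one CDR and the start of the next are all assumptions made for illustration, and the coordinates in the example are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in bp along one centromere


def has_distant_cdrs(cdrs: List[Interval], min_gap_bp: int = 80_000) -> bool:
    """Return True if any two consecutive CDRs lie more than min_gap_bp apart.

    Assumes non-overlapping intervals; the gap is measured from the end of
    one CDR to the start of the next (an illustrative choice, not taken
    from the study).
    """
    ordered = sorted(cdrs)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > min_gap_bp:
            return True
    return False


# Hypothetical coordinates: two CDRs separated by a 150-kb gap.
example = [(1_000_000, 1_050_000), (1_200_000, 1_260_000)]
print(has_distant_cdrs(example))
```

A centromere with a single CDR, or with CDRs closer together than the threshold, would not be flagged under this sketch.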

MEI investigation in many of the α-satellite HOR arrays (Methods) revealed that approximately 30% contained at least one MEI. In total, we identified 89 unique polymorphic insertions with varying allele frequencies (Supplementary Table 58), with L1HS being the most prevalent (58%), followed by Alu elements (41%) and SVAs (1%). The D2Z1 α-satellite HOR array on chromosome 2 was highly enriched with MEIs (Fig. 5d), with at least one L1HS and/or Alu insertion in 80% of haplotypes (Supplementary Fig. 65). Although L1HS insertions or duplications were the most common, occurring on average three times per array, three unique Alu insertions (two AluYb8 and one AluYa5) were also present, albeit with low allele frequency. Nearly all insertions, as well as their duplications, were located outside of the CDRs and typically towards the periphery. However, one AluYb8 insertion (NA20509 (H1)) was located between two CDRs and appeared to ‘break’ a single CDR into two, whereas a pair of L1HSs were found on either side of a CDR in two haplotypes (NA19331 (H1) and NA19650 (H1)), possibly acting as boundaries that restrict CDR and CENP-A chromatin movement, as previously suggested⁶².

Discussion

LRS and assembly have both enabled the full resolution of a human genome sequence⁴ and fundamentally deepened our understanding of human genetic diversity^1,8,13,63. The development of a human pangenome reference^1,64 ideally requires completely phased and assembled diverse genomes. Although hundreds of genomes are being assembled as part of international efforts⁶⁵, in practice few are yet truly T2T. Meanwhile, pangenome augmentation methods based on shallow long-read data have been used to capture variants with lower allele frequencies⁶⁶. Nevertheless, algorithms and technology have advanced significantly, and we have demonstrated that more than 99% of the human genome can be accurately phased and assembled by focusing on 65 diverse humans (130 haplotypes). We characterized regions previously excluded or collapsed^1,2, including centromeres, biomedically complex regions such as SMN1/SMN2, the MHC and thousands of more complex SV patterns.

Combining our assemblies with previous HPRC assemblies to create a reference set, we were able to reconstruct a genome from short reads to an average base error of about 0.00158% (quality value of 48). This process detects 26,115 SVs per genome on average from short-read sequence data and notably now recovers more rare SVs (allele frequency < 1%) than direct variant discovery from short reads. This advance was made possible by improvements in assembly quality, the larger sample size, improved versions of the Minigraph-Cactus and PanGenie applications, and the switch to the more complete T2T-CHM13 reference genome. As the number of HPRC genomes increases to several hundreds and they reach T2T status⁶⁵, genotyping accuracy will probably improve further. This, in turn, will make disease-association studies from short reads considerably more powerful for complex variation.
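
The correspondence between the base error rate and the quality value quoted above follows the standard Phred scaling, QV = −10 · log10(error rate). The Python sketch below shows the conversion in both directions; the function names are illustrative and not taken from any tool used in the study.

```python
import math


def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate (a fraction, not a percent)."""
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Per-base error rate (as a fraction) implied by a Phred-scaled quality value."""
    return 10.0 ** (-qv / 10.0)


# Base error of about 0.00158%, as reported in the text, converted to a fraction.
base_error_percent = 0.00158
qv = phred_qv(base_error_percent / 100.0)
print(round(qv))
```

On this scale, each increase of 10 in the quality value corresponds to a tenfold reduction in the per-base error rate.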

Using our assembly method, we fully assembled 1,246 centromeres, or 42% of all possible centromeres in these individuals. As expected, we observed considerable variation in the content and length of the α-satellite HOR array (up to 30-fold for chromosome 10) consistent with its higher mutation rate and more rapid evolutionary turnover^2,59. We have also documented recent Alu, L1 and SVA retrotransposition into the α-satellite HORs and showed that these may be used to tag HOR expansions on particular human haplotypes. Using the CDR^58,60 as a marker of kinetochore attachment, we have shown considerable variation in the location across human centromeres and remarkably that 7% of human chromosomes show evidence of two or more putative kinetochores (that is, di-kinetochores) in lymphoblastoid cell lines. The significance of both MEIs and di-kinetochores for chromosome segregation or missegregation will need to be experimentally assessed, and these phased genomes (and their corresponding cell lines) provide the foundation for such future work.

Finally, from a technical perspective, application of two independent assembly algorithms, hifiasm (ultra-long) and Verkko, nearly doubled the number of sequence-resolved centromeres. Although the two methods were strongly complementary for centromeres, Verkko was clearly superior for the Y chromosome (Supplementary Fig. 26c). As the performance of both Verkko and hifiasm has been shown to be very similar for large portions of the euchromatin¹⁰, there is benefit in applying both assembly algorithms to resolve the most structurally complex regions of the genome until a tool combining the strengths of both methods becomes available.

Methods

Sample selection

A total of 65 diverse humans were included in the current study. The majority of the individuals (63 of 65) originated from the 1kGP sample set¹¹, one (NA21487) from the International HapMap Project⁶⁷ and one (NA24385, also called HG002) commonly used for benchmarking by the Genome in a Bottle (GIAB) Consortium⁶⁸ was included in all analyses with publicly available data from other efforts (Supplementary Tables 1–4, 6 and 7). Individuals were selected to maximize genetic diversity and Y chromosome lineages (Supplementary Methods).