Summary
Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of human references are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critical for quality control, imputation, and association analysis. In the present study, we show that single-nucleotide variants (SNVs) in regions inverted between different builds of the reference genome are often mishandled bioinformatically. Depending on the array type, SNVs are found in approximately 2–5 Mb of the genome that are inverted between reference builds. Coordinate conversions of these variants are mishandled by both the TOPMed imputation server as well as routine in-house quality control pipelines, leading to underrecognized downstream analytical consequences. Specifically, we observe that undetected allelic conversion errors for palindromic (i.e., A/T or C/G) variants in these inverted regions would destabilize the local haplotype structure, leading to loss of imputation accuracy and power in association analyses. Though only a small proportion of the genome is affected, these regions include important disease susceptibility variants that would be affected. For example, the p value of a known locus associated with prostate cancer on chromosome 10 (chr10) would drop from 2.86 × 10−7 to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We devise a straight-forward heuristic based on the popular tool, liftOver, that can easily detect and correct these variants in the inverted regions between genome builds to locally improve imputation accuracy.
Keywords: genome build, bioinformatics, reference genome, genetic associations, imputation
Genotype imputation infers genetic variation that are not experimentally observed. We identified a common informatic error that leads to poor imputation in regions that are inverted across different versions of the reference genome. We provided a heuristics to identify affected variants so that they can be imputed optimally.
Introduction
In the ensuing 13 years since the completion of the human genome project, the human reference genome assembly has undergone at least 18 major updates and numerous patches.1,2 The latest genome build from the Genome Reference Consortium (GRC), GRCh38, was released in 20131 and was most recently patched in 2019. Reference genome build and variant position are the most fundamental pieces of information that are reported in genetic association studies, providing researchers with the proper context when trying to interpret, replicate, or meta-analyze reported associations. However, at the time of this writing, over 8 years since the last major update of the human reference genome assembly, both GRCh38 and the previous reference build (GRCh37/hg19, released in 2009) continue to co-exist in literature and commonly used databases, such as the Genome Aggregation Database (gnomAD3) and the Polygenic Score Catalog.4 While there is a movement toward utilizing GRCh38 as the main reference genome assembly, many datasets remained in GRCh37/hg19 due to the wealth of information generated in this coordinate system that is utilized by numerous computational tools for downstream analysis.5 As such, there is a constant need to update the coordinate systems of these legacy datasets to ensure compatibility with data produced by the latest genotyping or sequencing platforms.
Re-mapping or re-alignment of genetic datasets to a different reference build is computationally expensive. Bioinformatic conversions between assembly builds, using tools such as liftOver from the UCSC Genome Browser,6 were developed to effortlessly harmonize different datasets and enable analyses essential for genetic discovery. In the case of liftOver, the conversion process utilizes a chain file that provides a mapping of contiguous positions from one genome build to another. Other similar tools also exist, such as CrossMap and Remap.7,8 Genotype imputation,9 an essential tool for genome-wide association studies (GWAS), requires the input data to be coded in the same genomic coordinates with identical alleles as that of the imputation reference panel, usually on the forward strand. Note that we use “forward strand” as synonymous with the “plus” strand, defined as the strand with its 5′ end at the tip of the short arm of a chromosome, which is different from the forward strand in the context of dbSNP submissions.10 To allow input genetic datasets encoded in either of the two prevailing genome builds to be imputed using the TOPMed reference panel in GRCh38, the imputation server11,12 internally converts GRCh37 into the GRCh38 coordinates before imputation can proceed.
In the current study, we observed an error in the conversion between reference genome builds by the TOPMed imputation server. This error is localized specifically to regions that are inverted between GRCh37/hg19 and GRCh38/hg38, causing input genotyped single-nucleotide variants (SNVs) to be dropped and then imputed at lower quality. We term these regions between-builds inverted sequence (BBIS) regions. However, our observed issue is beyond a potential bioinformatic error in the TOPMed imputation server. Furthermore, manual conversion of genomic coordinates to GRCh38 in the input files using tools like liftOver prior to imputation does not completely resolve this issue. Because the conversion involves inverted sequences between genome builds, the flipped alleles, particularly for palindromic SNVs (i.e., A/T transversion or C/G transversion variants), often escape detection. Imputation of such GWAS dataset would be hampered by destabilized local haplotype structure, leading to poorer quality and decreased power in association testing. To overcome this underrecognized analysis problem, we developed a heuristic built on top of liftOver. This heuristic converts the base pair (bp) immediately before and after the focal SNV to deduce whether the SNV is found in an inverted region and to suggest whether the reverse complement of the SNV alleles should be used to maintain encoding on the forward strand. We showed that our approach can detect and correct all impacted SNVs in the array platforms we tested. In an empirical analysis using prostate cancer as an example, our approach would identify important associations that would otherwise be missed due to poor imputation quality.
Material and methods
GWAS datasets and statistical analysis
We used four prostate cancer GWAS datasets to examine the effect of imputation accuracy affected by SNV sites that fall within the BBIS regions. These four datasets were ELLIPSE OncoArray (African ancestry, 4,231 cases/3,953 controls on Illumina Consortium-OncoArray, abbreviated as “ONCO-AAPC”), AAPC GWAS (4,822 cases/4,642 controls on Illumina Human1M-Duo BeadChip, abbreviated as “AAPC1M”), California and Uganda Prostate Cancer Study (1,590 cases/1,048 controls on Illumina H3Africa consortium array, abbreviated as “AAPC-H3”), and ELLIPSE OncoArray (Hispanic ethnicity, 1,192 cases/1,052 controls on Illumina Consortium-OncoArray, abbreviated as “ONCO-LAPC”). Details of the study description, data processing and quality control filtering, and models for association testing can be found in a previous publication.13 Briefly, association with prostate cancer risk was estimated in each dataset using logistic regression adjusting for study, age, and top 10 principal components. Per-allele ORs and standard errors from three AAPC association results were meta-analyzed using fixed-effect inverse-variance weighting. The analyses of these datasets for association with disease phenotype has been approved by the institutional review boards of University of Southern California.
Imputation via TOPMed imputation server
Imputation-ready GWAS datasets in either GRCh37 or GRCh38 genome build were submitted to the TOPMed imputation server (https://imputation.biodatacatalyst.nhlbi.nih.gov/) for quality control (QC) and imputation. The imputation pipeline used was michigan-imputationserver-1.5.7, the imputation software was minimac4-1.0.2, and the phasing software was eagle-2.4. Imputation results were then downloaded from the server to assess the imputation quality for common variants in the BBIS region. Imputation quality for each variant was assessed using the metric Rsq, the estimated value of the squared correlation between imputed genotypes and the true, unobserved genotypes,14 as recorded in the info files output by the imputation server.
Triple-liftOver script
Triple-liftOver is a PERL script that takes an input file in PLINK bim format and converts the genomic coordinate for three consecutive bases at each chromosomal position between genome builds using UCSC’s tool liftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver). It outputs the new positions in the destination build with a category column indicating whether a variant is found in a BBIS region (Figure 1). Variants that cannot be lifted over (in the unmapped output of liftOver) or are no longer on the same chromosome are excluded from the output file. For a focal SNV to qualify as an inverted site, the heuristic requires either the succeeding base in the old build to become the preceding base in the new build or the preceding base in the old build to become the succeeding base in the new build. The triple-liftOver program does not use any allele information, nor does it verify these alleles against the reference alleles in the source and destination genome reference builds. This procedure finds the inverted sites between source and target genome reference builds based on the relative position change between the focal SNV and its two neighboring bases.
Figure 1.
Schematic of the triple-liftOver heuristic
To illustrate how the triple-liftOver approach works, a SNV is assumed to reside at base i with reference allele T on the forward strand in the old build. Its preceding base (i − 1) is A and its succeeding base (i + 1) is C. When this segment of the sequence falls into a BBIS region, the prior forward strand would become the reverse strand in the new build. The corresponding SNV site would be at base (j) with reference allele A (complementary to T) in the new build. Its new preceding base G (j − 1) is the original succeeding base C (i + 1) on the opposite strand. Similarly, its new succeeding base T (j + 1) is the original preceding base A (i − 1) on the opposite strand. Our heuristic relies on the exact inversion of either adjacent bp to the focal position to identify SNVs found in the BBIS region. As a toy example, assume that the ATC sequence is at positions 101 to 103 in the old genome build with the SNV at position 102, and these three nucleotides become 200, 199, and 198 after being converted to the new genome build. In the new build, the bp originally preceding the SNV (A) now has a position 1-bp later than the SNV position (200 versus 199, thus becomes a succeeding bp), while the succeeding bp (C) in the old build now is 1-bp earlier than the SNV location (198 versus 199, thus becomes a preceding bp). Then, the algorithm deduces that the SNV is at an inverted site based on this behavior.
Merging SNVs inverted between genome builds into contiguous stretches
To better characterize the extent of these stretches of inverted sequence between genome builds, we merged consecutive inverted SNV sites identified by triple-liftOver on each chromosome that are within 250 kb of each other. The genomic coordinates for each stretch span the first to the last SNV. A stretch defined by a single SNV flanked by two SNVs that are not found to be in regions inverted between genome builds would be assigned a length of 1 bp. Note that the span of these stretches calculated in this manner is likely to be underestimated due to potential errors near the boundaries and would differ across array platforms due to differences in SNV content.
Trait-associated SNVs
The GRCh37 build files from GWAS catalog (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz) and the Global Biobank Engine (https://biobankengine.stanford.edu/downloads) were used to identify the inverted sites for trait-association SNVs. For GWAS catalog file, we downloaded 331 M variant-trait pairs and examined 162,001 unique single-nucleotide sites across chromosome 1 (chr1) to chr22, X and Y. For the Global Biobank Engine, there were genome-wide summary statistics for ∼3,800 traits, and we retained all SNVs with p < 1 × 10−8 for any trait. In total, there are 2,912 traits with at least 1 SNV associated with p < 1 × 10−8, for a total of ∼2.1 M SNV-trait pairs and 298,556 unique chromosomal locations.
Results
An approach to identify SNVs located in the BBIS regions
As of freeze 8 (r2) of the TOPMed imputation server (last accessed on August 31st, 2022), the server allows internal conversion of genome build so users can submit GWAS datasets for imputation without explicitly converting the coordinate from GRCh37 to GRCh38. We observed that this practice results in a number of directly genotyped SNVs in the input dataset being labeled “typed only” after the server converts the genome build, suggesting that these SNVs are not found in the TOPMed imputation reference panel despite being common in the population. This also created instances of imputed variants immediately adjacent to the “typed only” variant in the output with complementary alleles (Table 1).
Table 1.
Examples of unexpected typed-only SNVs observed from the imputation using server liftOver
Input SNV information (hg19/GRCh37) | Records in TOPMed imputation output (hg38/GRCh38) | |||||||
---|---|---|---|---|---|---|---|---|
Chr | Pos | A1 | A2 | SNPID | SNV entry | Type | REF | ALT |
1 | 145095477 | A | G | rs28549707 | chr1:120177450:G:A | typed only | – | – |
chr1:120177451:C:T | imputed | C | T | |||||
1 | 144474542 | C | T | rs10907360 | chr1:120959671:T:C | typed only | – | – |
chr1:120959672:A:G | imputed | A | G | |||||
1 | 145755813 | C | A | rs10157535 | chr1:145679248:A:C | typed only | – | – |
chr1:145679249:T:G | imputed | T | G |
These directly genotyped SNVs become “typed-only” SNVs; meanwhile, a SNV with complementary alleles is imputed at the immediate adjacent location to the “typed-only” variant.
If the genome build is updated using UCSC’s tool liftOver prior to submission to the imputation server, we found that liftOver maps these SNVs to a coordinate that is off by 1 bp compared with the mapped coordinate from the imputation server. Therefore, because of the 1-bp shift in the implemented build conversion in TOPMed server, these SNVs could not be matched to their counterparts in the reference panel and would need to be imputed back by the server.
Genome-wide examinations of this occurrence using several GWAS datasets indicated that this is not a general phenomenon across the genome. The vast majority (>99.8%) of the SNVs found on the GWAS arrays that we examined would be correctly mapped from GRCh37 to GRCh38 by the imputation server. However, we noticed when the conversion does fail, these SNVs are not randomly scattered along the genome but tend to cluster into regions (Figure S1). Since the imputed versions of these SNVs tend to have complementary alleles (e.g., rs28549707 from Table 1), we suspected that in these regions, orientations are reversed between GRCh37 and GRCh38. Specifically, the designated reference alleles for these SNVs in GRCh37 reside on the forward (plus) strands, but in GRCh38, they are on the reverse (minus) strands. Previous genome-wide alignments of GRCh38 to GRCh37 reference revealed 11 Mb (0.37% of total length) of inverted sequences. Though we could not locate a complete record of all inversions, the 10 longest inverted regions previously reported in Schneider et al.1 coincided with our mapping of SNVs erroneously converted by the imputation server. We thus termed these regions as BBIS regions.
In principle, one approach to rescue SNVs in BBIS region is to perform in-house conversion of input data from GRCh37 to GRCh38 prior to imputation. Since liftOver does not utilize the allele information, the converted dataset in GRCh38 can be submitted to TOPMed imputation server for QC checks, followed by inspection and correction of SNVs with strand flips, as needed. Although this approach corrects non-palindromic SNVs where potential strand flips are easy to detect, it does not systematically detect and correct palindromic SNVs.
Having recognized this specific issue, we devised a heuristic that we termed triple-liftOver that could identify SNVs that fall within BBIS regions. In essence, we convert the genome build using liftOver not just for the bp of the SNV of interest (i) but also for the bp before (i − 1) and after the SNV (i + 1). If the sequence is inverted around this SNV location (j) in the new build, the succeeding base will become the preceding base, and the preceding base will become the succeeding base (Figure 1). To account for the rare event that the 3-bp sequence is no longer contiguous in the new genome reference build, we lessen the constraint to flag a site as inverted if either of the neighboring bases show the orientation change.
Validation of triple-liftOver
We validated the triple-liftOver approach using three GWAS datasets for prostate cancer13 that were genotyped on three different Illumina platforms: Human1M-Duo (AAPC1M), Consortium-OncoArray (ONCO-AAPC), and H3Africa (AAPC-H3) (material and methods). Using all non-palindromic SNVs on each array platform, we aimed to identify the proportion of server-identified strand flips that would be detected using the triple-liftOver approach. Across the three Illumina arrays, our approach identified all non-palindromic SNVs (733 for Human1M-Duo, 410 for Consortium-OncoArray, and 1,325 for H3Africa) flagged by the imputation server as strand flips. Given the 100% sensitivity in identifying non-palindromic SNVs residing in the BBIS regions, we used our approach to further identify 33, 53, and 59 palindromic SNVs for the Human1M-Duo, Consortium-OncoArray, and H3Africa arrays, respectively, that would escape detection by the imputation server (Table S1). The detected palindromic and non-palindromic SNVs in BBIS regions cluster into 36, 25, and 51 stretches spanned by consecutive markers on each of Human1M-Duo, Consortium-OncoArray, and H3Africa, respectively, together covering approximately 2.7–5.4 Mb in length, consisting of 501–1,578 markers found on an array (Table S2; material and methods).
We further validated the detected BBIS-region SNVs by checking the reference allele in the human reference GRCh37 and GRCh38. If a SNV identified by triple-liftOver is not actually found in the BBIS region, we would expect either of the two alleles for the SNV in GRCh37 to match the reference allele in GRCh38. In practice, we observed that for all 1,927 unique SNV sites found in BBIS regions across the three arrays, either the reference allele or the alternative allele of the SNV in GRCh37 is complementary to the reference allele in GRCh38. We note that the Illumina array designs are biased toward non-palindromic SNVs to avoid strand confusions; hence, we detected relatively fewer palindromic SNVs compared with the non-palindromic SNVs that were localized in BBIS regions. We expect our approach to identify a greater number of palindromic SNVs that could impact downstream analyses when the data are genotyped on other array platforms without this bias.
Triple-liftOver takes a variant-centric approach: an input list of SNV coordinates is used to identify a subset exhibiting behaviors consistent with the SNV falling within the BBIS regions. To evaluate how many SNVs across the genome may fall into BBIS regions, we interrogated each biallelic SNV from IMPUTE2 1000 Genomes phase 3 legend file (https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html) with triple-liftOver. Out of 80,978,435 SNVs with unique chromosome locations in GRCh37 that can be lifted over to the same chromosome in GRCh38, our program identified 208,930 as inverted sites (0.258%) (Figure S1; Table S3 and S4). Of these 208,930 inverted sites, we found that for 99.99% of them (208,903 out of 208,930), one of the two alleles of the SNV in GRCh37 is complementary to the reference allele in GRCh38. For the remaining 27 SNVs (0.01%), neither the reference allele nor the alternative allele of the SNV is complementary to the reference allele in GRCh38, but sequence alignment search using BLAT15 of nearby sequences shows that the sequences are indeed inverted, suggesting an annotation error for the SNV allele or a sequencing error in one of the reference genome builds.
Impact of undetected SNVs in BBIS regions on genotype imputation
We next examined the impact on imputation quality due to these SNVs falling within BBIS regions. BBIS regions are sparse, and the proportion of SNVs falling in these regions are low (∼0.26%–0.37% of the human genome1); therefore, uncorrected SNVs that are inverted between genome builds do not result in an appreciable impact on the overall imputation quality (Figure S2). Instead, their impact on imputation, particularly for the common variants, is confined more locally. To better visualize the local impact on imputation accuracy, we grouped nearby (not necessarily contiguous) inverted SNVs that were detected by triple-liftOver into 18, 15, and 27 regions across the genome for each of the three arrays we examined, spanning ∼2.2–3.8 Mb and comprising 475–1,375 SNVs. We then extended each region by 500 kb to systematically examine the imputation quality locally within the BBIS region and the surrounding non-inverted regions.
We evaluated three possible approaches to prepare the input genetic data for imputation: (1) submitting the array data in GRCh37 and relying on the TOPMed imputation server to convert the coordinates into GRCh38, (2) manually converting the coordinates into GRCh38 using liftOver followed by manual correction of detectable strand issues for non-palindromic SNVs, or (3) using triple-liftOver to systematically identify SNVs falling in BBIS regions and flip the strand for all detected SNVs, including both palindromic and non-palindromic SNVs. We examined Rsq output by the imputation server to compare imputation accuracy across the three approaches.
Using the region around 46 Mb on chr10 from OncoArray (ONCO-AAPC) as an illustrative example, we describe in more detail the impact of three approaches to dealing with SNVs in BBIS regions when imputing a dataset in GRCh37. In approach 1, all genotyped SNVs (n = 69) falling within this chr10 BBIS region were dropped and then re-imputed with relatively poor Rsq (Figure 2A), as the TOPMed server genome build conversion in these regions would result in a 1-bp shift, as described earlier (Table 1). By correcting for strand flips at non-palindromic SNVs (n = 58), approach 2 improved imputation quality for directly genotyped SNVs as they became recognized by the imputation server. However, the imputation quality for other SNVs and the overall imputation quality in the region remained poor (Figure 2B). The first approach yielded a mean Rsq 0.74 compared with 0.67 in the second approach (p < 2.2 × 10−16 by Wilcoxon rank-sum test), which failed to correct palindromic SNVs (n = 12; Figure 2D). The lower Rsq overall in approach 2 suggests that the uncorrected palindromic SNVs disrupt local haplotype structures due to their genotypes being incompatible for these SNVs. In approach 3, after fixing the strand issue for these 12 SNVs as informed by triple-liftOver, the overall imputation quality is increased to mean Rsq of 0.93 across the 3,903 common variants within 1.1 Mb surrounding a BBIS region (Figure 2C).
Figure 2.
Three scenarios of imputation for the 46-Mb region (GRCh38) on chr10 in ONCO-AAPC
We compared three reasonable approaches to impute this 1.1-Mb region (chr10:45500727–46605935), where the MSMB gene (chr10:46033313–46046269) resides: (A) rely on the imputation server to convert input array data in GRCh37 to GRCh38 prior to imputation, (B) manually convert the GRCh37 coordinates to GRCh38 and fix all detectable strand flips (i.e., for all non-palindromic SNVs) prior to imputation, and (C) manually convert GRCh37 coordinates to GRCh38 and fix strands for all SNVs detected by triple-liftOver prior to imputation. For (A)–(C), focusing on the variants that passed server QC check, we merged inverted sites identified through triple-liftOver into regions if they are within 250 kb of each other and then expanded the region for another 500-kb upstream and downstream for a better overview of the region. We restricted the variants in each plot to be either the common variants (minor allele frequency [MAF] ≥ 0.01) in scenario (C) or an inverted site. In (D), we show a violin plot of Rsq distribution as a measure of imputation accuracy for each scenario. Note that due to the 1-bp shift caused by server liftOver, inverted SNV sites are considered imputed in scenario (A) but are considered genotyped in scenarios (B) and (C).
Improvement in imputation accuracy was generally observed for each of the regions we examined across the three array platforms, though the magnitude of the improvement depends on the number of palindromic and non-palindromic SNVs falling within an inverted region and how well the imputation algorithm can impute these SNVs without their actual genotypes. We also observed qualitatively similar patterns using the OncoArray dataset from Latinos (ONCO-LAPC), suggesting that the adverse effect caused by these inverted SNVs is not a phenomenon unique to the African and African American populations we examined (Figure S3–S6).
Impact of undetected SNVs in BBIS regions on downstream association testing
Despite the low prevalence of SNVs found in BBIS regions, their impact on phenotypic associations is appreciable. The BBIS region on chr10 includes the MSMB gene (chr10:46033313–46046269 on GRCh38), which contains a consistently replicated susceptibility signal, rs10993994, for prostate cancer13 (Figure 2). We tested a palindromic SNV within this locus, rs10763546, for its association with prostate cancer in 10,643 cases and 9,643 controls. This variant was directly genotyped on one of the three Illumina arrays (ONCO-AAPC) but had to be imputed on the other two array platforms. When submitted for imputation in hg19 (approach 1), rs10763546 was imputed poorly in all three array platforms (due to the 1-bp error in genome build conversion; Rsq ∼0.57–0.66 for rs10763546). As a result, we found only little evidence of association (odds ratio [OR] = 1.11, pmeta = 0.0011 after meta-analysis across ONCO-AAPC, AAPC1M, and AAPC-H3; Table S5), which would not be declared genome-wide significant in a standard GWAS. If we converted the genome build manually using liftOver prior to submission for imputation (approach 2), the palindromic SNV rs10763546 would not be detected as a strand flip, resulting in relatively poor imputation, even for a directly genotyped variant on the OncoArray (Rsq = 0.80). Association testing in ONCO-AAPC resulted in a null association for rs10763546 (OR = 1.04, p = 0.42; Table S5) and only a marginally improved meta-analysis association signal (OR = 1.10, pmeta = 0.00020; Table S5). Finally, when we applied triple-liftOver to identify and correct all SNVs requiring strand flips in this region (approach 3), the imputation accuracy for this SNV and the local region is improved (Rsq = 1 for rs10763546), resulting in a nearly genome-wide significant meta-analysis result of OR = 1.14 and p = 2.86 × 10−7. We repeated the same analysis for this SNV in ONCO-LAPC (1,192 cases and 1,052 controls). The version with fixing all strand issues (approach 3) yields an OR = 1.41 and p = 6.52 × 10−8 compared with OR = 1.19 and p = 0.045 from the server liftOver version (approach 1; Table S5). Taken together, our proof-of-principle analysis demonstrated that the ability to detect or further fine-map this associated locus would be hampered simply due to a change in reference genome build.
We also examined all of the trait-associated SNVs found in the GWAS catalog16 and the Global Biobank Engine,17 which is primarily based on UK Biobank data (material and methods). For the GWAS catalog, we examined 162,000 unique SNVs, among which 151 SNVs are found in the BBIS regions (∼0.09%, in a total of 277 variant-trait pairs; Table S6). In the Global Biobank Engine, we obtained 2.1 M SNV variant-trait pairs that would collapse into 300,000 unique SNVs, and 409 SNVs would be in the BBIS regions (∼0.14%, in a total of 3,385 variant-trait pairs; Table S7). These estimates are broadly consistent with the estimated proportion of SNVs falling into the BBIS regions (∼0.26%–0.37% of the human genome1). Functional annotation of SNVs in BBIS regions revealed 1,961 non-synonymous variants and 678 substitutions in 54 genes that were predicted to be deleterious by Sift and PolyPhen. Among non-coding SNVs, 187 variants had CADD scores greater than 20, corresponding to the top 1% most deleterious substitutions in the genome (Table S8; Figure S7). Taken together, we demonstrate that BBIS regions harbor variants of significance to multiple phenotypes and that ignoring allelic errors within these regions could lead to missed opportunities in identifying a genetic association.
Discussion
Our current investigation was motivated by an anomaly where directly genotyped common SNVs on the array appeared to be absent from the imputation reference panel using the TOPMed imputation server. Delving into this issue further revealed a systematic problem caused by genomic regions that are inverted between different builds of the human reference, which is largely overlooked in standard bioinformatic analysis. We termed these regions BBIS regions. We showed that standard QC and processing pipelines used to prepare genetic datasets for imputation without accounting for SNVs within the BBIS regions will lower the imputation quality locally. Since SNVs in BBIS regions represent a small proportion of the genome, these minor inconsistencies are often ignored and dropped from analysis. Ignoring these issues, however, could also cause functional, trait-associated variants to be undetected.
Awareness of this analytical issue and reliable detection of the affected SNVs allow pipelines to be modified to properly reflect the strand and alleles before imputation. Identifying inverted SNVs could also aid data harmonization in meta-analysis where individual results came from different genome build versions (e.g., phase3-imputed versus TOPMed-imputed results). Failing to detect allele incompatibility for non-palindromic SNVs could result in SNVs being meta-analyzed as multi-allelic variants and palindromic SNVs being meta-analyzed incorrectly.
We focused the evaluation of our tool, triple-liftOver, on the TOPMed imputation server because it is the current state-of-the-art imputation server. However, we expect our findings to be relevant for users of other imputation services18,19,20 or for those who perform custom imputations.21,22,23 In fact, given the routine use of liftOver for converting genomic coordinates, analytical intricacies due to regions inverted between genome builds have likely gone generally unrecognized. Our light-weight algorithm is built on top of liftOver and can be easily incorporated into bioinformatic pipelines already utilizing liftOver and its chain files. Several other approaches are possible, such as through comparing the differences between the source strand and the mapped strand information for each SNV after updating the genomic coordinates using remap8 (https://www.ncbi.nlm.nih.gov/genome/tools/remap).
There are a few issues that our analysis has yet to address. For one, even though the field has not completely converted to using GRCh38 as the reference, a new human genome reference T2T-CHM13 has been released.24 A detailed comparison of overlapping regions between GRCh38 and T2T-CHM13 is needed to reveal the extent of the BBIS regions between these two builds. But given a chain file between the two builds, our method should readily identify variants in the BBIS regions. Moreover, we restricted our demonstrations to SNVs, although insertions or deletions (indels) and other structural variants may also be affected. In principle, if the boundaries of the structural variants are called reliably, the same heuristic can be used to determine if the structural variants also need to be inverted onto the complementary strand, but this has not been thoroughly tested. Moreover, the changes in the human genome reference assembly across builds could be far more complex than what we have realized in some regions. Our variant-based triple-liftOver approach relies on liftOver to convert the coordinates of bp around the focal SNV site to infer whether the focal SNV is found in an inverted region. If the SNV resides in a region that has changed dramatically across genome builds, our approach may not yet be able to tackle those complicated scenarios. Finally, even though in our demonstrations we focused on the conversion between GRCh37 to GRCh38 in humans, our heuristic would be equally effective for other genome builds and in other species as long as there is a uniform way to convert coordinates across genome builds. Given the simplicity of this approach, there is little computational cost to apply triple-liftOver in standard QC pipelines.
Acknowledgment
This work was supported by research grants from the National Institutes of Health (R35GM142783 and R01HF011646 to C.W.K.C., R01CA257328, U19CA214253, and U01CA164973 to C.A.H., and K99CA246076 to L.K.). J.L.C. is supported by the Viterbi Merit Fellowship through the University of Southern California. Computation for this work was supported by University of Southern California’s Center for Advanced Research Computing (CARC; https://carc.usc.edu).
Declaration of interests
The authors declare no competing interests.
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2022.100159.
Web resources
1000 Genomes phase 3 legend file, https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html.
GWAS Catalog, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz.
Global Biobank Engine, https://biobankengine.stanford.edu/downloads.
Remap, https://www.ncbi.nlm.nih.gov/genome/tools/remap.
TOPMed Imputation Server, https://imputation.biodatacatalyst.nhlbi.nih.gov/
UCSC Genome Browser’s liftOver, https://genome.ucsc.edu/cgi-bin/hgLiftOver.
Supplemental information
We compared three reasonable approaches to imputation across each of the BBIS regions detected: (A) rely on the imputation server to convert input array data in GRCh37 to GRCh38 prior to imputation; (B) manually convert the GRCh37 coordinates to GRCh38 and fix all detectable strand flips (i.e. for all non-palindromic SNVs) prior to imputation; (C) manually convert GRCh37 coordinates to GRCh38 and fix strands for all SNVs detected by triple-liftOver prior to imputation. In (D) we show violin plot of Rsq distribution as a measure of imputation accuracy for each scenario. We restricted the variants in each plot to be either the common variants (MAF ≥ 0.01) in scenario (C) or an inverted site. Regions with only one inverted site were excluded from these region plots. Due to the one bp shift caused by server liftOver, inverted SNV sites are considered imputed in scenario (A) but are considered genotyped in scenarios (B) and (C)
Please refer to the figure legend of Figure S3 for descriptions
Please refer to the figure legend of Figure S3 for descriptions
Please refer to the figure legend of Figure S3 for descriptions
Data and code availability
The triple-liftOver script is publicly available at https://github.com/GraceSheng/triple-liftOver. Individual-level African American and Latino prostate cancer cases and control data are available through dbGaP (dbGaP: phs000306.v4.p1 and phs001391.v1.p1).
References
- 1.Schneider V.A., Graves-Lindsay T., Howe K., Bouk N., Chen H.-C., Kitts P.A., Murphy T.D., Pruitt K.D., Thibaud-Nissen F., Albracht D., et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–864. doi: 10.1101/gr.213611.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.International Human Genome Sequencing Consortium Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. doi: 10.1038/nature03001. [DOI] [PubMed] [Google Scholar]
- 3.Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141, 456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Lambert S.A., Gil L., Jupp S., Ritchie S.C., Xu Y., Buniello A., McMahon A., Abraham G., Chapman M., Parkinson H., et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat. Genet. 2021;53:420–425. doi: 10.1038/s41588-021-00783-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Guo Y., Dai Y., Yu H., Zhao S., Samuels D.C., Shyr Y. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics. 2017;109:83–90. doi: 10.1016/j.ygeno.2017.01.005. [DOI] [PubMed] [Google Scholar]
- 6.Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M., Haussler D. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Zhao H., Sun Z., Wang J., Huang H., Kocher J.-P., Wang L. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 2014;30:1006–1007. doi: 10.1093/bioinformatics/btt730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.NCBI Resource Coordinators Database resources of the national center for biotechnology information. Nucleic Acids Res. 2016;44:D7–D19. doi: 10.1093/nar/gkv1290. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Li Y., Willer C., Sanna S., Abecasis G. Genotype imputation. Annu. Rev. Genomics Hum. Genet. 2009;10:387–406. doi: 10.1146/annurev.genom.9.081307.164242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Nelson S.C., Doheny K.F., Laurie C.C., Mirel D.B. Is ‘forward’ the same aother adventures in SNP allele nomenclature. Trends Genet. 2012;28:361–363. doi: 10.1016/j.tig.2012.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Kowalski M.H., Qian H., Hou Z., Rosen J.D., Tapia A.L., Shan Y., Jain D., Argos M., Arnett D.K., et al. Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. Barsh GS. PLoS Genet. 2019;15:e1008500. doi: 10.1371/journal.pgen.1008500. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.NHLBI. Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M., et al. Sequencing of 53, 831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Conti D.V., Darst B.F., Moss L.C., Saunders E.J., Sheng X., Chou A., Schumacher F.R., Olama A.A.A., Benlloch S., Dadaev T., et al. Trans-ancestry genome-wide association meta-analysis of prostate cancer identifies new susceptibility loci and informs genetic risk prediction. Nat. Genet. 2021;53:65–75. doi: 10.1038/s41588-020-00748-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Fuchsberger C., Abecasis G.R., Hinds D.A. minimac2: faster genotype imputation. Bioinformatics. 2015;31:782–784. doi: 10.1093/bioinformatics/btu704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Kent W.J. BLAT--the BLAST-like alignment tool Genome. Res. 2002:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Buniello A., MacArthur J.A.L., Cerezo M., Harris L.W., Hayhurst J., Malangone C., McMahon A., Morales J., Mountjoy E., Sollis E., et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47:D1005–D1012. doi: 10.1093/nar/gky1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.McInnes G., Tanigawa Y., DeBoever C., Lavertu A., Olivieri J.E., Aguirre M., Rivas M.A. Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics. 2019;35:2495–2497. doi: 10.1093/bioinformatics/bty999. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Das S., Forer L., Schönherr S., Sidore C., Locke A.E., Kwong A., Vrieze S.I., Chew E.Y., Levy S., McGue M., et al. Next-generation genotype imputation service and methods. Nat. Genet. 2016;48:1284–1287. doi: 10.1038/ng.3656. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.McCarthy S., Das S., Kretzschmar W., Delaneau O., Wood A.R., Teumer A., Kang H.M., Fuchsberger C., Danecek P., Sharp K., et al. A reference panel of 64, 976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Gao Y., Zhang C., Yuan L., Ling Y., Wang X., Liu C., Pan Y., Zhang X., Ma X., Wang Y., et al. PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res. 2020;48:D971–D976. doi: 10.1093/nar/gkz829. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Xu Z.M., Rüeger S., Zwyer M., Brites D., Hiza H., Reinhard M., Rutaihwa L., Borrell S., Isihaka F., Temba H., et al. Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations. Schönhuth A. PLoS Comput. Biol. 2022;18:e1009628. doi: 10.1371/journal.pcbi.1009628. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Herzig A.F., Velo-Suárez L., Frex Consortium. Consortium F.G.R., Dina C., Redon R., Deleuze J.-F., Génin E. Can imputation in a European country be improved by local reference panels? The example of France. Genetics. 2022 doi: 10.1101/2022.02.17.480829. [DOI] [Google Scholar]
- 23.Lin M., Caberto C., Wan P., Li Y., Lum-Jones A., Tiirikainen M., Pooler L., Nakamura B., Sheng X., Porcel J., et al. Population-specific reference panels are crucial for genetic analyses: an example of the CREBRF locus in Native Hawaiians. Hum. Mol. Genet. 2020;29:2275–2284. doi: 10.1093/hmg/ddaa083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., Vollger M.R., Altemose N., Uralsky L., Gershman A., et al. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
We compared three reasonable approaches to imputation across each of the BBIS regions detected: (A) rely on the imputation server to convert input array data in GRCh37 to GRCh38 prior to imputation; (B) manually convert the GRCh37 coordinates to GRCh38 and fix all detectable strand flips (i.e. for all non-palindromic SNVs) prior to imputation; (C) manually convert GRCh37 coordinates to GRCh38 and fix strands for all SNVs detected by triple-liftOver prior to imputation. In (D) we show violin plot of Rsq distribution as a measure of imputation accuracy for each scenario. We restricted the variants in each plot to be either the common variants (MAF ≥ 0.01) in scenario (C) or an inverted site. Regions with only one inverted site were excluded from these region plots. Due to the one bp shift caused by server liftOver, inverted SNV sites are considered imputed in scenario (A) but are considered genotyped in scenarios (B) and (C)
Please refer to the figure legend of Figure S3 for descriptions
Please refer to the figure legend of Figure S3 for descriptions
Please refer to the figure legend of Figure S3 for descriptions
Data Availability Statement
The triple-liftOver script is publicly available at https://github.com/GraceSheng/triple-liftOver. Individual-level African American and Latino prostate cancer cases and control data are available through dbGaP (dbGaP: phs000306.v4.p1 and phs001391.v1.p1).