BMC Bioinformatics. 2016 Mar 8;17:120.
doi: 10.1186/s12859-016-0943-7.

Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL)


Devin C Koestler et al. BMC Bioinformatics. 2016.

Abstract

Background: Confounding due to cellular heterogeneity represents one of the foremost challenges currently facing Epigenome-Wide Association Studies (EWAS). Statistical methods that leverage the tissue-specificity of DNA methylation to deconvolute the cellular mixture of heterogeneous biospecimens offer a promising solution; however, the performance of such methods depends entirely on the library of methylation markers used for deconvolution. Here, we introduce a novel algorithm for Identifying Optimal Libraries (IDOL) that dynamically scans a candidate set of cell-specific methylation markers to find libraries that optimize the accuracy of cell fraction estimates obtained from cell mixture deconvolution.
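To make the idea of cell mixture deconvolution concrete, here is a minimal sketch that estimates cell fractions from a mixed methylation profile by non-negative least squares. The solver (projected gradient descent), the function name deconvolve, and the toy signature matrix are all assumptions chosen for illustration; the abstract does not specify the deconvolution method or its implementation.

```python
def deconvolve(signatures, mixture, n_steps=5000):
    """Estimate non-negative cell-type weights w so that signatures @ w ~ mixture.

    signatures: list of rows, one per CpG; each row holds the mean
                methylation of that CpG in every cell type.
    mixture:    methylation values of the mixed sample, one per CpG.
    Solver: projected gradient descent on 0.5 * ||S w - y||^2 with w >= 0
    (an illustrative choice, not the method used in the paper).
    """
    n_types = len(signatures[0])
    # trace(S^T S) bounds the largest eigenvalue, so 1/trace is a safe step size.
    step = 1.0 / sum(v * v for row in signatures for v in row)
    w = [1.0 / n_types] * n_types
    for _ in range(n_steps):
        # Residual of the current fit, one value per CpG.
        resid = [sum(s * wk for s, wk in zip(row, w)) - y
                 for row, y in zip(signatures, mixture)]
        # Gradient S^T (S w - y), one value per cell type.
        grad = [sum(row[k] * r for row, r in zip(signatures, resid))
                for k in range(n_types)]
        # Gradient step followed by projection onto w >= 0.
        w = [max(0.0, wk - step * g) for wk, g in zip(w, grad)]
    return w


# Hypothetical library: 4 CpGs (rows) x 3 cell types (columns).
toy_signatures = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.5],
]
true_fractions = [0.6, 0.3, 0.1]
# Noise-free mixed sample built from the toy library and fractions.
toy_mixture = [sum(s * f for s, f in zip(row, true_fractions))
               for row in toy_signatures]
estimated = deconvolve(toy_signatures, toy_mixture)
```

In this noise-free toy case the estimate recovers the fractions used to build the mixture. With real data, the quality of the estimate depends on which CpGs make up the rows of the signature matrix, which is the library-selection problem that IDOL addresses.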

Results: Application of IDOL to a training set consisting of samples with both whole-blood DNA methylation data (Illumina HumanMethylation450 BeadArray (HM450)) and flow cytometry measurements of cell composition revealed an optimized library comprising 300 CpG sites. Compared with existing libraries, the library identified by IDOL demonstrated significantly better overall discrimination of the entire immune cell landscape (p = 0.038) and improved discrimination of 14 of the 15 pairs of leukocyte subtypes. Estimates of cell composition across the samples in the training set using the IDOL library were highly correlated with their respective flow cytometry measurements, with all cell-specific R² > 0.99 and root mean square errors (RMSEs) ranging from 0.97% to 1.33% across leukocyte subtypes. Independent validation of the optimized IDOL library using two additional HM450 data sets showed similarly strong prediction performance, with all cell-specific R² > 0.90 and RMSE < 4.00%. In simulation studies, adjustments for cell composition using the IDOL library resulted in uniformly lower false positive rates compared to competing libraries, while also demonstrating an improved capacity to explain epigenome-wide variation in DNA methylation within two large publicly available HM450 data sets.
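The two accuracy metrics quoted above can be computed as in the following minimal sketch. It assumes that R² is the squared Pearson correlation between predicted and flow cytometry fractions for one cell type (the abstract does not give the definition), and the sample values are invented for illustration only; they are not data from the study.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(predicted, observed):
    """Squared Pearson correlation (assumed definition of R^2)."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov ** 2 / (var_p * var_o)


# Hypothetical fractions (%) of one cell type across six samples.
facs_percent = [55.0, 61.5, 48.2, 66.0, 58.3, 52.7]
predicted_percent = [54.1, 62.3, 49.0, 65.2, 59.5, 51.9]

cell_rmse = rmse(predicted_percent, facs_percent)
cell_r2 = r_squared(predicted_percent, facs_percent)
```

Computing both metrics per cell type matters because a high R² alone does not rule out a systematic offset between predicted and measured fractions, whereas RMSE penalizes it directly.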

Conclusions: Despite consisting of half as many CpGs as existing libraries for whole blood mixture deconvolution, the optimized IDOL library identified herein resulted in outstanding prediction performance across all considered data sets and demonstrated potential to improve the operating characteristics of EWAS involving adjustments for cell distribution. In addition to providing the EWAS community with an optimized library for whole blood mixture deconvolution, our work establishes a systematic and generalizable framework for the assembly of libraries that improve the accuracy of cell mixture deconvolution.


Figures

Fig. 1
Impact of L-DMR library on the accuracy of cell composition estimation. a, b Hierarchical clustering heat maps of the mean methylation signatures of isolated leukocyte subtypes [3] using (a) the top 600 ANOVA-ranked L-DMRs (TopANOVA library) and (b) the 600 L-DMRs that uniquely distinguish each cell type from all other cell types (EstimateCellCounts default library). Column dendrograms are colored to reflect the cell-lineage of leukocyte subtypes: lymphocytes (pink) and myeloid-derived cells (blue). c Image plot showing the difference in the dispersion separability criterion (DSC) between the EstimateCellCounts and TopANOVA libraries. For a given pair of leukocyte subtypes, larger values of DSC difference (shades of blue) indicate better discrimination associated with the EstimateCellCounts library, whereas smaller values of DSC difference (shades of red) indicate better discrimination associated with the TopANOVA library. d Scatterplots of the CMD predicted and FACS cell fractions for the n=6 AdultMixed samples. Dashed lines indicate the line of unity, dotted lines represent the fitted regression lines based on cell predictions obtained using the TopANOVA library, and solid lines represent the fitted regression lines based on cell predictions obtained using the EstimateCellCounts library. e Cell-specific prediction performance for the AdultMixed samples based on the TopANOVA and EstimateCellCounts libraries
Fig. 2
Conceptual illustration of the IDOL algorithm. a Schematic diagram showing each step of IDOL. b, c Illustration of the scheme for updating the selection probabilities of L-DMRs. d Conceptual depiction of the L-DMR selection probabilities as a function of the sequential progression of IDOL. At iteration 0, L-DMRs have an equal probability of being selected for inclusion in the randomly assembled L-DMR subset. At each sequential iteration of IDOL (i.e., moving from left to right), the selection probabilities for L-DMRs are updated in a manner proportional to their contribution to prediction performance; selection probabilities for L-DMRs that contribute favorably to prediction performance are increased (increasing shades of green), whereas the selection probabilities for those that hinder prediction performance are decreased (increasing shades of red). Upon algorithm termination, the J L-DMRs with the largest selection probabilities are taken to represent the optimal L-DMR library. e, f Plots showing mean RMSE (M̄) and coefficient of determination (R̄²), respectively, as a function of the sequential progression of the IDOL algorithm
Fig. 3
Results obtained from applying IDOL to the training set. a Stacked bar plots showing the FACS measured fractions of granulocytes (Gran), monocytes (Mono), natural killer cells (NK), B cells (Bcell), CD8T lymphocytes (CD8T), and CD4T lymphocytes (CD4T) across the 6 training samples. b Hierarchical clustering heat map of the mean methylation signature of leukocyte cell types (columns) based on the 300 optimized L-DMRs (rows) identified by IDOL. The column dendrogram is colored to reflect the cell lineage of the leukocyte subtypes, where lymphocyte-derived subtypes are colored pink and myeloid-derived cell types are colored blue. c Scatterplots of FACS measured cell fractions (x-axes) and predicted cell proportions obtained using the optimized IDOL library (y-axes). Dotted lines indicate the line of unity and colored lines represent the regression line fit to the FACS measured cell fractions and predicted cell fractions. d Overlap between IDOL and EstimateCellCounts libraries. e Image plot showing the difference in the dispersion separability criterion (DSC) between the IDOL and EstimateCellCounts libraries for discriminating specific pairs of leukocyte subtypes. For a given pair of leukocytes, larger values of DSC difference (shades of blue) indicate better discrimination associated with the IDOL library, whereas smaller values of DSC difference (shades of red) indicate better discrimination associated with the EstimateCellCounts library. f Histogram showing the results of a permutation-based testing procedure for examining the difference in the overall DSC between the IDOL and EstimateCellCounts libraries
Fig. 4
Results obtained from applying the optimal IDOL library to the testing sets. a Stacked bar plots showing the cell type fractions for each testing set sample. b Scatter plots of the true reconstructed mixture fractions (x-axes) and the predicted cell fractions obtained using the optimized IDOL library (y-axes). Circles indicate Method A samples and squares indicate Method B samples. Dotted lines indicate the line of unity and colored lines represent the regression line fit to the true reconstructed mixture fractions and predicted cell fractions. c Box plots showing the predicted cell (%) − observed cell (%) across leukocyte cell types, where blue boxes represent estimates obtained from the optimal IDOL library and red boxes represent estimates obtained from the EstimateCellCounts library. (d, top panel) Estimated false discovery rate (FDR) for a two-group comparison of DNA methylation as a function of the dissimilarity in the cellular distribution between groups (x-axes). Colored lines represent different approaches for cell composition adjustment. (d, bottom panel) Difference in the FDR between the EstimateCellCounts and IDOL libraries, where points above the dotted line indicate that the EstimateCellCounts library resulted in more false positive results compared to the IDOL library. e Mean difference in the FDR for varying sample sizes when cell mixture was adjusted using cell fraction estimates from the EstimateCellCounts and IDOL libraries. Bars represent the 95% bootstrap confidence intervals for each point estimate. Points to the right of the dotted line indicate that the EstimateCellCounts library resulted in more false positive results compared to the IDOL library
Fig. 5
Cell mixture deconvolution of the Liu and Hannum blood data sets using the IDOL and EstimateCellCounts libraries. a, b Scatter plots of the predicted cell type fractions obtained using the EstimateCellCounts library (x-axes) and the IDOL library (y-axes) for the Liu and the Hannum data sets, respectively. c, d Distribution of the difference in the R² computed from the IDOL and EstimateCellCounts libraries for the (c) Liu and (d) Hannum data sets. e, f Estimated number of additional samples needed (y-axis, left) and approximate additional cost (y-axis, right) as a function of the desired difference in DNA methylation to be detected (x-axis) when correction for cell mixture was carried out using the EstimateCellCounts library. Variance estimates were obtained from the (e) Liu and (f) Hannum data sets


References

    1. Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011;12(8):529–41. doi: 10.1038/nrg3000.
    2. Adalsteinsson BT, Gudnason H, Aspelund T, Harris TB, Launer LJ, Eiriksdottir G, Smith AV, Gudnason V. Heterogeneity in white blood cells has potential to confound DNA methylation measurements. PLoS ONE. 2012;7(10):e46705. doi: 10.1371/journal.pone.0046705.
    3. Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlén SE, Greco D, Söderhäll C, Scheynius A, Kere J. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS ONE. 2012;7(7):e41361. doi: 10.1371/journal.pone.0041361.
    4. Koestler DC, Marsit CJ, Christensen BC, Accomando W, Langevin SM, Houseman EA, Nelson HH, Karagas MR, Wiencke JK, Kelsey KT. Peripheral blood immune cell methylation profiles are associated with nonhematopoietic cancers. Cancer Epidemiol Biomarkers Prev. 2012;21(8):1293–302. doi: 10.1158/1055-9965.EPI-12-0361.
    5. Lam LL, Emberly E, Fraser HB, Neumann SM, Chen E, Miller GE, Kobor MS. Factors underlying variable DNA methylation in a human community cohort. Proc Natl Acad Sci U S A. 2012;109 Suppl 2:17253–60. doi: 10.1073/pnas.1121249109.
