Nat Commun. 2025 Jul 22;16(1):6726.
doi: 10.1038/s41467-025-61650-z.

Machine learning in Alzheimer's disease genetics

Matthew Bracher-Smith #, Federico Melograna #, Brittany Ulm #, Céline Bellenguez, Benjamin Grenier-Boley, Diane Duroux, Alejo J Nevado, Peter Holmans, Betty M Tijms, Marc Hulsman, Itziar de Rojas, Rafael Campos-Martin, Sven van der Lee, Atahualpa Castillo, Fahri Küçükali, Oliver Peters, Anja Schneider, Martin Dichgans, Dan Rujescu, Norbert Scherbaum, Jürgen Deckert, Steffi Riedel-Heller, Lucrezia Hausner, Laura Molina-Porcel, Emrah Düzel, Timo Grimmer, Jens Wiltfang, Stefanie Heilmann-Heimbach, Susanne Moebus, Thomas Tegos, Nikolaos Scarmeas, Oriol Dols-Icardo, Fermin Moreno, Jordi Pérez-Tur, María J Bullido, Pau Pastor, Raquel Sánchez-Valle, Victoria Álvarez, Mercè Boada, Pablo García-González, Raquel Puerta, Pablo Mir, Luis M Real, Gerard Piñol-Ripoll, Jose María García-Alberca, Eloy Rodriguez-Rodriguez, Hilkka Soininen, Sami Heikkinen, Alexandre de Mendonça, Shima Mehrabian, Latchezar Traykov, Jakub Hort, Martin Vyhnalek, Nicolai Sandau, Jesper Qvist Thomassen, Yolande A L Pijnenburg, Henne Holstege, John van Swieten, Inez Ramakers, Frans Verhey, Philip Scheltens, Caroline Graff, Goran Papenberg, Vilmantas Giedraitis, Julie Williams, Philippe Amouyel, Anne Boland, Jean-François Deleuze, Gael Nicolas, Carole Dufouil, Florence Pasquier, Olivier Hanon, Stéphanie Debette, Edna Grünblatt, Julius Popp, Roberta Ghidoni, Daniela Galimberti, Beatrice Arosio, Patrizia Mecocci, Vincenzo Solfrizzi, Lucilla Parnetti, Alessio Squassina, Lucio Tremolizzo, Barbara Borroni, Michael Wagner, Benedetta Nacmias, Marco Spallazzi, Davide Seripa, Innocenzo Rainero, Antonio Daniele, Fabrizio Piras, Carlo Masullo, Giacomina Rossi, Frank Jessen, Patrick Kehoe, Tsolaki Magda, Pascual Sánchez-Juan, Kristel Sleegers, Martin Ingelsson, Mikko Hiltunen, Rebecca Sims, Wiesje van der Flier, Ole A Andreassen, Agustín Ruiz, Alfredo Ramirez, EADB, Ruth Frikke-Schmidt, Najaf Amin, Gennady Roshchupkin, Jean-Charles Lambert, Kristel Van Steen, Cornelia van Duijn, Valentina Escott-Price

Matthew Bracher-Smith et al. Nat Commun. 2025.

Abstract

Traditional statistical approaches have advanced our understanding of the genetics of complex diseases, yet are limited to linear additive models. Here we applied machine learning (ML) to genome-wide data from 41,686 individuals in the largest European consortium on Alzheimer's disease (AD) to investigate the effectiveness of various ML algorithms in replicating known findings, discovering novel loci, and predicting individuals at risk. We utilised Gradient Boosting Machines (GBMs), biological pathway-informed Neural Networks (NNs), and Model-based Multifactor Dimensionality Reduction (MB-MDR) models. ML approaches successfully captured all genome-wide significant genetic variants identified in the training set and 22% of associations from larger meta-analyses. They highlight 6 novel loci which replicate in an external dataset, including variants which map to ARHGAP25, LY6H, COG7, SOD1 and ZNF597. They further identify a novel association in AP4E1, refining the genetic landscape of the known SPPL2A locus. Our results demonstrate that machine learning methods can achieve predictive performance comparable to classical approaches in genetic epidemiology and have the potential to uncover novel loci that remain undetected by traditional GWAS. These insights provide a complementary avenue for advancing the understanding of AD genetics.


Conflict of interest statement

Competing interests: Outside the submitted work, T.G. received consulting fees from AbbVie, Alector, Anavex, Biogen, BMS, Cogthera, Eli Lilly, Functional Neuromodulation, Grifols, Iqvia, Janssen, Noselab, Novo Nordisk, NuiCare, Orphanzyme, Roche Diagnostics, Roche Pharma, UCB, and Vivoryon; lecture fees from Biogen, Eisai, Grifols, Medical Tribune, Novo Nordisk, Roche Pharma, Schwabe, and Synlab; and has received grants to his institution from Biogen, Eisai, and Roche Diagnostics. N.A. received funding from GSK. All other authors declare no competing interests.

Figures

Fig. 1. Methods overview.
Data was separated into an initial balanced random split before model selection (cross-validation and hyperparameter tuning) in the training split (a). All models were subsequently evaluated for association (annotation, enrichment analysis, interaction testing and replication; b) and prediction (AUC and correlations; c). Interaction tests report p-values from the Wald test in logistic regression (two-sided) as standard, after correction for multiple testing. The full pipeline was run four times per model to assess robustness. For prediction, AUC values, statistical tests, and correlation analyses are based on the initial train-test split. For association, variants were prioritised if they appeared in the top SNP selection in at least two repeats.
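The prioritisation rule in the caption above (keep a variant if it appears in the top SNP selection of at least two of the four repeats) can be sketched as a simple count across repeats. This is a minimal illustration only: the function name, the variant identifiers and the example selections are hypothetical and are not taken from the paper.

```python
from collections import Counter

def prioritise_variants(top_snps_per_repeat, min_repeats=2):
    """Return variants found in the top-SNP selection of at least `min_repeats` repeats."""
    counts = Counter()
    for top_snps in top_snps_per_repeat:
        # Use a set so a variant is counted at most once per repeat
        counts.update(set(top_snps))
    return sorted(v for v, n in counts.items() if n >= min_repeats)

# Hypothetical top-SNP selections from four repeats of one model
repeats = [
    ["rs_a", "rs_b", "rs_c"],
    ["rs_a", "rs_d"],
    ["rs_b", "rs_a", "rs_e"],
    ["rs_f"],
]

prioritised = prioritise_variants(repeats)
print(prioritised)
```

Here only `rs_a` (three repeats) and `rs_b` (two repeats) pass the threshold; raising `min_repeats` makes the selection stricter.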
Fig. 2. Prediction from ML models in the test split for the most predictive models trained with the APOE region included.
The top two most predictive approaches (GBM, PRS) were not significantly different by AUC, as measured by DeLong’s test, though prediction from MB-MDRC 1d was significantly below other methods (a). Bars show the AUC from a single test split for each model, where whiskers are 95% CIs from the pROC package. Unadjusted p-values (ns: not significant, ****: p < 0.0005) are annotated on panel a for DeLong’s two-sided test for correlated ROC curves; see Supplementary Data 3 for exact values. Model predictions showed strong correlation, though correlation of the ranks is lower (b). Distributions of covariate-adjusted predictions for the most predictive approaches are similar but show a more distinct multimodal distribution for GBMs (c), NNs (d) and MB-MDRC 1d (f) compared to PRS (e), illustrating stronger influence of APOE risk alleles on predictions. Panel (g) shows the consistency across methods for individuals’ prediction scores, where the participants in the 5% extreme tails of GBM predictions are followed across predictions from NN, PRS and MB-MDRC 1d. Classification metrics are given in (h). Model predictions broken down by covariates are shown in (i) and (j), where dark blue indicates predicted cases, and light blue predicted controls. Box plots in (j) show the median (center line), the 25th and 75th percentiles (box limits), and the whiskers which extend to 1.5 times the interquartile range.
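As a reminder of what the AUC in panel (a) measures, the sketch below computes it as the proportion of case-control pairs in which the case receives the higher prediction score, with ties counted as one half. It is an illustration under stated assumptions: the labels and scores are made up, the function name is hypothetical, and the code does not reproduce the pROC confidence intervals or DeLong's test used in the paper.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    labels: 1 for cases, 0 for controls; scores: model predictions.
    Ties between a case and a control count as 0.5.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Hypothetical test-split labels and prediction scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = auc_from_scores(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in sample size, which is fine for a toy example; a rank-based formulation would be preferable for a cohort of tens of thousands of individuals.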
Fig. 3. Association in ML models.
Uniform Manifold Approximation and Projection (UMAP) of raw (unscaled) SHAP values for GBM hits highlights that APOE alleles are identified and drive prediction (a). Neural networks and gradient boosting both rank the SNPs required to derive the e2 and e4 allele status for APOE as highest, unlike traditional GWAS (b). Values for neural networks and GBMs are not based on p-values, as described below, while GWAS p-values in (b, c) are from a logistic regression in the training split, using a two-sided Wald test as standard. Manhattan plots are given for top hits only from gradient boosting (mean absolute SHAP values), neural networks (normalised network layer weights) and MB-MDRC 1d (−log10 p-values), where hits from different random splits of the models are shown in different colours, and all variants from a single GWAS on the train split are shown in greyscale for comparison (right-hand y-axis) (c). p-values for MB-MDRC 1d in (b, c) are derived from a two-sided permutation-based test as implemented in MB-MDR. Hits from machine learning models (see Table 2) are enriched for known Alzheimer’s disease processes (d).
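The two quantities plotted on the Manhattan-style panels can be illustrated with a small sketch: a per-SNP mean absolute SHAP value (averaged over individuals) and a −log10 transform of a p-value. The SHAP matrix and SNP names below are invented for illustration, and the helper name is hypothetical; in practice the SHAP values would come from an explainer applied to the fitted model.

```python
import math

def mean_abs_shap(shap_rows):
    """Per-feature mean of absolute SHAP values; rows are individuals, columns are SNPs."""
    n_rows = len(shap_rows)
    n_features = len(shap_rows[0])
    return [sum(abs(row[j]) for row in shap_rows) / n_rows for j in range(n_features)]

# Hypothetical SHAP values for four individuals and three SNPs
snps = ["snp_1", "snp_2", "snp_3"]
shap_rows = [
    [0.50, -0.10, 0.00],
    [-0.30, 0.20, 0.10],
    [0.40, -0.30, -0.10],
    [-0.40, 0.00, 0.00],
]

importance = mean_abs_shap(shap_rows)
# Rank SNPs from most to least influential on the model's predictions
ranked = sorted(zip(snps, importance), key=lambda pair: pair[1], reverse=True)

# The genome-wide significance threshold on the -log10 scale
threshold_neg_log10 = -math.log10(5e-8)

print(ranked)
print(round(threshold_neg_log10, 3))
```

Taking the absolute value before averaging matters: a SNP that pushes predictions up for some individuals and down for others would otherwise average towards zero and look unimportant.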
Fig. 4. UpSet plot showing the overlap between ML and GWAS significant findings from the train part of the train-test split.
(a) Genes mapped by the SNPs highlighted by each ML approach separately. Both ML approaches and GWAS significant SNPs were identified in the same training split. (b) Genes that are shared among at least two train-test splits. All GWAS genome-wide significant (p ≤ 5×10⁻⁸) SNPs in the train split were also identified by ML approaches. For simplicity, genes within 500 kb of a known locus or with at least one overlapping gene with the region were annotated only by the locus, including MS4A6A, CSTF1, EPHX2, CNN2, TOMM40/NECTIN2/CLPTM1/BCL3/BCAM/APOC1/APOC2/APOC4 and DGKQ/FAM53A, which were mapped to MS4A*, CASS4, CLU, ABCA7, APOE and IDUA regions, respectively. Subplots were created using ComplexUpset version 1.3.3.
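The distance part of the annotation rule in the caption above (a gene within 500 kb of a known locus is labelled by that locus) can be sketched as an interval-gap check. This is a simplified illustration: the locus names, gene names and coordinates are hypothetical, and the second criterion in the caption (sharing an overlapping gene with the region) is not modelled.

```python
WINDOW_BP = 500_000  # 500 kb window, as stated in the caption

# Hypothetical known loci: name -> (chromosome, start, end); coordinates are made up
KNOWN_LOCI = {
    "LOCUS_A": ("1", 1_000_000, 1_050_000),
    "LOCUS_B": ("2", 5_000_000, 5_200_000),
}

def annotate_gene(gene, chrom, start, end, known_loci=KNOWN_LOCI, window=WINDOW_BP):
    """Return the known-locus label if the gene lies within `window` bp of it, else the gene name."""
    for locus, (l_chrom, l_start, l_end) in known_loci.items():
        if chrom != l_chrom:
            continue
        # Distance between the two intervals; 0 when they overlap
        gap = max(l_start - end, start - l_end, 0)
        if gap <= window:
            return locus
    return gene

near = annotate_gene("GENE_X", "1", 1_400_000, 1_420_000)  # 350 kb from LOCUS_A
far = annotate_gene("GENE_Y", "1", 2_000_000, 2_010_000)   # 950 kb from LOCUS_A
other = annotate_gene("GENE_Z", "3", 1_000_000, 1_010_000)  # no known locus on this chromosome
print(near, far, other)
```

Collapsing nearby genes to one locus label keeps the UpSet plot readable, since several genes in the same region would otherwise appear as separate set members.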


