Introduction

Advancements in machine learning have made it possible to predict disease risk from large-scale multivariate health and medical data1,2,3,4. Machine learning models for disease onset prediction, especially those based on lifestyle, diet, and exercise habits, are expected to support individualized disease prevention by forecasting the development of lifestyle-related diseases, such as diabetes and hypertension, and by presenting each individual's contributing factors5. Constructing higher-performing machine learning models requires a vast amount of training data. Hence, multi-item health and medical data are accumulated worldwide from patients with chronic diseases and healthy individuals alike6,7,8.

The difficulty of data sharing and the scarcity of health and medical data make it important to apply a disease onset prediction model built on health checkup data collected at one site to data from other sites. In this case, the disease onset prediction model faces the challenge of dataset shift9,10,11, a problem in which the probability distributions of the training and test data differ (Fig. 1a), causing the test data to contain both in-distribution (ID) and out-of-distribution (OOD) samples. This distribution difference violates a core model assumption, namely that the training and test data distributions are equal, leading the model to misclassify the OOD test data. The problem arises when the data acquisition locations for training and actual testing differ9,11. Factors contributing to dataset shift include regional differences in diet, lifestyle, and exercise habits, as well as discrepancies in the measurement instruments used at various sites. Such variations based on unique regional characteristics make dataset shift difficult to avoid.

Fig. 1

Overview of this study. (a) Dataset shift. This study used health checkup data with a dataset shift from Hirosaki City in Aomori Prefecture, Japan, and Wakayama Prefecture, Japan. Owing to the dataset shift, the disease onset prediction model constructed from the Hirosaki data has lower prediction performance on the Wakayama data than on the Hirosaki data. (b) Proposed method: out-of-distribution reject option for prediction (ODROP). In the proposed method, an out-of-distribution (OOD) detection model constructed from the Hirosaki health checkup data first calculates the OOD score of each Wakayama health checkup record. The OOD score quantifies how likely a record is to be OOD. Records with an OOD score above a threshold are classified as OOD data (right side of the OOD score histogram). Finally, a disease onset prediction model constructed from the Hirosaki data makes predictions only on the in-distribution (ID) data, which are appropriate for prediction.

In prior research, most early disease onset prediction models have not addressed dataset shift. These models have been evaluated solely on internal datasets1,2,3,4 or have reported a reduction in the area under the receiver operating characteristic curve (AUROC) when applied to different sites owing to OOD data12. Some studies13,14 have attempted to enhance the robustness of disease onset predictions on external datasets by excluding OOD data from the prediction samples. They used conformal prediction15, a statistical testing method that uses features transformed by a neural network encoder to identify OOD data, and then made predictions on the ID data. As a result, the OOD detection and prediction models become integrated, limiting the ability to use pre-trained predictive models with high discriminative power. In contrast, OOD detection models that operate independently of prediction models are well established in computer vision16,17,18,19,20. However, unlike images, where human experts can annotate ID and OOD data, electronic health records and health checkup data are difficult to annotate in this way. This difference makes it difficult to directly apply computer vision OOD detection models to these types of data. Thus, methods for effectively handling OOD health and medical data arising from dataset shift remain insufficient.

The aim of this study is to explore effective methods for addressing the dataset shift problem in disease onset prediction models when the test health and medical data follow a different distribution from the training data. Our proposed approach is a two-stage predictive method called out-of-distribution reject option for prediction (ODROP, Fig. 1b), which uses an OOD detection model to reject OOD data from a test dataset. In the first stage, an OOD detection model scores the divergence between the training and test data distributions to determine whether each test record is ID or OOD. In the second stage, we include an option to withhold predictions for data identified as OOD. Our ODROP method derives from the known reject option method, which avoids class prediction when the classification confidence is within a certain range21,22. We adapt this reject option method to OOD data caused by dataset shift.

We used five OOD detection methods and two health checkup datasets with a dataset shift and evaluated the methods' effectiveness on three disease onset prediction tasks, namely diabetes, hypertension, and dyslipidemia onset within one year. Our evaluation considered three aspects: stability, the extent of improvement in the prediction performance metrics, and the proportion of rejected samples at maximum improvement. We identified the ODROP method using a variational autoencoder (VAE)23 as the optimal OOD detection model. In addition, we compared the patterns of prediction contribution (SHAP)24 values between the ID and rejected OOD data groups. We discovered, for the first time, that dataset shifts can be classified into those that considerably contribute to disease onset prediction and those that do not.

Our contributions are as follows: this study is the first to apply out-of-distribution (OOD) detection models to real-world health and medical data, demonstrating their effectiveness in detail. Additionally, our proposed ODROP method offers a solution to the dataset shift problem. It enhances the robustness of existing clinical prediction models against dataset shift without altering their prediction mechanisms, thereby providing a practical and impactful advancement in the field.

Results

Dataset shift between two health checkup datasets

Several cohort studies8,25,26 reflecting the regional characteristics of Japan have been conducted. Some of these studies include multi-item health examination data: physiological and biochemical data, such as blood and respiratory metrics; data on personal activities, such as diet, exercise habits, and daily stress; and socioeconomic data, such as educational background and work environment. In this study, we used two multi-item health checkup datasets from different regions of Japan: Hirosaki City in Aomori Prefecture8 and Wakayama Prefecture25,26. We conducted statistical tests to confirm dataset shifts between the two datasets and plotted kernel density estimates (KDE) for each item. The results are presented in Table 1 and Fig. 2. Complete summary statistics for all items from both sites and the results of the statistical tests between the two sites can be found in Supplementary Table 1. The KDE plots in Fig. 2 visualize the distribution shifts between the two health datasets. However, the overlapping regions in the distributions suggest that the Wakayama health checkup data (Wakayama data) can be divided into two groups, one of which has characteristics similar to the Hirosaki health checkup data (Hirosaki data).

Table 1 Summary statistics (mean ± std) and test p-values for main items in Hirosaki and Wakayama health checkups.
Fig. 2

Kernel density estimation plots for the main items of the Hirosaki and Wakayama health checkups.

Baseline evaluation of Hirosaki and Wakayama health checkup data

We confirmed the occurrence of the dataset shift problem by testing whether the predictive performance metrics on the Wakayama health checkup data decreased relative to those on the Hirosaki health checkup data, on which the disease onset prediction models were trained. In Fig. 3, we compared the mean receiver operating characteristic (ROC) curve from fivefold cross-validation at Hirosaki with the ROC curve for the Wakayama health checkup data. The precision-recall (PR) curves are shown in Supplementary Fig. 1. The Wakayama AUROC was lower than the Hirosaki mean AUROC in all three disease onset prediction tasks, with decreases of 0.11 for diabetes, 0.09 for dyslipidemia, and 0.02 for hypertension. Similarly, the PRAUC decreased for all tasks, by 0.116, 0.253, and 0.012 for diabetes, dyslipidemia, and hypertension, respectively. Hypertension showed the smallest decline in AUROC and PRAUC. Hereafter, the mean AUROC from fivefold cross-validation is referred to as the Hirosaki AUROC baseline, and that from the Wakayama health checkup data is referred to as the Wakayama AUROC baseline (the same applies to PRAUC).

Fig. 3

Comparison of AUROC baselines between Hirosaki and Wakayama health checkups. The results of the fivefold cross-validation ROC curves for each disease onset prediction task conducted in Hirosaki, along with their mean ± std ROC curves, compared to the ROC curve results from Wakayama health checkup data. The values in parentheses represent the AUROC values. (Left): prediction of diabetes onset within 1 year. (Center): prediction of dyslipidemia onset within 1 year. (Right): prediction of hypertension onset within 1 year.

Rejection rate evaluation

We used the rejection rate, the proportion of OOD data rejected from all test data, to evaluate ODROP on health and medical data. We assessed five OOD detection methods: VAE reconstruction loss (VAE reconstruction)27, neural network ensemble std (ensemble std)28, neural network ensemble epistemic (ensemble epistemic)28, neural network energy (energy)29, and Gaussian mixture based energy measurement (GEM)30, on the prediction of diabetes, hypertension, and dyslipidemia onset within one year. The rejection curve31 shows the prediction metric (AUROC or PRAUC, y-axis) as a function of the rejection rate (x-axis). The 0% rejection rate represents the "baseline," the prediction metric value over all the test data. The 0% rejection rate, corresponding to the absence of the ODROP method, also corresponds to the prediction metrics reported in previous studies. Increasing the rejection rate from 0% gradually excludes OOD test data; we confirmed that this exclusion led to a stepwise improvement in the model's predictive performance metrics. In addition, to evaluate the stability of the prediction metric improvement as the rejection rate increases, we computed the rank correlation coefficient between the prediction performance metric and the rejection rate. The rank correlation coefficient is positive if the ODROP method improves the prediction performance metrics from the baseline as the rejection rate increases, and the larger the coefficient, the more stable and consistent the improvement.

Internal validation using Hirosaki health checkups

For internal validation, we applied the proposed ODROP method to the Hirosaki health checkup data, which do not exhibit a dataset shift, and evaluated it using fivefold cross-validation. The results for the AUROC across the three disease onset prediction tasks are shown in Fig. 4, and the PRAUC results in Supplementary Fig. 2.

Fig. 4

AUROC-rejection rate rank correlation coefficients and AUROC-rejection curves in the Hirosaki health checkup. (a) Diabetes mellitus (DM), (b) dyslipidemia (DysL), (c) hypertension (HTN). Left bar plot: mean ± std of the rank correlation coefficient between rejection rate and AUROC. Right plot: AUROC-rejection curve, with AUROC (mean ± std) on the y-axis and rejection rate on the x-axis. In (a,c), the VAE reconstruction method showed a positive and considerable rank correlation coefficient, indicating a nearly monotonic improvement trend. The VAE reconstruction method also demonstrated the greatest improvement from the baseline AUROC at a 0% rejection rate in (a,b). In (c), it showed an improvement nearly equivalent to that of the ensemble epistemic method, which had the largest improvement range.

From the bar graphs of rank correlation coefficients in Fig. 4 (left panels), we confirmed that VAE reconstruction was positive for diabetes; energy and ensemble std were positive for dyslipidemia; and GEM, energy, ensemble std, and VAE reconstruction were positive for hypertension. From the rejection curves (Fig. 4, right panels), the methods that improved the mean AUROC from the baseline were VAE reconstruction for diabetes; ensemble epistemic, ensemble std, and VAE reconstruction for dyslipidemia; and GEM, ensemble epistemic, ensemble std, and VAE reconstruction for hypertension. This indicates that these methods effectively improve the prediction performance metrics when rejecting OOD data. The method that showed the greatest improvement in mean AUROC from baseline was VAE reconstruction for diabetes and dyslipidemia and ensemble epistemic for hypertension. The maximum mean AUROC was 0.916 (rejection rate: 24.0%) for diabetes, 0.808 (33.2%) for dyslipidemia, and 0.848 (38.4%) for hypertension. The maximum extent of AUROC improvement was 0.015 for diabetes, 0.017 for dyslipidemia, and 0.021 for hypertension. VAE reconstruction was the only method that showed a tendency toward AUROC improvement across all three disease onset prediction tasks.

External validation using Wakayama health checkups

We applied each of the five OOD detection methods, namely VAE reconstruction, ensemble epistemic, ensemble std, energy, and GEM, in the ODROP approach to the Wakayama health checkups, which exhibit a dataset shift relative to the Hirosaki health checkups. The results for the AUROC across the three disease onset prediction tasks are shown in Fig. 5, and the PRAUC results in Supplementary Fig. 3. For diabetes and dyslipidemia, the VAE reconstruction method yielded positive rank correlation coefficients for the AUROC. The ensemble epistemic and ensemble std methods were positive for hypertension. The VAE reconstruction method also demonstrated positive rank correlations for PRAUC in diabetes and hypertension, suggesting that it consistently improved the predictive performance metrics.

Fig. 5

AUROC-rejection rate rank correlation coefficients and AUROC-rejection curves in the Wakayama health checkup. (a) Diabetes mellitus (DM), (b) dyslipidemia (DysL), (c) hypertension (HTN). VAE reconstruction was the only method with a positive rank correlation coefficient in (a,b), showing a stable improvement in AUROC through the rejection curve. In (c), ensemble epistemic and ensemble std had positive coefficients, with the rejection curve confirming an upward trend in AUROC.

In Fig. 5, only the VAE reconstruction method improved the AUROC for diabetes, reaching a peak of 0.90 at a 31.1% rejection rate, a 0.10 improvement over the Wakayama baseline. For dyslipidemia, the VAE reconstruction method improved the AUROC at a lower rejection rate than ensemble epistemic, maintaining around 0.75 and peaking at 0.76. For hypertension, the neural network ensemble methods, ensemble std and ensemble epistemic, showed similar improvements in AUROC, with the VAE reconstruction method maintaining near-baseline performance. Across the three diseases, the energy method, initially developed for image-based OOD detection, did not improve the AUROC scores but progressively improved the PRAUC scores, a notable finding of this study. Additionally, the GEM method, an advanced version of the energy model, consistently underperformed the energy method in both predictive performance metrics. This indicates that advancements in image-domain methods do not always translate into improved outcomes.

These findings suggest that VAE reconstruction is the most suitable OOD detection method for the ODROP approach because of its considerable improvement in predictive performance metrics, its lower rejection rates at the point of improvement, and its stable enhancement as the rejection rate gradually increases.

Discovery of dataset shifts contributing to the disease onset prediction model by SHAP clustering

To identify the items whose dataset shift considerably impacts disease onset prediction, we used SHAP24 values, which quantitatively represent the contribution of each predictor to the model's output. Differences in the SHAP value patterns between the ID and OOD data groups can help determine which items cause considerable dataset shifts that affect disease onset prediction.

We show the clustering result using VAE reconstruction as the OOD detection method in the ODROP method for predicting diabetes onset within one year (Fig. 6a). The items were split into two clusters: one with high absolute SHAP values, notably HbA1c, and the other with lower values across the remaining items. The records for diabetes onset within one year were split into two groups based on the HbA1c SHAP values, which matched the ID and OOD data groups based on the labels assigned. The actual HbA1c values for the ID and OOD groups (Fig. 6b) reveal that the OOD group has relatively lower HbA1c levels than the ID group. Thus, the dataset shift in HbA1c is considerable for the model predicting diabetes onset within one year. The results of SHAP clustering for individuals diagnosed with dyslipidemia or hypertension within a year in the Wakayama health checkup data are provided in Supplementary Figs. 4A and B, respectively.

Fig. 6

Dataset shift in records with diabetes onset within one year for the diabetes onset prediction model. (a) SHAP clustering of records with diabetes onset within one year in the Wakayama health checkups. This figure shows a hierarchical clustering analysis using SHAP values from the one-year diabetes onset prediction model for individuals in the Wakayama health checkup data who developed diabetes within one year. A colormap represents the magnitude of the SHAP values calculated by the prediction model, with the vertical axis listing the Wakayama records of individuals who developed diabetes within one year. The horizontal axis shows the names of the examination items used in the prediction model, and the index column shows ID and OOD labels based on the VAE reconstruction loss threshold at the 31.1% rejection rate, where the AUROC was maximized in the rejection curve. (b) HbA1c levels in the one-year diabetes onset Wakayama ID and OOD data (mean ± std). HbA1c, the item with the most pronounced pattern difference between ID and OOD in the SHAP clustering, is presented as mean ± std for both the ID and OOD data.

Discussion

This study demonstrates that the proposed ODROP method can improve predictive performance metrics from the baseline in disease onset predictions across two health checkup datasets with different regional characteristics within the same country. This approach offers a viable solution to the dataset shift problem by addressing the discrepancy between predictive performance at the model training location and at the actual application site9,11. Evaluation from the three perspectives revealed that the ODROP method using VAE reconstruction as the OOD detection method was optimal. In addition, we analyzed the SHAP value patterns of the disease onset prediction model and discovered, for the first time, that datasets from different regions include dataset shifts that considerably impact disease onset prediction and those that do not.

We showed that the ODROP method could improve the prediction metrics for diabetes, dyslipidemia, and hypertension onset within one year when using the Wakayama and Hirosaki health checkup data as the test and training data, respectively. The VAE reconstruction ODROP method for diabetes prediction and the ensemble epistemic ODROP method for hypertension prediction considerably improved the AUROC scores, reaching 0.90 and 0.875, respectively. These improvements matched or exceeded the Hirosaki baseline performance. Thus, the ODROP method can adequately address the dataset shift problem in disease onset prediction within one year. These results also suggest that the Wakayama health checkup data, affected by dataset shift, contained groups both similar and dissimilar to the Hirosaki health checkup data. The ODROP method effectively isolates and predicts the similar groups, improving the prediction performance metrics. This indicates the potential effectiveness of the ODROP method in other regions where the test dataset comprises groups similar and dissimilar to the training data, providing a viable solution to the dataset shift problem in health data analytics.

Internal and external validations were conducted to explore the most appropriate OOD detection method for health and medical data using the ODROP method. Internal validation demonstrated improved predictive performance metrics for all three disease onset predictions. The VAE reconstruction ODROP method showed superior stability and magnitude of improvement in the AUROC, suggesting its effectiveness even when applied at the same location as the training dataset. In the external validation, the VAE reconstruction ODROP method uniquely and consistently improved the AUROC for diabetes and dyslipidemia onset predictions, although it only maintained the baseline AUROC for hypertension onset prediction within one year. These results suggest that VAE reconstruction is the most effective OOD detection method in the ODROP approach for health and medical data, considering its stable improvement in predictive performance metrics and considerable improvement range. As an unsupervised learning model that does not require a target variable, the VAE allows flexible application across multiple prediction tasks without retraining a neural network classifier for each task. This versatility gives the VAE an advantage over neural network classifier-based OOD detection methods (ensemble epistemic, ensemble std, energy, and GEM), enabling more efficient deployment of the ODROP approach across various predictive scenarios. Energy and GEM, initially developed for image-based OOD detection, underperformed the other methods on structured data, including health and medical data. The lack of superior results suggests that image-based OOD detection models do not always translate well to structured data. This highlights the need for new benchmarks tailored to structured datasets, particularly health and medical datasets.

The proposed method has two advantages. First, the OOD detection model operates independently of the predictive model. This allows for the straightforward addition of an OOD detection model to existing medical or clinical prediction models using structured data, facilitating improvements without modifying existing prediction models. This integration can also address dataset shift and provide more reliable prediction outcomes without altering the original models. Second, the ODROP method does not require dataset sharing between training and testing sites when constructing the OOD detection model. Previous approaches to addressing dataset shift assumed simultaneous access to training and test data32,33, a challenging requirement for health and medical data owing to privacy concerns. Thus, the ODROP method is a practical solution to address dataset shift without data sharing.

Furthermore, we compared the SHAP clustering patterns of item contributions between the ID and OOD groups in patients who developed diabetes within one year. Dataset shifts can be classified into two types: those that considerably impact predictions and those that do not. Previous studies have systematized dataset shifts starting from covariate shift10. In contrast, this study is the first to characterize dataset shifts in terms of their contribution to the prediction model. Identifying items that cause considerable dataset shifts for predictive models is crucial because such identification could lead to the standardization of measurement instruments across multiple hospital sites and to practical measures for addressing dataset shift.

One limitation of the proposed ODROP method is that it cannot provide prediction results for all test data; doing so would require predictive models optimized for the data at each testing site. Although domain adaptation and generalization techniques34,35 have been explored for constructing such models, they require retraining neural network models, which necessitates large sample sizes and data sharing across sites for fine-tuning. Thus, selecting or combining these techniques and our method in an appropriate manner is important for achieving effective prediction in clinical settings.

The development of the ODROP method employing an OOD detection model enabled reliable and accurate predictions across health and medical datasets affected by dataset shift. This study is the first to evaluate multiple OOD detection methods on health and medical data, assessing improvements in predictive performance metrics in terms of stability, magnitude, and the rejection rate across three disease onset prediction tasks. Accordingly, we demonstrated that VAE reconstruction is the optimal OOD detection method for health and medical data. Our ODROP method provides a general solution to the dataset shift problem because it enhances the robustness of existing clinical prediction models against dataset shift without modifying their prediction mechanisms.

Methods

Data

We used health checkup data from the Iwaki Health Promotion Study8 from 2005 to 2020 and the Wakayama Study25,26 from 2018 to 2019. These datasets are comprehensive, encompassing over 2000 items, including physiological and biochemical data such as blood and respiratory metrics, personal lifestyle data such as diet and stress, and socioenvironmental data such as education and employment, reflecting diverse regional characteristics within Japan. Of the 383 items common to the two datasets, we selected the 334 items that had less than 50% missing data in both datasets and data available for at least two consecutive years (Fig. 7). We conducted statistical tests between the Hirosaki and Wakayama health checkup data across the 334 items and 3 additional items representing labels indicating the onset of diabetes, dyslipidemia, and hypertension within one year. For continuous variables, we used Welch's t-test; for discrete variables, we used the χ2 test, with Fisher's exact test following Cochran's rule. This study was approved by the Hirosaki University Faculty of Medicine Ethics Committee (annual approval, latest approval number: 2023-007-1) and conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants.
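As an illustration, the per-item tests can be run with SciPy as in the following sketch. The array inputs and the simplified small-expected-count check are assumptions; only the choice of tests (Welch's t-test, χ2 test, Fisher's exact test under Cochran's rule) follows the text.

```python
# Minimal sketch of the per-item dataset shift tests (inputs assumed to be
# 1-D NumPy arrays for one item, with missing values already dropped).
import numpy as np
from scipy import stats

def shift_pvalue(hirosaki, wakayama, discrete=False):
    """Return the p-value of a two-site distribution test for one item."""
    if not discrete:
        # Welch's t-test: allows unequal variances between the two sites.
        return stats.ttest_ind(hirosaki, wakayama, equal_var=False).pvalue
    # Contingency table: rows = sites, columns = category counts.
    cats = np.union1d(hirosaki, wakayama)
    table = np.array([[np.sum(hirosaki == c) for c in cats],
                      [np.sum(wakayama == c) for c in cats]])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    # Simplified Cochran's rule: fall back to Fisher's exact test when
    # expected counts are small (shown for the 2x2 case Fisher supports).
    if (expected < 5).any() and table.shape == (2, 2):
        return stats.fisher_exact(table)[1]
    return p
```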

Fig. 7

Data preprocessing flow.

ODROP flow

The pseudocode of the ODROP method is shown in Fig. 8 (Algorithm 1). In the ODROP method, an OOD detection model and a prediction model are first trained. Subsequently, for each test data point, an OOD score is calculated using the OOD detection model. If this score is below a predefined threshold, predictions are made using the prediction model. Further details on the OOD score and the prediction model are discussed in the following sections.
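As a minimal Python rendering of this flow, the sketch below withholds predictions for records whose OOD score exceeds the threshold. The callable names (ood_score, predict_proba) and the NaN convention for rejected records are illustrative assumptions, not part of the original algorithm.

```python
import numpy as np

def odrop_predict(ood_score, predict_proba, X_test, threshold):
    """ODROP: predict only on test records whose OOD score is below threshold.

    ood_score:      callable mapping an (n, m) array to n OOD scores
    predict_proba:  callable mapping an (n, m) array to n onset probabilities
    Returns (predictions with NaN for rejected records, rejection mask).
    """
    scores = ood_score(X_test)
    rejected = scores >= threshold           # OOD: withhold prediction
    preds = np.full(len(X_test), np.nan)     # NaN marks rejected records
    if (~rejected).any():
        preds[~rejected] = predict_proba(X_test[~rejected])
    return preds, rejected
```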

Fig. 8

Pseudocode of ODROP framework.

OOD detection model

Machine learning models assume that the test data come from the same distribution as the training data and may not perform accurately on OOD test data that deviate from the training data distribution. Identifying OOD data is crucial and is referred to as OOD detection16,18. OOD detection models compute an OOD score indicating the “likelihood” that the input data is OOD. Each input datum is classified as ID if the OOD score is below a certain threshold and OOD otherwise.

OOD detection models have evolved considerably and can be categorized into generative and classification model-based approaches16,18. Traditionally, these models have been benchmarked using existing image databases manually separated into ID and OOD datasets to assess binary classification performance (OOD-AUROC, OOD-PRAUC)17,20. Recently, classification model-based approaches have been proposed in the image domain29,30,36, building on the foundations established by generative model-based methods37,38 and reflecting advances in accurately identifying OOD data. However, separating tabular data into ID and OOD subsets requires advanced domain expertise, and such benchmarks do not exist for tabular data, particularly health and medical data. In this study, we employed representative OOD detection methods: the generative model-based VAE23,27; the neural network classification model-based ensemble method28; neural network energy29; and GEM30, a method built on neural network energy that was recently proposed as an OOD detection model in the imaging field. Table 2 lists each OOD detection method, its OOD score, and the calculation method.

Table 2 OOD detection method.

The definitions of each OOD score (VAE reconstruction loss, ensemble std, ensemble epistemic, energy, and GEM scores) are as follows:

VAE reconstruction loss (VAE reconstruction) score

$${\text{Score}}_{{{\text{Reconstruction}}}} ({\mathbf{x}}) \triangleq {\mathbb{E}}_{{z\sim q^{{{\text{VAE}}}} (z|{\mathbf{x}})}} \left[ {\left\| {{\mathbf{x}} - f_{{{\text{VAE}}}} (z)} \right\|^{2} } \right]$$
(1)

where x is the m-dimensional input vector, and \(q^{\text{VAE}}\) and \(f_{\text{VAE}}\) are the encoder and decoder models obtained by training the VAE, respectively. The score is calculated by averaging over 10 samples of z drawn from the encoder model \(q^{\text{VAE}}\).
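A PyTorch-style sketch of Eq. (1) under the stated 10-sample Monte-Carlo estimate; the encoder/decoder interfaces are assumptions.

```python
import torch

@torch.no_grad()
def vae_reconstruction_score(encoder, decoder, x, n_samples=10):
    """Eq. (1): Monte-Carlo estimate of E_{z~q(z|x)}[||x - f_VAE(z)||^2].

    Assumed interfaces: encoder(x) -> (mu, logvar) of the Gaussian
    posterior q(z|x); decoder(z) -> reconstruction of x.
    x: (batch, m) tensor; returns one OOD score per record.
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    total = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_samples):
        z = mu + std * torch.randn_like(std)          # reparameterized sample
        total += ((x - decoder(z)) ** 2).sum(dim=1)   # squared L2 per record
    return total / n_samples
```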

Ensemble prediction probability standard deviation (ensemble std) score

$${\text{Score}}_{{{\text{std}}}} ({\mathbf{x}}) \triangleq \sqrt {\frac{1}{M}\sum\limits_{i = 1}^{M} {(p_{i} (} {\mathbf{x}}) - p_{{{\text{ensemble}}}} ({\mathbf{x}}))^{2} }$$
(2)
$$p_{{{\text{ensemble}}}} ({\mathbf{x}}) \triangleq \frac{1}{M}\sum\limits_{i = 1}^{M} {p_{i} } ({\mathbf{x}})$$
(3)

where M is the number of neural network ensemble models and \(p_i(\mathbf{x})\) is the prediction probability of the i-th model for the m-dimensional input vector x.
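Eqs. (2)-(3) reduce to a per-record standard deviation over the M ensemble probabilities, as in this NumPy sketch (the array layout is an assumption):

```python
import numpy as np

def ensemble_std_score(probs):
    """Eqs. (2)-(3): std of the ensemble members' predicted probabilities.

    probs: (M, n) array; probs[i, k] is model i's predicted onset
    probability for test record k. Returns one score per record.
    """
    p_ensemble = probs.mean(axis=0)                           # Eq. (3)
    return np.sqrt(((probs - p_ensemble) ** 2).mean(axis=0))  # Eq. (2)
```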

Ensemble epistemic uncertainty (ensemble epistemic) score

$${\text{Score}}_{{{\text{epistemic}}}} ({\mathbf{x}}) \triangleq u_{{{\text{total}}}} ({\mathbf{x}}) - u_{{{\text{aleatoric}}}} ({\mathbf{x}})$$
(4)
$$u_{{{\text{total}}}} ({\mathbf{x}}) \triangleq - \sum\limits_{y \in Y} {\left( {\frac{1}{M}\sum\limits_{i = 1}^{M} p (y|f_{i} ,{\mathbf{x}})} \right)} \log_{2} \left( {\frac{1}{M}\sum\limits_{i = 1}^{M} p (y|f_{i} ,{\mathbf{x}})} \right)$$
(5)
$$u_{{{\text{aleatoric}}}} ({\mathbf{x}}) \triangleq - \frac{1}{M}\sum\limits_{i = 1}^{M} {\sum\limits_{y \in Y} p } (y|f_{i} ,{\mathbf{x}})\log_{2} p(y|f_{i} ,{\mathbf{x}})$$
(6)

where x is an m-dimensional input feature vector, Y is the label space, M is the number of neural network ensemble models, and \(f_i\) represents each ensemble model.
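A NumPy sketch of Eqs. (4)-(6), assuming the per-model class probabilities are stacked into an (M, n, |Y|) array; the epsilon guard against log2(0) is an implementation assumption.

```python
import numpy as np

def ensemble_epistemic_score(probs, eps=1e-12):
    """Eq. (4): epistemic uncertainty = total entropy - aleatoric entropy.

    probs: (M, n, |Y|) array; probs[i, k, y] is model i's probability of
    class y for test record k. Returns one score per record.
    """
    p_bar = probs.mean(axis=0)                                # mean over models
    u_total = -(p_bar * np.log2(p_bar + eps)).sum(axis=-1)    # Eq. (5)
    u_aleatoric = -(probs * np.log2(probs + eps)).sum(axis=-1).mean(axis=0)  # Eq. (6)
    return u_total - u_aleatoric                              # Eq. (4)
```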

Energy score

The Helmholtz free energy in deep neural networks is given as follows:

$${\text{Score}}_{{{\text{Energy}}}} ({\mathbf{x}}) \triangleq - T\log \left( {\sum\limits_{j = 1}^{K} {\exp } \left( {\frac{{f_{j} ({\mathbf{x}})}}{T}} \right)} \right)$$
(7)

where x is the m-dimensional input feature vector, T is the temperature parameter, and K is the number of classes. The energy score can be computed easily using the LogSumExp operator. In this study, K = 2 because we used binary classification, and T = 1.
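In code, Eq. (7) is a one-liner over the class logits, for example (the logit layout is an assumption):

```python
import torch

def energy_score(logits, T=1.0):
    """Eq. (7): negative Helmholtz free energy over the K class logits.

    logits: (n, K) tensor of pre-softmax outputs f_j(x); the paper uses
    K = 2 (binary onset label) and T = 1.
    """
    return -T * torch.logsumexp(logits / T, dim=1)
```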

GEM (Gaussian mixture based energy measurement) score

$${\text{Score}}_{{{\text{GEM}}}} ({\mathbf{x}}) \triangleq - \log \left( {\sum\limits_{j = 1}^{K} {\exp } (f_{j} ({\mathbf{x}}|{{\varvec{\uptheta}}}))} \right)$$
(8)

where x is the m-dimensional input feature vector.

$$f_{j} (x|\theta ) \triangleq - \frac{1}{2}(h(x|\theta ) - \hat{\mu }_{j} )^{{\text{T}}} \hat{\Sigma }^{ - 1} (h(x|\theta ) - \hat{\mu }_{j} )$$
(9)
$$\hat{\mu }_{j} \triangleq \frac{1}{{|{\mathcal{I}}_{j} |}}\sum\limits_{{i \in {\mathcal{I}}_{j} }} h (x_{i} |{{\varvec{\uptheta}}})$$
(10)
$$\hat{\Sigma } \triangleq \frac{1}{{\sum\limits_{j = 1}^{K} | {\mathcal{I}}_{j} |}}\sum\limits_{j = 1}^{K} {\sum\limits_{{i \in {\mathcal{I}}_{j} }} {\left( {h({\mathbf{x}}_{i} |{{\varvec{\uptheta}}}) - \widehat{{{\varvec{\upmu}}}}_{j} } \right)} } \left( {h({\mathbf{x}}_{i} |{{\varvec{\uptheta}}}) - \widehat{{{\varvec{\upmu}}}}_{j} } \right)^{{\text{T}}}$$
(11)

where \(h(\mathbf{x}|\theta)\) is the feature vector computed by the neural network model f. We assume that this feature space follows a K-class conditional multivariate Gaussian distribution. The distribution parameters \((\hat{\mu}_{j}, \hat{\Sigma})\) for each class are estimated from the training data. \({\mathcal{I}}_{j}\) is the index set of training records belonging to class j, and \(|{\mathcal{I}}_{j}|\) is its number of elements.
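A NumPy sketch of Eqs. (8)-(11), assuming access to the network's feature vectors h(x|θ) for the training data and for each test record; the function names are illustrative.

```python
import numpy as np

def fit_gem(features, labels, n_classes):
    """Eqs. (10)-(11): class means and shared covariance from training features."""
    mus = np.stack([features[labels == j].mean(axis=0) for j in range(n_classes)])
    centered = np.concatenate(
        [features[labels == j] - mus[j] for j in range(n_classes)])
    sigma = centered.T @ centered / len(features)   # Eq. (11)
    return mus, np.linalg.inv(sigma)

def gem_score(feature, mus, sigma_inv):
    """Eqs. (8)-(9): negative log-sum-exp of per-class Gaussian log-densities."""
    f = np.array([-0.5 * (feature - mu) @ sigma_inv @ (feature - mu)
                  for mu in mus])                   # Eq. (9)
    m = f.max()
    return -(m + np.log(np.exp(f - m).sum()))       # numerically stable Eq. (8)
```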

We used all 334 features from the Hirosaki health checkup data to train the OOD detection models. The VAE model had a hidden layer size of 200, a latent dimension of 75, a learning rate of 1e-03, and a maximum of 400 epochs. The NN classification model had hidden layers of 200 and 50 units, a batch size of 32, a learning rate of 1e-03, a maximum of 100 epochs, and the disease onset labels within one year as the target variables. For the ensemble method, five NN classification models were trained using different seed values.
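As a sketch of the seed ensemble, the following uses scikit-learn's MLPClassifier as a stand-in for the paper's NN classifier (whose framework is not specified here); the hyperparameters follow the text, and X_train/y_train are assumed arrays.

```python
from sklearn.neural_network import MLPClassifier

def train_ensemble(X_train, y_train, n_members=5):
    """Train five identically configured NN classifiers, varying only the seed."""
    models = []
    for seed in range(n_members):
        clf = MLPClassifier(hidden_layer_sizes=(200, 50), batch_size=32,
                            learning_rate_init=1e-3, max_iter=100,
                            random_state=seed)   # only the seed differs
        models.append(clf.fit(X_train, y_train))
    return models
```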

Development of disease onset prediction models within 1 year

Disease onset within 1 year labels

Diabetes, hypertension, and dyslipidemia were selected as lifestyle-related diseases. We assigned '1' for individuals diagnosed with the specified disease within one year from the measurement year and '0' otherwise. Diagnostic criteria39,40,41,42 for determining disease onset were based on specific medical standards, as listed in Supplementary Table 2. Data with missing items were excluded to ensure accurate labeling of disease onset (Fig. 7).

Training of disease onset prediction model

We used the Hirosaki health checkup data as the training data and developed three binary classification models using XGBoost43, one for each disease onset prediction task within one year. We performed feature selection using recursive feature elimination44, narrowing all 334 features down to the 20 most relevant features for each model, given in Supplementary Table 3. The XGBoost parameters were optimized using a grid search, as shown in Supplementary Table 4. The selected parameters were the same for all disease prediction models: n_estimators: 6, min_child_weight: 1, and max_depth: 6.
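A hedged sketch of this pipeline with scikit-learn and xgboost; the grid values other than the selected ones, and the X_train/y_train arrays, are illustrative assumptions (the actual grid is in Supplementary Table 4).

```python
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Recursive feature elimination: 334 features -> the 20 most relevant.
selector = RFE(XGBClassifier(), n_features_to_select=20).fit(X_train, y_train)
X_sel = selector.transform(X_train)

# Grid search over XGBoost parameters (illustrative grid values).
grid = GridSearchCV(
    XGBClassifier(),
    param_grid={"n_estimators": [6, 50, 100],
                "min_child_weight": [1, 5],
                "max_depth": [3, 6]},
    scoring="roc_auc", cv=5,
).fit(X_sel, y_train)
model = grid.best_estimator_  # selected: n_estimators=6, min_child_weight=1, max_depth=6
```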

Evaluation of OOD detection models in ODROP method

OOD detection models calculate OOD scores, which indicate the extent to which data are OOD. Scores below a threshold are classified as ID, and those above as OOD. We used a rejection rate metric to evaluate the OOD detection model independently of the OOD score threshold. The rejection rate R measures the proportion of test data rejected (excluded from prediction) based on the OOD score:

$$R = \frac{{N_{{{\text{OOD}}}} }}{{N_{{{\text{ID}}}} + N_{{{\text{OOD}}}} }}$$
(12)

First, we gradually varied the OOD score threshold. We then constructed a rejection curve31 by plotting the rejection rate at each OOD score threshold on the horizontal axis and the corresponding prediction performance metric on the vertical axis. An upward trend in the rejection curve indicates improved prediction performance metrics on test data that include a dataset shift. In this study, we used the AUROC and PRAUC as predictive performance metrics to conduct a qualitative evaluation of the most effective OOD detection model based on the improvement range and the rejection rate at the maximum improvement observed in the rejection curve. We applied this approach to the prediction of diabetes, hypertension, and dyslipidemia onset within one year. Additionally, we quantitatively assessed the rank correlation between the rejection rate and the performance metrics using Kendall's tau rank correlation coefficient to evaluate the stability of the performance improvement as the rejection rate increases. A positive coefficient indicates a progressive improvement in predictive performance with an increasing rejection rate; higher values suggest a more stable improvement. We used a maximum rejection rate of 40% when computing the rejection curve and the rank correlation coefficient.
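A sketch of the rejection curve and its Kendall's tau, assuming NumPy arrays of true labels, predicted probabilities, and OOD scores for the test set; the step grid is an assumption.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

def rejection_curve(y_true, y_prob, ood_scores, max_rate=0.40, n_steps=40):
    """AUROC-rejection curve (rejection rate per Eq. 12) and its Kendall tau."""
    order = np.argsort(ood_scores)   # most ID first; highest OOD scores last
    n = len(y_true)
    rates, aurocs = [], []
    for rate in np.linspace(0.0, max_rate, n_steps + 1):
        keep = order[: n - int(round(rate * n))]   # reject the top OOD fraction
        if len(np.unique(y_true[keep])) < 2:       # AUROC undefined for one class
            continue
        rates.append(rate)
        aurocs.append(roc_auc_score(y_true[keep], y_prob[keep]))
    tau, _ = kendalltau(rates, aurocs)             # stability of the improvement
    return np.array(rates), np.array(aurocs), tau
```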

Discovery of dataset shift for disease onset prediction model

To identify dataset shift items important for the disease onset prediction model, we conducted hierarchical clustering using SHAP values24,45, which highlight the contribution of each item in the prediction model. Hierarchical clustering was applied to the Wakayama health checkup records of individuals who developed each disease within one year, using Ward linkage and Euclidean distance. We then created ID and OOD data labels using as the threshold the OOD score at the rejection rate that maximized the AUROC in the rejection curve.
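A sketch of this clustering step, assuming model is the fitted XGBoost predictor and X_onset holds the Wakayama records with disease onset within one year (names illustrative):

```python
import shap
from scipy.cluster.hierarchy import linkage, fcluster

# SHAP contributions of each item for each onset record: (records, items).
shap_values = shap.TreeExplainer(model).shap_values(X_onset)

# Ward linkage on Euclidean distances, over records and over items.
record_linkage = linkage(shap_values, method="ward", metric="euclidean")
item_linkage = linkage(shap_values.T, method="ward", metric="euclidean")

# Cut the record dendrogram into two groups for comparison with ID/OOD labels.
record_groups = fcluster(record_linkage, t=2, criterion="maxclust")
```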