Introduction

Advancements in machine learning have made it possible to predict disease risk from large-scale multivariate health and medical data1,2,3,4. Machine learning models for disease onset prediction, especially those based on lifestyle, diet, and exercise habits, are expected to support individualized disease prevention by forecasting the development of lifestyle-related diseases, such as diabetes and hypertension, and by presenting each individual's contributing factors5. Constructing higher-performing machine learning models requires a vast amount of training data. Hence, multi-item health and medical data are accumulated worldwide from patients with chronic diseases and healthy individuals alike6,7,8.

The difficulty of data sharing and the scarcity of health and medical data make it important to apply a disease onset prediction model built on health checkup data collected at one site to data from other sites. In this case, the disease onset prediction model faces the challenge of dataset shift9,10,11, a problem in which the probability distributions of the training and test data differ (Fig. 1a), causing the test data to contain both in-distribution (ID) and out-of-distribution (OOD) samples. This distribution difference violates a core model assumption, namely that the training and test data distributions are equal, leading the model to misclassify the OOD test data. The problem arises when the data acquisition locations for training and actual testing differ9,11. Factors contributing to dataset shift include regional differences in diet, lifestyle, and exercise habits, as well as discrepancies in the measurement instruments used at various sites. Such variations based on unique regional characteristics make dataset shift difficult to avoid.

Fig. 1

Overview of this study. (a) Dataset shift. This study used health checkup data with a dataset shift from Hirosaki City in Aomori Prefecture, Japan, and Wakayama Prefecture, Japan. Owing to the dataset shift, the disease onset prediction model constructed from the Hirosaki data has lower prediction performance on the Wakayama data than on the Hirosaki data. (b) Proposed method: out-of-distribution reject option for prediction (ODROP). In the proposed method, an out-of-distribution (OOD) detection model constructed from the Hirosaki health checkup data first calculates the OOD score of each Wakayama health checkup record. The OOD score quantifies how likely a record is to be OOD. Records with an OOD score above a threshold are classified as OOD data (right side of the OOD score histogram). Finally, a disease onset prediction model constructed from the Hirosaki data makes predictions only on the in-distribution (ID) data, which are appropriate for prediction.

In prior research, most early disease onset prediction models have not addressed dataset shift. These models have been evaluated solely on internal datasets1,2,3,4 or have reported a reduction in the area under the receiver operating characteristic curve (AUROC) when applied to different sites owing to OOD data12. Some studies13,14 have attempted to enhance the robustness of disease onset predictions on external datasets by excluding OOD data from the prediction samples. They used conformal prediction15, a statistical testing method that uses features transformed by a neural network encoder to identify OOD data, and then made predictions on the ID data. As a result, the OOD detection and prediction models become integrated, limiting the ability to use pre-trained predictive models with high discriminative power. In contrast, OOD detection models that operate independently of prediction models are well established in computer vision16,17,18,19,20. However, unlike images, where human experts can annotate ID and OOD data, electronic health records and health checkup data are difficult to annotate in this way. This difference makes it difficult to directly apply computer vision OOD detection models to these types of data. Thus, methods for effectively handling OOD health and medical data arising from dataset shift remain insufficient.

The aim of this study is to explore effective methods for addressing the dataset shift problem in disease onset prediction models when the test health and medical data follow a different distribution from the training data. Our proposed approach is a two-stage predictive method called out-of-distribution reject option for prediction (ODROP, Fig. 1b), which uses an OOD detection model to reject OOD data from a test dataset. In the first stage, an OOD detection model scores the divergence between the training and test data distributions to determine whether each test record is ID or OOD. In the second stage, we include an option to withhold predictions for data identified as OOD. Our ODROP method derives from the known reject option method, which avoids class prediction when the classification confidence is within a certain range21,22. We adapt this reject option method to OOD data caused by dataset shift.

We used five OOD detection methods and two health checkup datasets with a dataset shift and evaluated the methods' effectiveness on three disease onset prediction tasks, namely diabetes, hypertension, and dyslipidemia onset within one year. Our evaluation considered three aspects: stability, the extent of improvement in the prediction performance metrics, and the proportion of rejected samples at maximum improvement. We identified the ODROP method using a variational autoencoder (VAE)23 as the optimal OOD detection model. In addition, we compared the patterns of prediction contribution (SHAP)24 values between the ID and rejected OOD data groups. We discovered, for the first time, that dataset shifts can be classified into those that considerably contribute to disease onset prediction and those that do not.

Our contributions are as follows: this study is the first to apply out-of-distribution (OOD) detection models to real-world health and medical data, demonstrating their effectiveness in detail. Additionally, our proposed ODROP method offers a solution to the dataset shift problem. It enhances the robustness of existing clinical prediction models against dataset shift without altering their prediction mechanisms, thereby providing a practical and impactful advancement in the field.

Results

Dataset shift between two health checkup datasets

Several cohort studies8,25,26 reflecting the regional characteristics of Japan have been conducted. Some of these studies include multi-item health examination data: physiological and biochemical data, such as blood and respiratory metrics; data on personal activities, such as diet, exercise habits, and daily stress; and socioeconomic data, such as educational background and work environment. In this study, we used two multi-item health checkup datasets from different regions of Japan: Hirosaki City in Aomori Prefecture8 and Wakayama Prefecture25,26. We conducted statistical tests to confirm dataset shifts between the two datasets and plotted kernel density estimates (KDE) for each item. The results are presented in Table 1 and Fig. 2. Complete summary statistics for all items from both sites and the results of the statistical tests between the two sites can be found in Supplementary Table 1. The KDE plots in Fig. 2 visualize the distribution shifts between the two health datasets. However, the overlapping regions in the distributions suggest that the Wakayama health checkup data (Wakayama data) can be divided into two groups, one of which has characteristics similar to the Hirosaki health checkup data (Hirosaki data).

Table 1 Summary statistics (mean ± std) and test p-values for main items in Hirosaki and Wakayama health checkups.
Fig. 2

Kernel density estimation plots for the main items of the Hirosaki and Wakayama health checkups.

Baseline evaluation of Hirosaki and Wakayama health checkup data

We confirmed the occurrence of the dataset shift problem by testing whether the predictive performance metrics on the Wakayama health checkup data decreased relative to those on the Hirosaki health checkup data, on which the disease onset prediction models were trained. In Fig. 3, we compared the mean receiver operating characteristic (ROC) curve from fivefold cross-validation at Hirosaki with the ROC curve for the Wakayama health checkup data. The precision-recall (PR) curves are shown in Supplementary Fig. 1. The Wakayama AUROC was lower than the Hirosaki mean AUROC in all three disease onset prediction tasks, with decreases of 0.11 for diabetes, 0.09 for dyslipidemia, and 0.02 for hypertension. Similarly, the PRAUC decreased for all tasks, by 0.116, 0.253, and 0.012 for diabetes, dyslipidemia, and hypertension, respectively. Hypertension showed the smallest decline in AUROC and PRAUC. Hereafter, the mean AUROC from fivefold cross-validation is referred to as the Hirosaki AUROC baseline, and that from the Wakayama health checkup data is referred to as the Wakayama AUROC baseline (the same applies to PRAUC).

Fig. 3

Comparison of AUROC baselines between Hirosaki and Wakayama health checkups. The results of the fivefold cross-validation ROC curves for each disease onset prediction task conducted in Hirosaki, along with their mean ± std ROC curves, compared to the ROC curve results from Wakayama health checkup data. The values in parentheses represent the AUROC values. (Left): prediction of diabetes onset within 1 year. (Center): prediction of dyslipidemia onset within 1 year. (Right): prediction of hypertension onset within 1 year.

Rejection rate evaluation

We used the rejection rate, the proportion of OOD data rejected from all test data, to evaluate ODROP on health and medical data. We assessed five OOD detection methods: VAE reconstruction loss (VAE reconstruction)27, neural network ensemble std (ensemble std)28, neural network ensemble epistemic (ensemble epistemic)28, neural network energy (energy)29, and Gaussian mixture based energy measurement (GEM)30, on the prediction of diabetes, hypertension, and dyslipidemia onset within one year. The rejection curve31 shows the prediction metric (AUROC or PRAUC, y-axis) as a function of the rejection rate (x-axis). The 0% rejection rate represents the "baseline," the prediction metric value over all the test data. The 0% rejection rate, corresponding to the absence of the ODROP method, also corresponds to the prediction metrics reported in previous studies. Increasing the rejection rate from 0% gradually excludes OOD test data; we confirmed that this exclusion led to a stepwise improvement in the model's predictive performance metrics. In addition, to evaluate the stability of the prediction metric improvement as the rejection rate increases, we computed the rank correlation coefficient between the prediction performance metric and the rejection rate. The rank correlation coefficient is positive if the ODROP method improves the prediction performance metrics from the baseline as the rejection rate increases, and the larger the coefficient, the more stable and consistent the improvement.

Internal validation using Hirosaki health checkups

For internal validation, we applied the proposed ODROP method to the Hirosaki health checkup data, which do not exhibit a dataset shift, and evaluated it using fivefold cross-validation. The results for the AUROC across the three disease onset prediction tasks are shown in Fig. 4, and the PRAUC results in Supplementary Fig. 2.

Fig. 4

AUROC-rejection rate rank correlation coefficients and AUROC-rejection curves in the Hirosaki health checkup. (a) Diabetes mellitus (DM), (b) dyslipidemia (DysL), (c) hypertension (HTN). Left bar plot: mean ± std of the rank correlation coefficient between rejection rate and AUROC. Right plot: AUROC-rejection curve, with AUROC (mean ± std) on the y-axis and rejection rate on the x-axis. In (a,c), the VAE reconstruction method showed a positive and considerable rank correlation coefficient, indicating a nearly monotonic improvement trend. The VAE reconstruction method also demonstrated the greatest improvement from the baseline AUROC at a 0% rejection rate in (a,b). In (c), it showed an improvement nearly equivalent to that of the ensemble epistemic method, which had the largest improvement range.

From the bar graphs of rank correlation coefficients in Fig. 4 (left panels), we confirmed that VAE reconstruction was positive for diabetes; energy and ensemble std were positive for dyslipidemia; and GEM, energy, ensemble std, and VAE reconstruction were positive for hypertension. From the rejection curves (Fig. 4, right panels), the methods that improved the mean AUROC from the baseline were VAE reconstruction for diabetes; ensemble epistemic, ensemble std, and VAE reconstruction for dyslipidemia; and GEM, ensemble epistemic, ensemble std, and VAE reconstruction for hypertension. This indicates that these methods effectively improve the prediction performance metrics when rejecting OOD data. The method that showed the greatest improvement in mean AUROC from baseline was VAE reconstruction for diabetes and dyslipidemia and ensemble epistemic for hypertension. The maximum mean AUROC was 0.916 (rejection rate: 24.0%) for diabetes, 0.808 (33.2%) for dyslipidemia, and 0.848 (38.4%) for hypertension. The maximum extent of AUROC improvement was 0.015 for diabetes, 0.017 for dyslipidemia, and 0.021 for hypertension. VAE reconstruction was the only method that showed a tendency toward AUROC improvement across all three disease onset prediction tasks.

External validation using Wakayama health checkups

We applied each of the five OOD detection methods, namely VAE reconstruction, ensemble epistemic, ensemble std, energy, and GEM, in the ODROP approach to the Wakayama health checkups, which exhibit a dataset shift relative to the Hirosaki health checkups. The results for the AUROC across the three disease onset prediction tasks are shown in Fig. 5, and the PRAUC results in Supplementary Fig. 3. For diabetes and dyslipidemia, the VAE reconstruction method yielded positive rank correlation coefficients for the AUROC. The ensemble epistemic and ensemble std methods were positive for hypertension. The VAE reconstruction method also demonstrated positive rank correlations for PRAUC in diabetes and hypertension, suggesting that it consistently improved the predictive performance metrics.

Fig. 5

AUROC-rejection rate rank correlation coefficients and AUROC-rejection curves in the Wakayama health checkup. (a) Diabetes mellitus (DM), (b) dyslipidemia (DysL), (c) hypertension (HTN). VAE reconstruction was the only method with a positive rank correlation coefficient in (a,b), showing a stable improvement in AUROC through the rejection curve. In (c), ensemble epistemic and ensemble std had positive coefficients, with the rejection curve confirming an upward trend in AUROC.

In Fig. 5, only the VAE reconstruction method improved the AUROC for diabetes, reaching a peak of 0.90 at a 31.1% rejection rate, a 0.10 improvement over the Wakayama baseline. For dyslipidemia, the VAE reconstruction method improved the AUROC at a lower rejection rate than ensemble epistemic, maintaining around 0.75 and peaking at 0.76. For hypertension, the neural network ensemble methods, ensemble std and ensemble epistemic, showed similar improvements in AUROC, with the VAE reconstruction method maintaining near-baseline performance. Across the three diseases, the energy method, initially developed for image-based OOD detection, did not improve the AUROC scores but progressively improved the PRAUC scores, a notable finding of this study. Additionally, the GEM method, an advanced version of the energy model, consistently underperformed the energy method in both predictive performance metrics. This indicates that advancements in image-domain methods do not always translate into improved outcomes.

These findings suggest that VAE reconstruction is the most suitable OOD detection method for the ODROP approach because of its considerable improvement in predictive performance metrics, its lower rejection rates at the point of improvement, and its stable enhancement as the rejection rate gradually increases.

Discovery of dataset shifts contributing to the disease onset prediction model by SHAP clustering

To identify the items whose dataset shift considerably impacts disease onset prediction, we used SHAP24 values, which quantitatively represent the contribution of each predictor to the model's output. Differences in the SHAP value patterns between the ID and OOD data groups can help determine which items cause considerable dataset shifts that affect disease onset prediction.

We show the clustering result using VAE reconstruction as the OOD detection method in the ODROP method for predicting diabetes onset within one year (Fig. 6a). The items were split into two clusters: one with high absolute SHAP values, notably HbA1c, and the other with lower values across the remaining items. The records for diabetes onset within one year were split into two groups based on the HbA1c SHAP values, which matched the ID and OOD data groups based on the labels assigned. The actual HbA1c values for the ID and OOD groups (Fig. 6b) reveal that the OOD group has relatively lower HbA1c levels than the ID group. Thus, the dataset shift in HbA1c is considerable for the model predicting diabetes onset within one year. The results of SHAP clustering for individuals diagnosed with dyslipidemia or hypertension within a year in the Wakayama health checkup data are provided in Supplementary Figs. 4A and B, respectively.

Fig. 6

Dataset shift in records with diabetes onset within one year for the diabetes onset prediction model. (a) SHAP clustering of records with diabetes onset within one year in the Wakayama health checkups. This figure shows a hierarchical clustering analysis using SHAP values from the one-year diabetes onset prediction model for individuals in the Wakayama health checkup data who developed diabetes within one year. A colormap represents the magnitude of the SHAP values calculated by the prediction model, with the vertical axis listing the Wakayama records of individuals who developed diabetes within one year. The horizontal axis shows the names of the examination items used in the prediction model, and the index column shows ID and OOD labels based on the VAE reconstruction loss threshold at the 31.1% rejection rate, where the AUROC was maximized in the rejection curve. (b) HbA1c levels in the one-year diabetes onset Wakayama ID and OOD data (mean ± std). HbA1c, the item with the most pronounced pattern difference between ID and OOD in the SHAP clustering, is presented as mean ± std for both the ID and OOD data.

Discussion

This study demonstrates that the proposed ODROP method can improve predictive performance metrics from the baseline in disease onset predictions across two health checkup datasets with different regional characteristics within the same country. This approach offers a viable solution to the dataset shift problem by addressing the discrepancy between predictive performance at the model training location and at the actual application site9,11. Evaluation from the three perspectives revealed that the ODROP method using VAE reconstruction as the OOD detection method was optimal. In addition, we analyzed the SHAP value patterns of the disease onset prediction model and discovered, for the first time, that datasets from different regions include dataset shifts that considerably impact disease onset prediction and those that do not.

We showed that the ODROP method could improve the prediction metrics for diabetes, dyslipidemia, and hypertension onset within one year when using the Wakayama and Hirosaki health checkup data as the test and training data, respectively. The VAE reconstruction ODROP method for diabetes prediction and the ensemble epistemic ODROP method for hypertension prediction considerably improved the AUROC scores, reaching 0.90 and 0.875, respectively. These improvements matched or exceeded the Hirosaki baseline performance. Thus, the ODROP method can adequately address the dataset shift problem in disease onset prediction within one year. These results also suggest that the Wakayama health checkup data, affected by dataset shift, contained groups both similar and dissimilar to the Hirosaki health checkup data. The ODROP method effectively isolates and predicts the similar groups, improving the prediction performance metrics. This indicates the potential effectiveness of the ODROP method in other regions where the test dataset comprises groups similar and dissimilar to the training data, providing a viable solution to the dataset shift problem in health data analytics.

Internal and external validations were conducted to explore the most appropriate OOD detection method for health and medical data using the ODROP method. Internal validation demonstrated improved predictive performance metrics for all three disease onset predictions. The VAE reconstruction ODROP method showed superior stability and magnitude of improvement in the AUROC, suggesting its effectiveness even when applied at the same location as the training dataset. In the external validation, the VAE reconstruction ODROP method uniquely and consistently improved the AUROC for diabetes and dyslipidemia onset predictions, although it only maintained the baseline AUROC for hypertension onset prediction within one year. These results suggest that VAE reconstruction is the most effective OOD detection method in the ODROP approach for health and medical data, considering its stable improvement in predictive performance metrics and considerable improvement range. As an unsupervised learning model that does not require a target variable, the VAE allows flexible application across multiple prediction tasks without retraining a neural network classifier for each task. This versatility gives the VAE an advantage over neural network classifier-based OOD detection methods (ensemble epistemic, ensemble std, energy, and GEM), enabling more efficient deployment of the ODROP approach across various predictive scenarios. Energy and GEM, initially developed for image-based OOD detection, underperformed the other methods on structured data, including health and medical data. The lack of superior results suggests that image-based OOD detection models do not always translate well to structured data. This highlights the need for new benchmarks tailored to structured datasets, particularly health and medical datasets.

The proposed method has two advantages. First, the OOD detection model operates independently of the predictive model. This allows for the straightforward addition of an OOD detection model to existing medical or clinical prediction models using structured data, facilitating improvements without modifying existing prediction models. This integration can also address dataset shift and provide more reliable prediction outcomes without altering the original models. Second, the ODROP method does not require dataset sharing between training and testing sites when constructing the OOD detection model. Previous approaches to addressing dataset shift assumed simultaneous access to training and test data32,33, a challenging requirement for health and medical data owing to privacy concerns. Thus, the ODROP method is a practical solution to address dataset shift without data sharing.

Furthermore, we compared the SHAP clustering patterns of item contributions between the ID and OOD groups in patients who developed diabetes within one year. Dataset shifts can be classified into two types: those that considerably impact predictions and those that do not. Previous studies have systematized dataset shifts starting from covariate shift10. In contrast, this study is the first to characterize dataset shifts in terms of their contribution to the prediction model. Identifying items that cause considerable dataset shifts for predictive models is crucial because such identification could lead to the standardization of measurement instruments across multiple hospital sites and to practical measures for addressing dataset shift.

One limitation of the proposed ODROP method is that it cannot provide prediction results for all test data; doing so would require predictive models optimized for the data at each testing site. Although domain adaptation and generalization techniques34,35 have been explored for constructing such models, they require retraining neural network models, which necessitates large sample sizes and data sharing across sites for fine-tuning. Thus, selecting or combining these techniques and our method in an appropriate manner is important for achieving effective prediction in clinical settings.

The development of the ODROP method employing an OOD detection model enabled reliable and accurate predictions across health and medical datasets affected by dataset shift. This study is the first to evaluate multiple OOD detection methods on health and medical data, assessing improvements in predictive performance metrics in terms of stability, magnitude, and the rejection rate across three disease onset prediction tasks. Accordingly, we demonstrated that VAE reconstruction is the optimal OOD detection method for health and medical data. Our ODROP method provides a general solution to the dataset shift problem because it enhances the robustness of existing clinical prediction models against dataset shift without modifying their prediction mechanisms.

Methods

Data

We used health checkup data from the Iwaki Health Promotion Study8 from 2005 to 2020 and the Wakayama Study25,26 from 2018 to 2019. These datasets are comprehensive, encompassing over 2000 items, including physiological and biochemical data such as blood and respiratory metrics, personal lifestyle data such as diet and stress, and socioenvironmental data such as education and employment, reflecting diverse regional characteristics within Japan. Of the 383 items common to the two datasets, we selected the 334 items that had less than 50% missing data in both datasets and data available for at least two consecutive years (Fig. 7). We conducted statistical tests between the Hirosaki and Wakayama health checkup data across the 334 items and 3 additional items representing labels indicating the onset of diabetes, dyslipidemia, and hypertension within one year. For continuous variables, we used Welch's t-test; for discrete variables, we used the χ2 test, with Fisher's exact test following Cochran's rule. This study was approved by the Hirosaki University Faculty of Medicine Ethics Committee (annual approval, latest approval number: 2023-007-1) and conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants.
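As an illustration, the per-item tests can be run with SciPy as in the following sketch. The array inputs and the simplified small-expected-count check are assumptions; only the choice of tests (Welch's t-test, χ2 test, Fisher's exact test under Cochran's rule) follows the text.

```python
# Minimal sketch of the per-item dataset shift tests (inputs assumed to be
# 1-D NumPy arrays for one item, with missing values already dropped).
import numpy as np
from scipy import stats

def shift_pvalue(hirosaki, wakayama, discrete=False):
    """Return the p-value of a two-site distribution test for one item."""
    if not discrete:
        # Welch's t-test: allows unequal variances between the two sites.
        return stats.ttest_ind(hirosaki, wakayama, equal_var=False).pvalue
    # Contingency table: rows = sites, columns = category counts.
    cats = np.union1d(hirosaki, wakayama)
    table = np.array([[np.sum(hirosaki == c) for c in cats],
                      [np.sum(wakayama == c) for c in cats]])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    # Simplified Cochran's rule: fall back to Fisher's exact test when
    # expected counts are small (shown for the 2x2 case Fisher supports).
    if (expected < 5).any() and table.shape == (2, 2):
        return stats.fisher_exact(table)[1]
    return p
```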

Fig. 7

Data preprocessing flow.

ODROP flow

The pseudocode of the ODROP method is shown in Fig. 8 (Algorithm 1). In the ODROP method, an OOD detection model and a prediction model are first trained. Subsequently, for each test data point, an OOD score is calculated using the OOD detection model. If this score is below a predefined threshold, predictions are made using the prediction model. Further details on the OOD score and the prediction model are discussed in the following sections.
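As a minimal Python rendering of this flow, the sketch below withholds predictions for records whose OOD score exceeds the threshold. The callable names (ood_score, predict_proba) and the NaN convention for rejected records are illustrative assumptions, not part of the original algorithm.

```python
import numpy as np

def odrop_predict(ood_score, predict_proba, X_test, threshold):
    """ODROP: predict only on test records whose OOD score is below threshold.

    ood_score:      callable mapping an (n, m) array to n OOD scores
    predict_proba:  callable mapping an (n, m) array to n onset probabilities
    Returns (predictions with NaN for rejected records, rejection mask).
    """
    scores = ood_score(X_test)
    rejected = scores >= threshold           # OOD: withhold prediction
    preds = np.full(len(X_test), np.nan)     # NaN marks rejected records
    if (~rejected).any():
        preds[~rejected] = predict_proba(X_test[~rejected])
    return preds, rejected
```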

Fig. 8

Pseudocode of ODROP framework.

OOD detection model

Machine learning models assume that the test data come from the same distribution as the training data and may not perform accurately on OOD test data that deviate from the training data distribution. Identifying OOD data is crucial and is referred to as OOD detection16,18. OOD detection models compute an OOD score indicating the “likelihood” that the input data is OOD. Each input datum is classified as ID if the OOD score is below a certain threshold and OOD otherwise.

OOD detection models have evolved considerably and can be categorized into generative and classification model-based approaches16,18. Traditionally, these models have been benchmarked using existing image databases manually separated into ID and OOD datasets to assess binary classification performance (OOD-AUROC, OOD-PRAUC)17,20. Recently, classification model-based approaches have been proposed in the image domain29,30,36, building on the foundations established by generative model-based methods37,38 and reflecting advances in accurately identifying OOD data. However, separating tabular data into ID and OOD subsets requires advanced domain expertise, and such benchmarks do not exist for tabular data, particularly health and medical data. In this study, we employed representative OOD detection methods: the generative model-based VAE23,27; the neural network classification model-based ensemble method28; neural network energy29; and GEM30, a method built on neural network energy that was recently proposed as an OOD detection model in the imaging field. Table 2 lists each OOD detection method, its OOD score, and the calculation method.

Table 2 OOD detection method.

The definitions of each OOD score (VAE reconstruction loss, ensemble std, ensemble epistemic, energy, and GEM scores) are as follows:

VAE reconstruction loss (VAE reconstruction) score

$${\text{Score}}_{{{\text{Reconstruction}}}} ({\mathbf{x}}) \triangleq {\mathbb{E}}_{{z\sim q^{{{\text{VAE}}}} (z|{\mathbf{x}})}} \left[ {\left\| {{\mathbf{x}} - f_{{{\text{VAE}}}} (z)} \right\|^{2} } \right]$$
(1)

where x is the m-dimensional input vector, and \(q^{\text{VAE}}\) and \(f_{\text{VAE}}\) are the encoder and decoder models obtained by training the VAE, respectively. The score is calculated by averaging over 10 samples of z drawn from the encoder model \(q^{\text{VAE}}\).
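A PyTorch-style sketch of Eq. (1) under the stated 10-sample Monte-Carlo estimate; the encoder/decoder interfaces are assumptions.

```python
import torch

@torch.no_grad()
def vae_reconstruction_score(encoder, decoder, x, n_samples=10):
    """Eq. (1): Monte-Carlo estimate of E_{z~q(z|x)}[||x - f_VAE(z)||^2].

    Assumed interfaces: encoder(x) -> (mu, logvar) of the Gaussian
    posterior q(z|x); decoder(z) -> reconstruction of x.
    x: (batch, m) tensor; returns one OOD score per record.
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    total = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_samples):
        z = mu + std * torch.randn_like(std)          # reparameterized sample
        total += ((x - decoder(z)) ** 2).sum(dim=1)   # squared L2 per record
    return total / n_samples
```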

Ensemble prediction probability standard deviation (ensemble std) score

$${\text{Score}}_{{{\text{std}}}} ({\mathbf{x}}) \triangleq \sqrt {\frac{1}{M}\sum\limits_{i = 1}^{M} {(p_{i} (} {\mathbf{x}}) - p_{{{\text{ensemble}}}} ({\mathbf{x}}))^{2} }$$
(2)
$$p_{{{\text{ensemble}}}} ({\mathbf{x}}) \triangleq \frac{1}{M}\sum\limits_{i = 1}^{M} {p_{i} } ({\mathbf{x}})$$
(3)

where M is the number of neural network ensemble models and \(p_i(\mathbf{x})\) is the prediction probability of the i-th model for the m-dimensional input vector x.
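Eqs. (2)-(3) reduce to a per-record standard deviation over the M ensemble probabilities, as in this NumPy sketch (the array layout is an assumption):

```python
import numpy as np

def ensemble_std_score(probs):
    """Eqs. (2)-(3): std of the ensemble members' predicted probabilities.

    probs: (M, n) array; probs[i, k] is model i's predicted onset
    probability for test record k. Returns one score per record.
    """
    p_ensemble = probs.mean(axis=0)                           # Eq. (3)
    return np.sqrt(((probs - p_ensemble) ** 2).mean(axis=0))  # Eq. (2)
```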

Ensemble epistemic uncertainty (ensemble epistemic) score

$${\text{Score}}_{{{\text{epistemic}}}} ({\mathbf{x}}) \triangleq u_{{{\text{total}}}} ({\mathbf{x}}) - u_{{{\text{aleatoric}}}} ({\mathbf{x}})$$
(4)
$$u_{{{\text{total}}}} ({\mathbf{x}}) \triangleq - \sum\limits_{y \in Y} {\left( {\frac{1}{M}\sum\limits_{i = 1}^{M} p (y|f_{i} ,{\mathbf{x}})} \right)} \log_{2} \left( {\frac{1}{M}\sum\limits_{i = 1}^{M} p (y|f_{i} ,{\mathbf{x}})} \right)$$
(5)
$$u_{{{\text{aleatoric}}}} ({\mathbf{x}}) \triangleq - \frac{1}{M}\sum\limits_{i = 1}^{M} {\sum\limits_{y \in Y} p } (y|f_{i} ,{\mathbf{x}})\log_{2} p(y|f_{i} ,{\mathbf{x}})$$
(6)

where x is an m-dimensional input feature vector, Y is the label space, M is the number of neural network ensemble models, and \(f_i\) represents each ensemble model.
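A NumPy sketch of Eqs. (4)-(6), assuming the per-model class probabilities are stacked into an (M, n, |Y|) array; the epsilon guard against log2(0) is an implementation assumption.

```python
import numpy as np

def ensemble_epistemic_score(probs, eps=1e-12):
    """Eq. (4): epistemic uncertainty = total entropy - aleatoric entropy.

    probs: (M, n, |Y|) array; probs[i, k, y] is model i's probability of
    class y for test record k. Returns one score per record.
    """
    p_bar = probs.mean(axis=0)                                # mean over models
    u_total = -(p_bar * np.log2(p_bar + eps)).sum(axis=-1)    # Eq. (5)
    u_aleatoric = -(probs * np.log2(probs + eps)).sum(axis=-1).mean(axis=0)  # Eq. (6)
    return u_total - u_aleatoric                              # Eq. (4)
```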

Energy score

The Helmholtz free energy in deep neural networks is given as follows:

$${\text{Score}}_{{{\text{Energy}}}} ({\mathbf{x}}) \triangleq - T\log \left( {\sum\limits_{j = 1}^{K} {\exp } \left( {\frac{{f_{j} ({\mathbf{x}})}}{T}} \right)} \right)$$
(7)

where x is the m-dimensional input feature vector, T is the temperature parameter, and K is the number of classes. The energy score can be computed easily using the LogSumExp operator. In this study, K = 2 because we used binary classification, and T = 1.
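In code, Eq. (7) is a one-liner over the class logits, for example (the logit layout is an assumption):

```python
import torch

def energy_score(logits, T=1.0):
    """Eq. (7): negative Helmholtz free energy over the K class logits.

    logits: (n, K) tensor of pre-softmax outputs f_j(x); the paper uses
    K = 2 (binary onset label) and T = 1.
    """
    return -T * torch.logsumexp(logits / T, dim=1)
```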

GEM (Gaussian mixture based energy measurement) score

$${\text{Score}}_{{{\text{GEM}}}} ({\mathbf{x}}) \triangleq - \log \left( {\sum\limits_{j = 1}^{K} {\exp } (f_{j} ({\mathbf{x}}|{{\varvec{\uptheta}}}))} \right)$$
(8)

where x is the m-dimensional input feature vector.

$$f_{j} (x|\theta ) \triangleq - \frac{1}{2}(h(x|\theta ) - \hat{\mu }_{j} )^{{\text{T}}} \hat{\Sigma }^{ - 1} (h(x|\theta ) - \hat{\mu }_{j} )$$
(9)
$$\hat{\mu }_{j} \triangleq \frac{1}{{|{\mathcal{I}}_{j} |}}\sum\limits_{{i \in {\mathcal{I}}_{j} }} h (x_{i} |{{\varvec{\uptheta}}})$$
(10)
$$\hat{\Sigma } \triangleq \frac{1}{{\sum\limits_{j = 1}^{K} | {\mathcal{I}}_{j} |}}\sum\limits_{j = 1}^{K} {\sum\limits_{{i \in {\mathcal{I}}_{j} }} {\left( {h({\mathbf{x}}_{i} |{{\varvec{\uptheta}}}) - \widehat{{{\varvec{\upmu}}}}_{j} } \right)} } \left( {h({\mathbf{x}}_{i} |{{\varvec{\uptheta}}}) - \widehat{{{\varvec{\upmu}}}}_{j} } \right)^{{\text{T}}}$$
(11)

where \(h(\mathbf{x}|\theta)\) is the feature vector computed by the neural network model f. We assume that this feature space follows a K-class conditional multivariate Gaussian distribution. The distribution parameters \((\hat{\mu}_{j}, \hat{\Sigma})\) for each class are estimated from the training data. \({\mathcal{I}}_{j}\) is the index set of training records belonging to class j, and \(|{\mathcal{I}}_{j}|\) is its number of elements.
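A NumPy sketch of Eqs. (8)-(11), assuming access to the network's feature vectors h(x|θ) for the training data and for each test record; the function names are illustrative.

```python
import numpy as np

def fit_gem(features, labels, n_classes):
    """Eqs. (10)-(11): class means and shared covariance from training features."""
    mus = np.stack([features[labels == j].mean(axis=0) for j in range(n_classes)])
    centered = np.concatenate(
        [features[labels == j] - mus[j] for j in range(n_classes)])
    sigma = centered.T @ centered / len(features)   # Eq. (11)
    return mus, np.linalg.inv(sigma)

def gem_score(feature, mus, sigma_inv):
    """Eqs. (8)-(9): negative log-sum-exp of per-class Gaussian log-densities."""
    f = np.array([-0.5 * (feature - mu) @ sigma_inv @ (feature - mu)
                  for mu in mus])                   # Eq. (9)
    m = f.max()
    return -(m + np.log(np.exp(f - m).sum()))       # numerically stable Eq. (8)
```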

We used all 334 features from the Hirosaki health checkup data to train the OOD detection models. The VAE model had a hidden layer size of 200, a latent dimension of 75, a learning rate of 1e-03, and a maximum of 400 epochs. The NN classification model had hidden layers of 200 and 50 units, a batch size of 32, a learning rate of 1e-03, a maximum of 100 epochs, and the disease onset labels within one year as the target variables. For the ensemble method, five NN classification models were trained using different seed values.
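As a sketch of the seed ensemble, the following uses scikit-learn's MLPClassifier as a stand-in for the paper's NN classifier (whose framework is not specified here); the hyperparameters follow the text, and X_train/y_train are assumed arrays.

```python
from sklearn.neural_network import MLPClassifier

def train_ensemble(X_train, y_train, n_members=5):
    """Train five identically configured NN classifiers, varying only the seed."""
    models = []
    for seed in range(n_members):
        clf = MLPClassifier(hidden_layer_sizes=(200, 50), batch_size=32,
                            learning_rate_init=1e-3, max_iter=100,
                            random_state=seed)   # only the seed differs
        models.append(clf.fit(X_train, y_train))
    return models
```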

Development of disease onset prediction models within 1 year

Disease onset within 1 year labels

Diabetes, hypertension, and dyslipidemia were selected as lifestyle-related diseases. We assigned '1' for individuals diagnosed with the specified disease within one year from the measurement year and '0' otherwise. Diagnostic criteria39,40,41,42 for determining disease onset were based on specific medical standards, as listed in Supplementary Table 2. Data with missing items were excluded to ensure accurate labeling of disease onset (Fig. 7).

Training of disease onset prediction model

We used the Hirosaki health checkup data as the training data and developed three binary classification models using XGBoost43, one for each disease onset prediction task within one year. We performed feature selection using recursive feature elimination44, narrowing all 334 features down to the 20 most relevant features for each model, given in Supplementary Table 3. The XGBoost parameters were optimized using a grid search, as shown in Supplementary Table 4. The selected parameters were the same for all disease prediction models: n_estimators: 6, min_child_weight: 1, and max_depth: 6.
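A hedged sketch of this pipeline with scikit-learn and xgboost; the grid values other than the selected ones, and the X_train/y_train arrays, are illustrative assumptions (the actual grid is in Supplementary Table 4).

```python
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Recursive feature elimination: 334 features -> the 20 most relevant.
selector = RFE(XGBClassifier(), n_features_to_select=20).fit(X_train, y_train)
X_sel = selector.transform(X_train)

# Grid search over XGBoost parameters (illustrative grid values).
grid = GridSearchCV(
    XGBClassifier(),
    param_grid={"n_estimators": [6, 50, 100],
                "min_child_weight": [1, 5],
                "max_depth": [3, 6]},
    scoring="roc_auc", cv=5,
).fit(X_sel, y_train)
model = grid.best_estimator_  # selected: n_estimators=6, min_child_weight=1, max_depth=6
```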

Evaluation of OOD detection models in ODROP method

OOD detection models calculate OOD scores, which indicate the extent to which data are OOD. Scores below a threshold are classified as ID, and those above as OOD. We used a rejection rate metric to evaluate the OOD detection model independently of the OOD score threshold. The rejection rate R measures the proportion of test data rejected (excluded from prediction) based on the OOD score:

$$R = \frac{{N_{{{\text{OOD}}}} }}{{N_{{{\text{ID}}}} + N_{{{\text{OOD}}}} }}$$
(12)

First, we gradually varied the OOD score threshold. We then constructed a rejection curve31 by plotting the rejection rate at each OOD score threshold on the horizontal axis and the corresponding prediction performance metric on the vertical axis. An upward trend in the rejection curve indicates improved prediction performance metrics on test data that include a dataset shift. In this study, we used the AUROC and PRAUC as predictive performance metrics to conduct a qualitative evaluation of the most effective OOD detection model based on the improvement range and the rejection rate at the maximum improvement observed in the rejection curve. We applied this approach to the prediction of diabetes, hypertension, and dyslipidemia onset within one year. Additionally, we quantitatively assessed the rank correlation between the rejection rate and the performance metrics using Kendall's tau rank correlation coefficient to evaluate the stability of the performance improvement as the rejection rate increases. A positive coefficient indicates a progressive improvement in predictive performance with an increasing rejection rate; higher values suggest a more stable improvement. We used a maximum rejection rate of 40% when computing the rejection curve and the rank correlation coefficient.
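A sketch of the rejection curve and its Kendall's tau, assuming NumPy arrays of true labels, predicted probabilities, and OOD scores for the test set; the step grid is an assumption.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

def rejection_curve(y_true, y_prob, ood_scores, max_rate=0.40, n_steps=40):
    """AUROC-rejection curve (rejection rate per Eq. 12) and its Kendall tau."""
    order = np.argsort(ood_scores)   # most ID first; highest OOD scores last
    n = len(y_true)
    rates, aurocs = [], []
    for rate in np.linspace(0.0, max_rate, n_steps + 1):
        keep = order[: n - int(round(rate * n))]   # reject the top OOD fraction
        if len(np.unique(y_true[keep])) < 2:       # AUROC undefined for one class
            continue
        rates.append(rate)
        aurocs.append(roc_auc_score(y_true[keep], y_prob[keep]))
    tau, _ = kendalltau(rates, aurocs)             # stability of the improvement
    return np.array(rates), np.array(aurocs), tau
```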

Discovery of dataset shift for disease onset prediction model

To identify dataset shift items important for the disease onset prediction model, we conducted hierarchical clustering using SHAP values24,45, which highlight the contribution of each item in the prediction model. Hierarchical clustering was applied to the Wakayama health checkup records of individuals who developed each disease within one year, using Ward linkage and Euclidean distance. We then created ID and OOD data labels using as the threshold the OOD score at the rejection rate that maximized the AUROC in the rejection curve.
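A sketch of this clustering step, assuming model is the fitted XGBoost predictor and X_onset holds the Wakayama records with disease onset within one year (names illustrative):

```python
import shap
from scipy.cluster.hierarchy import linkage, fcluster

# SHAP contributions of each item for each onset record: (records, items).
shap_values = shap.TreeExplainer(model).shap_values(X_onset)

# Ward linkage on Euclidean distances, over records and over items.
record_linkage = linkage(shap_values, method="ward", metric="euclidean")
item_linkage = linkage(shap_values.T, method="ward", metric="euclidean")

# Cut the record dendrogram into two groups for comparison with ID/OOD labels.
record_groups = fcluster(record_linkage, t=2, criterion="maxclust")
```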