Introduction

Fragility fractures, particularly hip fractures, are serious injuries that can significantly increase mortality rates1. Due to symptomatic gait imbalance and a high incidence of falls, older adult patients with cervical disease—including degenerative cervical myelopathy (DCM)—have a significantly increased risk of fragility fractures2,3,4,5. For example, one study reported that the 12-month adjusted odds of experiencing at least one fragility fracture were 1.59 times higher in patients with DCM compared to general population controls2. Preventing falls through early detection and treatment of cervical disease should be a primary solution for such older adult patients6,7,8. In addition, treating osteoporosis can be an important strategy for such patients to prevent fragility fractures.

Current guidelines for patients who have undergone spine surgery emphasize the importance of managing osteoporosis9,10. One guideline recommends that dual-energy X-ray absorptiometry (DXA) scans—the standard examination for diagnosing osteoporosis—should be considered in all patients aged > 50 years9. Randomized controlled studies have shown that screening with DXA in women aged > 70 years represents a cost-effective intervention; however, it may be unrealistic to perform DXA on all patients with spinal disease aged > 50 years11,12. Therefore, establishing a simple, low-cost, and reliable screening tool for osteoporosis in patients with degenerative spinal disease—especially those with cervical disease—is needed13.

In this context, we developed a deep learning algorithm that can detect osteopenia/osteoporosis of the femur or lumbar spine using plain cervical radiography. This study aimed to validate the diagnostic yield of the deep learning algorithm and compare its diagnostic accuracy to that of spine physicians.

Methods

Study design and ethics

This cross-sectional study utilized data from patients who underwent cervical radiography and DXA. All study participants provided informed consent in accordance with the principles of the Declaration of Helsinki, and the study protocol was approved by the Institutional Review Board of Osaka Metropolitan University (No. 3170). All data were treated according to the Act on the Protection of Personal Information in Japan14.

Data collection

Data from patients who underwent cervical radiography and lumbar and femoral neck DXA scans were extracted from the medical records of three Japanese institutions. The selection criteria included the following: patients aged > 50 years, patients whose radiography and DXA scans were performed within 1 month, patients who had not undergone cervical surgery, and those without fresh cervical fractures or obvious metastases. Finally, 230 patients were selected for this study. Among the patients, 140 underwent both examinations as routine assessments for cervical surgery (n = 115 for DCM, n = 20 for cervical radiculopathy, and n = 5 for other conditions). A total of 90 patients who were followed for cervical disease (n = 55 for DCM, n = 35 for radiculopathy) underwent DXA scans either at their request or as part of health screening.

Definition of osteopenia/osteoporosis

Osteopenia and osteoporosis were confirmed based on T-scores from DXA scans of the femoral neck and lumbar spine. The lower T-score between the femoral neck and lumbar spine was considered the patient’s T-score. T-scores between −1.0 and −2.5 were defined as osteopenia, and T-scores less than −2.5 as osteoporosis15. To broaden patient identification as a screening tool, osteopenia and osteoporosis were combined into a single “osteopenia/osteoporosis” category. All data were then binarized based on the presence (T-score ≤ −1.0) or absence (T-score > −1.0) of osteopenia/osteoporosis.
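
The labelling rule above can be illustrated with a minimal Python sketch. The function and type names are hypothetical and are not taken from the study's software; the binary cutoff (T-score ≤ −1.0) follows the text, whereas treating a T-score of exactly −2.5 as osteoporosis is an assumption made only for this illustration.

```python
from typing import NamedTuple

OSTEOPENIA_CUTOFF = -1.0    # from the text: positive if T-score <= -1.0
OSTEOPOROSIS_CUTOFF = -2.5  # assumption: a T-score of exactly -2.5 counts as osteoporosis


class BoneLabel(NamedTuple):
    t_score: float   # lower of the femoral neck and lumbar spine T-scores
    category: str    # "normal", "osteopenia", or "osteoporosis"
    positive: bool   # binary reference label (osteopenia/osteoporosis present)


def label_patient(femoral_neck_t: float, lumbar_spine_t: float) -> BoneLabel:
    """Assign the reference label from the two DXA T-scores of one patient."""
    t = min(femoral_neck_t, lumbar_spine_t)  # the lower T-score represents the patient
    if t <= OSTEOPOROSIS_CUTOFF:
        category = "osteoporosis"
    elif t <= OSTEOPENIA_CUTOFF:
        category = "osteopenia"
    else:
        category = "normal"
    return BoneLabel(t, category, t <= OSTEOPENIA_CUTOFF)


# Values reported for the representative case in the Results section.
print(label_patient(femoral_neck_t=-1.91, lumbar_spine_t=-2.51))
```

Because only the binary label is used for training and evaluation, the three-way category is informational; the position of the −2.5 boundary does not affect the reference labels.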

Development of the deep learning algorithm

Overview of the algorithm design and development process (Fig. 1)

Fig. 1

Overview of the algorithm development. Reference labels (osteopenia/osteoporosis: T-score ≤ − 1.0) were assigned using DXA results. Expert-annotated vertebral regions (C3–C6) were extracted from original radiographs. The algorithm was designed to automatically (1) identify C3–C6 vertebral bodies, (2) exclude vertebrae with severe degenerative changes, (3) expand the region of interest to include posterior spinous processes, and (4) calculate the final score based on the average of the selected vertebrae. A CNN (EfficientNetB2) was used to classify images as positive or negative for osteopenia/osteoporosis. Comparisons of the accuracy between the algorithm and nine spine surgeons in predicting osteopenia/osteoporosis on cervical plain radiographs were performed.

All data were randomly divided into training and test datasets (n = 200 and n = 30, respectively). The algorithm was developed to predict the binary “presence” or “absence” of osteopenia/osteoporosis using cervical radiography, via (1) a preparation process, (2) a development process, and (3) a validation process.

Preparation process

Lateral cervical plain radiographs in the training dataset were extracted as 400 × 400-pixel JPEG files from the DICOM database after personal information was removed. Three board-certified spine surgeons with 12, 15, and 26 years of experience reviewed all training data and independently evaluated the degenerative changes in each vertebral body from C3 to C6. An independent board-certified spine surgeon with 15 years of experience manually annotated the C3, C4, C5, and C6 vertebral bodies on JPEG images of training data using computer software (e-Growth Co., Ltd.; Kyoto, Japan). Raw images, annotated images, information on degenerative changes in each vertebra, and the existence of osteopenia/osteoporosis (yes or no for the indicated case) were provided to the professional engineers.

Development process

The quantity of available training data was increased by applying data augmentation techniques (inversion, equalization, brightness adjustment, gamma correction, histogram adjustment, noise addition, and mix-up) to the images in the training dataset. A convolutional neural network (CNN) model was constructed and trained using the following process: (1) Vertebral bodies with no or moderate degenerative changes, including the spinous process, were selected. (2) Each vertebral body was horizontally aligned, and standardized 224 × 224-pixel sub-images were extracted. (3) To prevent overfitting due to insufficient training data, the upper and lower regions, including adjacent intervertebral spaces, were masked.

Using the augmented images, the engineers constructed a model with a CNN architecture called EfficientNetV2-S (https://arxiv.org/abs/2104.00298) implemented with TensorFlow/Keras (PyTorch 2.7.0 + CUDA 11.8). The CNN was trained and cross-validated using a computer with a GeForce RTX 4080 graphics processing unit (NVIDIA, Santa Clara, CA, USA). The model was trained for 100 epochs with a batch size of 16, using the Adam optimizer (learning rate 0.0001) and the binary cross-entropy loss function.

As internal validation, six-fold cross-validation was performed to establish the algorithm. All JPEG images were divided equally into six groups; five groups were used for training, whereas the remaining group was used for model validation. This process was repeated six times so that each group served once as the validation set16. Cases of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) were counted. The following parameters were then calculated: accuracy, defined as (TP + TN)/(TP + FP + FN + TN); sensitivity, defined as TP/(TP + FN); specificity, defined as TN/(TN + FP); and F1-score, defined as 2 × TP/(2 × TP + FP + FN).
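
The six-fold partition and the four metrics defined above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the engineers' implementation: the function names, the fixed random seed, and the way 200 images are spread over six groups (two groups of 34 and four of 33) are all hypothetical choices.

```python
import random
from typing import Dict, List, Sequence


def k_fold_indices(n: int, k: int = 6, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k near-equal groups."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # seed is an arbitrary, hypothetical choice
    return [idx[i::k] for i in range(k)]   # group sizes differ by at most one


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, FN, and TN for binary labels (1 = osteopenia/osteoporosis)."""
    c = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1:
            c["TP" if p == 1 else "FN"] += 1
        else:
            c["FP" if p == 1 else "TN"] += 1
    return c


def metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and F1-score as defined in the text."""
    tp, fp, fn, tn = c["TP"], c["FP"], c["FN"], c["TN"]
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


folds = k_fold_indices(200)
for i, val_idx in enumerate(folds):
    # The five remaining groups form the training set for this round.
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    print(f"round {i + 1}: train={len(train_idx)}, validation={len(val_idx)}")
```

Pooling the validation predictions from all six rounds and passing them to `confusion_counts` and `metrics` would give one set of cross-validated values for the whole training dataset.
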

Finally, the algorithm that predicts the presence or absence of osteopenia/osteoporosis using plain cervical radiography was established. The algorithm was designed to produce its output by automatically performing the following four steps within the program: (1) identify the C3–C6 vertebral bodies, (2) exclude vertebrae with severe degenerative changes from the analysis, (3) expand the region of interest to include the posterior spinous processes, and (4) calculate the final score based on the average of the analyzed vertebral levels.
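
Steps 2 and 4 of this pipeline (excluding severely degenerated vertebrae and averaging the remaining scores) can be illustrated with a short sketch. Everything specific in it is an assumption for illustration only: the data structure, the grading labels, the per-vertebra score range of 0 to 1, the decision threshold of 0.5, and the choice to return no result when every vertebra is excluded are not described in the text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class VertebraResult:
    level: str         # "C3" to "C6"
    degeneration: str  # hypothetical grading labels: "none", "moderate", "severe"
    score: float       # hypothetical per-vertebra CNN output in [0, 1]


def final_score(results: Sequence[VertebraResult]) -> Optional[float]:
    """Average the scores of vertebrae that are not severely degenerated."""
    usable = [r.score for r in results if r.degeneration != "severe"]
    return mean(usable) if usable else None  # None if no vertebra can be analyzed


def predict(results: Sequence[VertebraResult], threshold: float = 0.5) -> Optional[bool]:
    """Return True for predicted osteopenia/osteoporosis (threshold is an assumption)."""
    s = final_score(results)
    return None if s is None else s >= threshold


example = [
    VertebraResult("C3", "none", 0.8),
    VertebraResult("C4", "moderate", 0.6),
    VertebraResult("C5", "severe", 0.1),  # excluded from the average
    VertebraResult("C6", "none", 0.7),
]
print(final_score(example), predict(example))
```

Averaging over several vertebral levels is intended to make the final output less dependent on any single vertebra, which is consistent with the rationale for excluding severely degenerated levels.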

Validation process

The developed algorithm and nine spine surgeons independently predicted the binary presence or absence of osteopenia/osteoporosis in the femur or lumbar spine using cervical radiographic test data. Test data without any annotation or sampling were uploaded to the algorithm. Surgeons were not provided with any clinical information such as age or sex, but were allowed to use software functions to expand the radiographic image and control the image tone. The accuracy of the algorithm and each physician was calculated. Finally, the number of correct responses was compared between the deep learning algorithm and the spine surgeons.

Statistical analysis

The chi-square or Fisher’s exact test was used to compare categorical variables, and the Mann-Whitney U test was used to compare continuous variables. We calculated the area under the curve (AUC) in receiver operating characteristic (ROC) curve analysis to evaluate the diagnostic accuracy of the algorithm. All analyses were performed using SPSS version 23 (IBM Corp., Armonk, NY, USA). P-values < 0.05 were considered statistically significant.
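
For readers unfamiliar with the AUC, it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustration only and does not reproduce the SPSS procedure used in this study; the function name and the example scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six cases (1 = osteopenia/osteoporosis).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
print(round(roc_auc(labels, scores), 3))
```

The pairwise loop is quadratic in the number of cases, which is not a concern for datasets of the size used here.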

Representative case presentation

A case presentation was provided to demonstrate the clinical utility of the developed algorithm. We applied the algorithm to a real-world patient who was not included in the data used for algorithm development and validation.

Results

Demographics

The average T-score was −1.92 ± 1.44 in the training data and −1.63 ± 1.45 in the test data. The numbers of patients with and without osteopenia/osteoporosis were 155 (osteopenia: 83, osteoporosis: 72) and 45, respectively, in the training dataset, and 22 (osteopenia: 12, osteoporosis: 10) and 8, respectively, in the test dataset (Table 1). Approximately half of the patients in each dataset were receiving current osteoporosis treatment (Table 1).

Table 1 Demographic data.

Diagnostic results of the algorithm in the training dataset

The deep learning algorithm’s diagnostic accuracy, sensitivity, specificity, and F1 score under cross-validation were 0.860, 0.890, 0.756, and 0.908, respectively (Table 2). In the ROC analysis, the AUC for predicting osteopenia/osteoporosis was 0.869 (95% confidence interval: 0.822–0.916; Fig. 2).

Table 2 Diagnostic results of the deep learning algorithm.
Fig. 2

ROC curve of the diagnostic accuracy (training dataset, n = 200). ROC curve demonstrating the diagnostic performance of the deep learning algorithm on the training dataset (n = 200). The area under the curve (AUC) was 0.869, indicating good discriminatory ability between positive and negative cases. ROC: Receiver operating characteristic; AUC: Area under the curve.

Diagnostic results of the algorithm in the independent test dataset

The algorithm’s diagnostic accuracy, sensitivity, specificity, and F1 score in the independent test dataset were 0.800, 0.818, 0.750, and 0.857, respectively (Table 2). In the ROC analysis, the AUC for predicting osteopenia/osteoporosis was 0.858 (95% confidence interval: 0.733–0.983; Fig. 3).
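
As a consistency check, the confusion-matrix counts implied by the reported sensitivity and specificity can be reconstructed from the class sizes given under Demographics. The counts below are inferred by rounding and are not quoted from the text; the function names are hypothetical.

```python
from typing import Dict, Tuple


def reconstruct_counts(n_pos: int, n_neg: int,
                       sensitivity: float, specificity: float) -> Dict[str, int]:
    """Infer TP/FN/TN/FP from class sizes and rounded rates (an inference, not reported data)."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return {"TP": tp, "FN": n_pos - tp, "TN": tn, "FP": n_neg - tn}


def accuracy_and_f1(c: Dict[str, int]) -> Tuple[float, float]:
    """Accuracy and F1-score, rounded to three decimals as in the text."""
    total = c["TP"] + c["FP"] + c["FN"] + c["TN"]
    accuracy = (c["TP"] + c["TN"]) / total
    f1 = 2 * c["TP"] / (2 * c["TP"] + c["FP"] + c["FN"])
    return round(accuracy, 3), round(f1, 3)


# Class sizes and rates as reported for the two datasets.
test_counts = reconstruct_counts(n_pos=22, n_neg=8, sensitivity=0.818, specificity=0.750)
train_counts = reconstruct_counts(n_pos=155, n_neg=45, sensitivity=0.890, specificity=0.756)

print(test_counts, accuracy_and_f1(test_counts))
print(train_counts, accuracy_and_f1(train_counts))
```

Under these inferred counts, the accuracy and F1-score obtained for both datasets agree with the reported values, which suggests that the four reported metrics are mutually consistent.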

Fig. 3

ROC curve of the diagnostic accuracy (test dataset, n = 30). ROC curve illustrating the diagnostic performance of the deep learning algorithm on the test dataset (n = 30). The AUC was 0.858, indicating that the model maintained a high discriminative ability on unseen data. ROC: Receiver operating characteristic; AUC: area under the curve.

Comparison of accuracy between the deep learning algorithm and physicians

The accuracy of predictions by nine spine surgeons ranged from 0.533 to 0.700 (average: 0.606; Table 3). A comparison of correct answers between the deep learning algorithm and nine spine surgeons demonstrated that the number of correct answers by the algorithm was significantly higher than that of the spine surgeons (Table 4, p = 0.032).

Table 3 Diagnostic accuracy in the independent test dataset (n = 30).
Table 4 Comparison of the diagnostic accuracy.

Representative case

A representative case involved a 61-year-old woman who underwent cervical laminoplasty for severe DCM and had no prior diagnosis of osteoporosis or history of fragility fracture. Her preoperative cervical radiograph was uploaded to the algorithm (Fig. 4). The algorithm indicated a potential diagnosis of osteopenia/osteoporosis. The results were explained to the patient, who then opted for a detailed examination. DXA scans revealed T-scores of −2.51 in the lumbar spine and −1.91 in the femoral neck. She was diagnosed with osteoporosis and treated with medication for osteoporosis before surgery.

Fig. 4

Plain cervical radiograph uploaded to the algorithm. A preoperative lateral cervical spine radiograph of a 61-year-old woman with severe degenerative cervical myelopathy. The algorithm identified a potential diagnosis of osteopenia/osteoporosis from this image, which was later confirmed by DXA. DXA: Dual-energy X-ray absorptiometry.

Discussion

The diagnostic accuracy and AUC of our deep learning algorithm for detecting osteopenia/osteoporosis using cervical radiography were 0.80 and 0.86, respectively, in the independent test dataset. The deep learning algorithm also performed significantly better than experienced spine surgeons in identifying osteopenia/osteoporosis using cervical radiography.

The current deep learning algorithm was designed to detect not only patients with osteoporosis (T-score ≤ −2.5) but also those with osteopenia (−1.0 > T-score > −2.5)17. Although fracture risk is often lower in patients with osteopenia than in those with osteoporosis, reports indicate that most fractures occur in individuals with osteopenia18. As the current deep learning algorithm was built to serve as a screening tool for patients with cervical disease, such as DCM, it was designed to detect both osteopenia and osteoporosis. From this perspective, one strength of the algorithm is its high sensitivity (0.82) in an independent dataset. However, the final diagnosis should be determined by DXA as the specificity of our algorithm (0.75) is somewhat low. Another strength of the algorithm is the quality of the training and test data, which are crucial factors for its reliability. Data were extracted from multiple institutions, the ground truth (T-scores from femur or lumbar spine DXA scans) was clear and objective, and the selection criteria were consistent19. All these characteristics might support the consideration of our algorithm as a screening tool for patients with cervical disease.

The current algorithm to detect osteopenia/osteoporosis on cervical radiography involves four steps. Initially, we attempted to create the algorithm using raw radiography data without processing. However, the outcomes did not reach a clinically useful level. Therefore, several modifications were incorporated into the development of the current algorithm: (1) recognizing the C3–C6 vertebral bodies; (2) classifying the degenerative changes of the vertebral bodies and omitting those with severe degenerative changes from the analysis20; (3) expanding the region of interest to include the posterior spinous processes at the indicated levels21; and (4) basing the results on the average scores of the analyzed vertebral levels. Finally, the current algorithm was designed to perform this processing automatically, without requiring manual annotation of the input radiographs.

Similar to the current algorithm, several artificial intelligence (AI)-assisted osteoporosis screening tools have been developed using various sources22,23,24,25,26,27,28,29,30. Wani et al. developed CNN models that can detect osteoporosis in knee radiographs25. Sukegawa et al. reported a deep learning model that can identify osteoporosis from dental panoramic radiographs26. Ho et al. created a deep learning model designed to infer bone mineral density data from plain pelvic radiographs28. Furthermore, Lin et al. conducted a randomized controlled trial and concluded that providing DXA screening to a high-risk group identified through AI-enabled chest radiographs can effectively diagnose more patients with osteoporosis29. In the field of spinal imaging, Hong N. et al. reported that their algorithm detects vertebral fractures and osteoporosis on lateral thoracolumbar or thoracolumbosacral radiographs more accurately than clinicians23,24. A similar attempt was made by Zhang B. et al. using lumbar spine radiography22. Additionally, Mao L. et al. developed an algorithm for screening primary osteopenia and osteoporosis using lumbar radiographs and patient clinical covariates30. Although no AI model capable of identifying osteoporosis on cervical radiography has been reported, deep learning algorithms that can detect cervical canal stenosis and ossification of the posterior longitudinal ligament have already been reported31,32. Integrating these AI technologies into clinical workflows may enhance the diagnostic value of standard examinations such as plain radiography, while incurring little to no additional cost.

In terms of the implementation of AI, SPINE20, an advocacy group focused on global spine disorder awareness, released a 2024 recommendation: “SPINE20 recommends that G20 countries support ongoing research initiatives on digital technologies, including AI, regulate digital technologies, and promote evidence-based, ethical digital solutions in all aspects of spine care, to enrich patient care with high value and quality.”33 Although several barriers remain—such as ethical and legal issues, reliability, and patient privacy—implementing AI in clinical settings could improve patient care while offering cost-effective options for insurance systems.

The current study and our algorithm have several limitations. All cervical radiographic images were collected from patients in the Japanese population34. Although no major differences between Japanese individuals and other populations have been observed, minor differences (e.g., in spinal canal diameter) may be crucial parameters for the deep learning algorithm35,36. Future validation of the current algorithm should involve datasets representing diverse racial backgrounds. Although we used a six-fold cross-validation technique and an independent test dataset for validation, the sample size in each dataset was relatively small, which may have limited the robustness of the algorithm. To overcome this limitation, a larger sample size, along with robustness tests such as perturbation-based evaluation and adversarial robustness testing, would be ideal for creating a more precise algorithm16. In addition, the current cross-sectional study did not include clinical information such as the types and severity of cervical disease or the treatment of osteoporosis. The algorithm’s outcome might have been influenced by a history of osteoporosis treatment. In comparing the physicians with the algorithm, the potential variability in physicians’ assessments should be considered. Finally, the specificity of the current algorithm was relatively low (0.75). An international, longitudinal, large-scale study with precise clinical data is required to overcome these limitations.

In conclusion, we developed a deep learning algorithm capable of detecting patients with osteopenia/osteoporosis (T-score ≤ −1.0) in the femur or lumbar spine using cervical radiography. The diagnostic yield of the algorithm was higher than that of experienced spine surgeons. As older adult patients with cervical diseases such as DCM have a higher risk of fragility fractures than those without, we believe that, after additional validation, our algorithm might serve as a low-cost screening tool for achieving early detection and timely treatment of osteoporosis in these patients.