Evaluation of a novel ensemble model for preoperative ovarian cancer diagnosis: Clinical factors, O-RADS, and deep learning radiomics.

doi:10.1016/j.tranon.2025.102335

2025 · doi:10.1016/j.tranon.2025.102335 · PMID:40048985 · PMC11928997

OA: gold CC-BY-NC-ND-4.0

CRediT authorship contribution statement

Yimin Wu: Writing – review & editing, Writing – original draft, Visualization. Lifang Fan: Writing – review & editing. Haixin Shao: Data curation. Jiale Li: Data curation. Weiwei Yin: Data curation. Jing Yin: Data curation. Weiyu Zhu: Data curation. Pingyang Zhang: Project administration, Methodology. Chaoxue Zhang: Project administration, Conceptualization. Junli Wang: Project administration, Methodology, Conceptualization.

Ethics

This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of Wuhu Second People's Hospital.

Funding

This study was supported by the scientific research project of the Second People's Hospital of Wuhu City (LC2022C08), the teaching hospital scientific research project of Wannan Medical College (WK2023JXYY133), and the Key Research and Achievement Transformation Program of the Wuhu Science and Technology Bureau (2023yf122).

Results

Table 1 summarizes the demographic and clinical data for the study cohorts. In the HSFY cohort, 590 patients were randomly divided into training (413 patients) and internal validation (177 patients) sets. The training set had a median age of 41 years (range 11–90), lesion diameters of 19–256 mm, and CA125 levels ranging from 2.4 to 5000 U/mL, with 349 (84.45 %) benign and 64 (15.55 %) malignant lesions. The internal validation set showed similar distributions, with a median age of 44 years (13–82), lesion diameters of 21–250 mm, and CA125 levels from 2.9 to 9855 U/mL, comprising 150 (84.7 %) benign and 27 (15.3 %) malignant lesions. The AYFY cohort (external validation) included 312 patients with a median age of 38 years (11–88), lesion diameters of 17–244 mm, and CA125 levels from 2.5 to 22,067 U/mL, of which 247 (79.15 %) were benign and 65 (20.85 %) were malignant. In the HSFY cohort, 90.5 % (532/590) of patients had CA125 levels evaluated, with a missing rate of 9.5 % (58/590); in the AYFY cohort, 94.5 % (303/312) of patients had CA125 levels evaluated, with a missing rate of 5.5 % (9/312).

Table 1 Demographic characteristics of the participants.

| Characteristic | HSFY training set | HSFY internal validation set | AYFY external validation set |
| --- | --- | --- | --- |
| No. of patients | N = 413 | N = 177 | N = 312 |
| Age (years) | 41.0 [30.0;52.0] | 44.0 [30.0;54.0] | 38.0 [29.0;51.0] |
| Age range (years) | 11–90 | 13–82 | 11–88 |
| Menopausal status: | | | |
| Postmenopause | 100 (24.2 %) | 54 (30.5 %) | 75 (24.0 %) |
| Premenopause | 313 (75.8 %) | 123 (69.5 %) | 237 (76.0 %) |
| CA125 (U/mL) | 19.4 [12.1;47.7] | 16.8 [10.6;32.8] | 25.3 [13.4;64.4] |
| CA125 range (U/mL) | 2.4–5000 | 2.9–9855 | 2.5–22,067 |
| Lesion diameter (mm) | 70.0 [52.0;92.0] | 67.0 [52.0;84.0] | 72.0 [52.8;101] |
| Lesion diameter range (mm) | 19–256 | 21–250 | 17–244 |
| Histological type: | | | |
| Benign (%) | 349 (84.45 %) | 150 (84.7 %) | 247 (79.15 %) |
| Teratoma | 163 (39.5 %) | 70 (39.5 %) | 69 (22.1 %) |
| Cystadenoma | 89 (21.5 %) | 42 (23.7 %) | 50 (16.0 %) |
| Endometriosis | 60 (14.5 %) | 24 (13.6 %) | 86 (27.6 %) |
| Cyst | 14 (3.39 %) | 2 (1.13 %) | 26 (8.33 %) |
| Fibroma | 16 (3.87 %) | 10 (5.65 %) | 2 (0.64 %) |
| Thecoma | 5 (1.21 %) | 1 (0.56 %) | 7 (2.24 %) |
| Other | 2 (0.48 %) | 1 (0.56 %) | 7 (2.24 %) |
| Malignant (%) | 64 (15.55 %) | 27 (15.3 %) | 65 (20.85 %) |
| Serous carcinoma | 37 (8.96 %) | 14 (7.91 %) | 31 (9.94 %) |
| Borderline tumor | 9 (2.18 %) | 5 (2.82 %) | 3 (0.96 %) |
| Endometrioid carcinoma | 3 (0.73 %) | 2 (1.13 %) | 5 (1.60 %) |
| Clear cell carcinoma | 2 (0.48 %) | 2 (1.13 %) | 6 (1.92 %) |
| Mucinous carcinoma | 3 (0.73 %) | 1 (0.56 %) | 0 (0.00 %) |
| Other | 10 (2.42 %) | 3 (1.69 %) | 20 (6.41 %) |

Each patient's region of interest (ROI) yielded a total of 851 radiomics features and 2048 deep learning (DL) features. Following the exclusion of radiomics features with intra-class correlation coefficients (ICCs) below 0.8, 687 radiomics features remained. Subsequent Mann-Whitney U tests revealed 17 radiomics features and 123 DL features with statistically significant p-values below 0.05. Utilizing LASSO regression, three distinct models were constructed: the Radiomics model, the DL model, and the DL-Radiomics model, each based on the respective sets of radiomics features, DL features, and the combined DL-Radiomics features.
Table 2 shows that the DL-Radiomics model exhibited excellent diagnostic accuracy, achieving an AUC of 0.91 [0.87–0.94] in the internal validation group and 0.93 [0.91–0.96] in the external validation group. The process of feature selection is elaborated upon in Table S5 and Figs. S6 and S7.

Table 2 Performance comparison of three models based on LASSO construction.

| Model | Training set (N = 413): sensitivity | specificity | AUC (95 % CI) | Internal validation set (N = 177): sensitivity | specificity | AUC (95 % CI) | External validation set (N = 312): sensitivity | specificity | AUC (95 % CI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Radiomics | 0.82 | 0.69 | 0.84 [0.81–0.85] | 0.87 | 0.75 | 0.87 [0.82–0.90] | 0.84 | 0.65 | 0.81 [0.77–0.85] |
| DL | 0.78 | 0.86 | 0.88 [0.87–0.90] | 0.89 | 0.64 | 0.83 [0.79–0.88] | 0.81 | 0.89 | 0.87 [0.85–0.94] |
| Radiomics + DL | 0.87 | 0.84 | 0.93 [0.92–0.94] | 0.78 | 0.93 | 0.91 [0.87–0.94] | 0.86 | 0.87 | 0.93 [0.91–0.96] |

LASSO, Least Absolute Shrinkage and Selection Operator; DL, deep learning; AUC, Area under the receiver operating characteristic curve; 95 %CI, 95 % confidence intervals.

Prior to training the model, collinearity diagnostics were conducted to evaluate possible multicollinearity problems, utilizing the variance inflation factor (VIF) for collinearity quantification. All variables had acceptable VIF values (Table S8), allowing the retention of all variables to ensure complete clinical and imaging information. Three models were trained by sequentially adding variables: (1) Clinical model: age, menstrual status, tumor diameter, and CA-125; (2) Clinical O-RADS model: adding the sonographer's O-RADS assessment; (3) Ensemble model: adding DL-Radiomics model predictions.
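The sensitivity, specificity, and AUC values reported in Table 2 can be illustrated with a minimal sketch. The labels, scores, and the 0.5 cut-off below are made-up toy values, not study data, and the function names are hypothetical. The AUC is computed here as the fraction of malignant/benign pairs in which the malignant case receives the higher score, which is one standard way to define it and not necessarily how the authors' software computed it.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(labels: Sequence[int], scores: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity); scores >= threshold are called malignant."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (malignant, benign) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = malignant, 0 = benign.
labels = [0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.05]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
toy_auc = auc(labels, scores)
```

Because sensitivity and specificity depend on the chosen cut-off while the AUC does not, a model can rank cases well and still show a lopsided sensitivity/specificity pair at a given threshold.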
After training, the models were held fixed and evaluated on the internal and external validation sets (Table S9 lists the optimal hyperparameters). Fig. 3 shows that the models' performance improved as more clinical features, O-RADS scores, and DL-Radiomics model predictions were included. The ensemble model showed excellent diagnostic performance, achieving an AUC of 0.97 in both internal and external validation groups. The internal validation set had a sensitivity of 0.96 and a specificity of 0.91, while the external validation set had a sensitivity of 0.95 and a specificity of 0.92. The ROC-based performance evaluation of the three models is detailed in Table S10. The Ensemble model demonstrated a significant incremental predictive value over the Clinical and Clinical O-RADS models, as confirmed by NRI and IDI metrics (p < 0.05). Decision Curve Analysis (DCA) highlighted the greater net benefit of the Ensemble model across threshold probabilities of 5 %–50 % (Fig. S12).

Fig. 3 (A) Model evaluation metrics (AUC, Sensitivity, Specificity) for the clinical model, clinical O-RADS model, and ensemble model across the training set, internal validation set, and external validation set. ROC curves were generated for the clinical model, clinical O-RADS model, and ensemble model in the training set (B), internal validation set (C), and external validation set (D).
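The IDI and decision-curve net benefit mentioned above can be sketched as follows. This is an illustrative sketch using the standard textbook definitions, not the authors' R code (they report using 'Hmisc' and 'dcurves'). The variable names and predicted probabilities are invented for the example.

```python
from typing import Sequence


def net_benefit(labels: Sequence[int], probs: Sequence[float], threshold: float) -> float:
    """Net benefit of acting on patients whose predicted risk is >= threshold.

    NB = TP/n - FP/n * pt / (1 - pt), where pt is the threshold probability.
    """
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    return tp / n - fp / n * threshold / (1 - threshold)


def idi(labels: Sequence[int], probs_old: Sequence[float],
        probs_new: Sequence[float]) -> float:
    """Integrated discrimination improvement of the new model over the old one."""
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    events_new = mean(p for y, p in zip(labels, probs_new) if y == 1)
    events_old = mean(p for y, p in zip(labels, probs_old) if y == 1)
    nonevents_new = mean(p for y, p in zip(labels, probs_new) if y == 0)
    nonevents_old = mean(p for y, p in zip(labels, probs_old) if y == 0)
    return (events_new - events_old) - (nonevents_new - nonevents_old)


# Toy predictions (assumption): 1 = malignant, 0 = benign.
labels = [1, 1, 0, 0, 0]
probs_clinical = [0.6, 0.4, 0.5, 0.3, 0.2]   # hypothetical baseline model
probs_ensemble = [0.9, 0.7, 0.3, 0.2, 0.1]   # hypothetical richer model

nb_clinical = net_benefit(labels, probs_clinical, threshold=0.2)
nb_ensemble = net_benefit(labels, probs_ensemble, threshold=0.2)
toy_idi = idi(labels, probs_clinical, probs_ensemble)
```

A positive IDI means the new model raises predicted risk for malignant cases and lowers it for benign ones on average. A decision curve repeats the net-benefit calculation across a range of thresholds, such as the 5 %–50 % range reported here.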
Given the clinical challenge of diagnosing ovarian cancer when CA125 is normal and tumor size is relatively small, subgroup analyses were performed on the external validation cohort for patients with normal CA125 levels (≤35 U/mL) and tumor diameters <10 cm. In the external validation cohort, 186 out of 312 patients (44.6 %) exhibited normal CA125 levels, while 229 out of 312 patients (82.6 %) had tumor diameters <10 cm. The ensemble model demonstrated favorable diagnostic accuracy in both subgroups, as illustrated in Fig. S13. In the normal CA125 subgroup, the AUC was 0.96, with a sensitivity of 0.92 and a specificity of 0.94. In the subgroup with tumor diameters <10 cm, the AUC was 0.97, sensitivity was 0.94, and specificity was 0.93.

Interpretation of the ensemble model further elucidates its decision-making process. As shown in Fig. 4 A, DL-Radiomics predictions had the most significant influence on model outcomes, followed by O-RADS scores, CA125, patient age, lesion diameter, and menstrual status. Fig. 4 B–D illustrate a case of ovarian endometrioid carcinoma, where the model predicted a 96.4 % likelihood of malignancy. The corresponding waterfall chart highlights the individual contributions of each feature, offering insights into the model's transparent and interpretable prediction mechanism.

Fig. 4 (A) SHAP summary plot showing the global contributions of variables to the model predictions. (B-E) Local explanation of the model for a single instance of ovarian endometrioid carcinoma: SHAP waterfall plot (B), ultrasound grayscale image (C), ultrasound color Doppler image (D), and 10×10 histopathological image (E).
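To make the idea behind a SHAP waterfall plot concrete, the sketch below computes exact Shapley values for one case by enumerating all feature coalitions. It is a simplified illustration, not the SHAP tool the authors used. Absent features are replaced by a single baseline vector (an assumption; SHAP typically averages over a background dataset). The three-feature linear scoring function and its weights are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(predict: Callable[[List[float]], float],
                   x: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values for one instance; absent features take baseline values."""
    n = len(x)

    def value(present: set) -> float:
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z: List[float]) -> float:
    """Hypothetical linear score over [dl_radiomics, orads_scaled, ca125_scaled]."""
    return 0.1 + 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[2]


case = [0.9, 1.0, 0.4]       # made-up feature values for one patient
reference = [0.2, 0.5, 0.3]  # made-up baseline (e.g. cohort averages)

contributions = shapley_values(toy_model, case, reference)
```

The contributions sum to the difference between the prediction for the case and the prediction for the baseline, which is the property a waterfall plot visualizes. Exact enumeration grows exponentially with the number of features, so it is only practical for small models like this toy one.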
The three sonographers demonstrated high diagnostic performance in classifying adnexal tumors. O-RADS scores were standardized to a range of 0 to 1 in order to calculate the AUC. Senior sonographers (B and C) had higher AUC, sensitivity, and specificity compared to junior sonographer A. With the assistance of the ensemble model, all three sonographers showed significant improvement in diagnostic capabilities, with junior sonographer A showing the most notable improvement. Sonographer A's AUC in the internal validation set improved from 0.80 to 0.94, with sensitivity rising from 0.80 to 0.92, and specificity from 0.70 to 0.92. Sonographer B's AUC increased from 0.86 to 0.95, sensitivity from 0.92 to 0.96, and specificity from 0.70 to 0.93. Sonographer C's AUC increased from 0.86 to 0.96, sensitivity from 0.93 to 0.97, and specificity from 0.71 to 0.93. Similar improvements were observed in the external validation set for all sonographers. Table 3 shows the diagnostic performance of sonographers using O-RADS and the ensemble model. Using the ensemble model raised the mean AUC of the three sonographers by 11 percentage points in the internal validation set (84 % to 95 %) and by 7.7 percentage points in the external validation set (89 % to 96.7 %), with mean accuracy gains of 14.4 percentage points (79.3 % to 93.7 %) and 12.7 percentage points (83.6 % to 96.3 %), respectively. Table S14 displays the agreement assessment among sonographers: with the ensemble model, kappa values rose from 0.65–0.68 to 0.76–0.82 in the internal validation set and from 0.65–0.68 to 0.78–0.80 in the external validation set.

Table 3 Diagnostic performance of sonographers using O-RADS and ensemble models.

| Set | Metric | Sonographer A: O-RADS | Sonographer A: Ensemble model assistance | Sonographer B: O-RADS | Sonographer B: Ensemble model assistance | Sonographer C: O-RADS | Sonographer C: Ensemble model assistance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training set | AUC (95 %CI) | 0.83 (0.77–0.89) | 0.95 (0.91–0.99) | 0.88 (0.83–0.93) | 0.95 (0.92–0.99) | 0.88 (0.82–0.93) | 0.96 (0.92–0.99) |
| Training set | Sensitivity | 0.81 | 0.96 | 0.89 | 0.97 | 0.91 | 0.98 |
| Training set | Specificity | 0.79 | 0.92 | 0.81 | 0.92 | 0.83 | 0.92 |
| Training set | Accuracy | 0.80 | 0.94 | 0.85 | 0.95 | 0.87 | 0.95 |
| Internal validation set | AUC (95 %CI) | 0.80 (0.70–0.90) | 0.94 (0.87–0.99) | 0.86 (0.77–0.94) | 0.95 (0.89–0.99) | 0.86 (0.76–0.95) | 0.96 (0.90–0.99) |
| Internal validation set | Sensitivity | 0.80 | 0.92 | 0.92 | 0.96 | 0.93 | 0.97 |
| Internal validation set | Specificity | 0.70 | 0.92 | 0.70 | 0.93 | 0.71 | 0.93 |
| Internal validation set | Accuracy | 0.75 | 0.92 | 0.81 | 0.94 | 0.82 | 0.95 |
| External validation set | AUC (95 %CI) | 0.87 (0.83–0.91) | 0.97 (0.96–0.99) | 0.89 (0.86–0.93) | 0.98 (0.96–0.99) | 0.91 (0.88–0.94) | 0.98 (0.97–0.99) |
| External validation set | Sensitivity | 0.75 | 0.94 | 0.84 | 0.96 | 0.83 | 0.96 |
| External validation set | Specificity | 0.88 | 0.98 | 0.86 | 0.97 | 0.88 | 0.98 |
| External validation set | Accuracy | 0.81 | 0.96 | 0.85 | 0.96 | 0.85 | 0.97 |

95 %CI, 95 % confidence intervals; O-RADS, Ovarian-Adnexal Reporting and Data System; AUC, Area under the receiver operating characteristic curve.
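The averaged improvements quoted in the text can be reproduced from the per-sonographer values in Table 3. The short sketch below does this arithmetic for the AUC in both validation sets and for accuracy in the internal validation set. The dictionary layout and function name are illustrative choices; the numbers are taken from Table 3.

```python
from typing import Dict, Tuple

# (O-RADS alone, with ensemble model assistance) per sonographer, from Table 3.
internal_auc = {"A": (0.80, 0.94), "B": (0.86, 0.95), "C": (0.86, 0.96)}
external_auc = {"A": (0.87, 0.97), "B": (0.89, 0.98), "C": (0.91, 0.98)}
internal_acc = {"A": (0.75, 0.92), "B": (0.81, 0.94), "C": (0.82, 0.95)}


def mean_before_after(table: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Average the unassisted and assisted values across sonographers."""
    before = sum(pair[0] for pair in table.values()) / len(table)
    after = sum(pair[1] for pair in table.values()) / len(table)
    return before, after


int_auc_before, int_auc_after = mean_before_after(internal_auc)
ext_auc_before, ext_auc_after = mean_before_after(external_auc)
int_acc_before, int_acc_after = mean_before_after(internal_acc)

# Differences are in absolute terms (percentage points when multiplied by 100).
int_auc_gain = int_auc_after - int_auc_before
ext_auc_gain = ext_auc_after - ext_auc_before
```

The gains are absolute differences between mean values, so "11 %" corresponds to a move from 0.84 to 0.95 rather than a relative increase.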

Informed consent

As a retrospective study, informed consent was waived by the ethics committee. Patients’ records were anonymized and de-identified prior to analysis.

Materials and methods

Approval for this retrospective study protocol was granted by the Institutional Review Board of Wuhu Hospital Affiliated to East China Normal University (Wuhu Second People's Hospital), which waived the requirement for informed consent. Patients with adnexal lesions confirmed by surgical pathology were retrospectively collected from two independent centers. Center 1 (HSFY cohort) included 590 patients, randomly divided into a training set (70 %) and an internal validation set (30 %) for model development and evaluation. Center 2 (AYFY cohort) included 312 patients, who were used solely for external validation. Patients were excluded if they had previously been diagnosed with ovarian cancer, if the ultrasound images were of poor quality, or if clinical or ultrasound data were missing. Clinical and ultrasound information for each participant was retrieved from medical records and de-identified for analysis. Each patient underwent a transvaginal ultrasound (TVUS) examination, with abdominal ultrasound being utilized in cases where the lesion was too large for comprehensive evaluation via TVUS. When multiple lesions were identified, the most morphologically complex lesion was chosen for analysis, with preference given to the largest lesion in cases of similar characteristics. Borderline tumors were categorized as malignant. The process of patient inclusion is depicted in Fig. 1.

Fig. 1 The flowchart illustrates the patient enrollment process for the study in Center 1 and Center 2.

All patients underwent TVUS examination before surgical resection. The equipment used included Samsung WS80A, GE VOLUSON E8, GE VOLUSON E10, Mindray Resona 7S, and Hitachi HI VISION Preirus, with probe frequencies ranging from 3.0 to 10.0 MHz.
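The 70 %/30 % random division of the Center 1 cohort described above can be sketched as a seeded shuffle. The text does not state whether the split was stratified by diagnosis or which seed was used, so the plain shuffle, the seed, and the integer patient identifiers below are all assumptions made for illustration.

```python
import random
from typing import Iterable, List, Tuple


def split_cohort(patient_ids: Iterable[int], train_fraction: float = 0.7,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle patient identifiers reproducibly and cut them into two sets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # local generator, so global state is untouched
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# 590 placeholder identifiers standing in for the HSFY cohort.
train_ids, validation_ids = split_cohort(range(590))
```

With 590 patients, a 70 % share rounds to 413, which matches the training and internal validation set sizes (413 and 177) reported in the Results.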
Three sonographers, labeled as A, B, and C, each with different levels of experience in gynecological ultrasound diagnosis, evaluated all de-identified images of adnexal lesions without knowledge of clinical or pathological information. Each lesion was categorized based on the O-RADS scoring system: 2 (almost certainly benign, malignancy risk <1 %), 3 (low malignancy risk, 1–10 %), 4 (intermediate malignancy risk, 10–50 %), or 5 (high malignancy risk, ≥50 %).

Two skilled sonographers, who were unaware of the final histopathological data, evaluated the TVUS images separately. The first sonographer, with 5 years of experience in gynecological ultrasound, used ITK-SNAP software (v.4.2.0) to outline the tumor boundaries and generate regions of interest (ROI) in the images. The outlined images were saved as NIfTI-format mask files. One month later, 30 patients were randomly chosen for re-delineation of the ROIs by the first sonographer and by another sonographer with 10 years of experience. Intraclass correlation coefficients (ICCs) were calculated to ensure delineation consistency.

Radiomics features were extracted using PyRadiomics (version 3.1.0; https://pyradiomics.readthedocs.io/en/latest/ ), yielding a total of 851 features including basic statistics, shape attributes, and different texture matrices such as gray level run length, size zone, co-occurrence, dependence, and neighboring gray tone difference matrices (Table S1). Deep learning (DL) features were derived using a ResNet50 model [ 22 ] pre-trained on the ImageNet dataset ( http://www.image-net.org ), resulting in 2048 DL features per ROI (Appendix S2, Fig. S3). Radiomics features with inter- and intra-observer ICCs >0.8 were selected for analysis. Mann-Whitney U tests identified radiomics and DL features with p < 0.05, generating three subsets: radiomics features, DL features, and combined DL Radiomics features.
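The O-RADS categories listed above map naturally onto a small lookup table. The sketch below also shows one way to rescale a category to the 0–1 range, which the Results mention as a step before computing the sonographers' AUC. The paper does not specify the rescaling formula, so min-max scaling over categories 2 to 5 is an assumption, as are the names `ORADS_RISK` and `scale_orads`.

```python
from typing import Dict, Tuple

# Category -> (description, lower risk bound, upper risk bound), as listed in the text.
ORADS_RISK: Dict[int, Tuple[str, float, float]] = {
    2: ("almost certainly benign", 0.00, 0.01),
    3: ("low malignancy risk", 0.01, 0.10),
    4: ("intermediate malignancy risk", 0.10, 0.50),
    5: ("high malignancy risk", 0.50, 1.00),
}


def scale_orads(category: int) -> float:
    """Min-max scale an O-RADS category (2-5) to [0, 1].

    ASSUMPTION: the paper only says scores were standardized to 0-1;
    min-max scaling is one plausible choice, not the confirmed method.
    """
    if category not in ORADS_RISK:
        raise ValueError(f"unsupported O-RADS category: {category}")
    lowest, highest = min(ORADS_RISK), max(ORADS_RISK)
    return (category - lowest) / (highest - lowest)


scaled = {category: scale_orads(category) for category in ORADS_RISK}
```

Any monotonic rescaling leaves the AUC unchanged, because the AUC depends only on how cases are ranked, so the exact formula matters less than preserving the category order.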
To address class imbalance, SMOTE was applied during training [ 23 ]. Models based on radiomics, DL, and combined features were developed using LASSO regression and evaluated via 10-fold cross-validation. Receiver operating characteristic (ROC) curves were used to assess performance, identifying the optimal model for integration.

A machine learning framework was established to integrate multi-modal information with the participation of sonographers. Prior to model construction, we computed variance inflation factor (VIF) values to eliminate highly collinear features. We then sequentially incorporated (1) age, menstrual status, tumor diameter, and CA-125; (2) O-RADS; and (3) DL radiomics predictions into an ensemble of Elastic Net, glmboost, and Random Forest (ranger) algorithms implemented in R's mlr3 environment. Each algorithm was embedded in an mlr3 pipeline comprising four preprocessing stages: encoding, imputation, z-score normalization, and the Synthetic Minority Over-sampling Technique (SMOTE). During 5-fold cross-validation, SMOTE was applied only to the training folds to prevent synthetic data from leaking into the validation folds. We then employed 5-fold cross-validation combined with a random search strategy to optimize hyperparameters, aiming to maximize the area under the receiver operating characteristic curve (AUC). The hyperparameter ranges are detailed in Table S4. To further enhance model robustness and mitigate overfitting, we repeated the entire cross-validation procedure five times with different random seeds, averaging the prediction scores from these five independent runs to form the final model output. Finally, the ensemble model's performance was evaluated on an external validation set to assess its generalizability in real-world scenarios. A schematic workflow is illustrated in Fig. 2.

Fig. 2 Diagram illustrating the machine learning framework used for training and validating models.

Predictive accuracy improvements were quantified using Integrated Discrimination Improvement (IDI) and Net Reclassification Improvement (NRI) metrics. Performance metrics, including AUC, accuracy, sensitivity, and specificity, were assessed using ROC curve analysis. Decision Curve Analysis (DCA) was employed to evaluate the clinical utility of the models, while subgroup analyses assessed their generalizability across diverse clinical scenarios. To enhance transparency in the model's decision-making process, Shapley values were used to quantify the distinct contributions of each feature. Local Shapley values highlighted feature impacts on individual samples, whereas global Shapley values provided average contributions across the dataset. Together, these analyses provided a comprehensive understanding of the model's predictive performance and interpretability.

Statistical analysis was conducted using R software (version 4.4.0; http://www.R-project.org ), with a significance level set at p < 0.05 (two-tailed). Continuous data were analyzed with t-tests or Mann-Whitney U tests, and differences in categorical variables were examined with Pearson's chi-square test or Fisher's exact test. The mlr3 package was used to build the machine learning framework, and model performance was assessed with ROC curves. The pROC package was employed for ROC curve plotting and AUC calculation. Confidence intervals were determined through the bootstrap technique, IDI and NRI were calculated with the 'Hmisc' package, DCA was carried out using the 'dcurves' package, and model interpretation was conducted using SHapley Additive exPlanations (SHAP).
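Two points in the workflow above are easy to get wrong: oversampling must touch only the training folds, and the final score is an average over repeated cross-validation runs. The sketch below illustrates both in plain Python and is not the authors' mlr3 pipeline. Random duplication of minority cases stands in for SMOTE, a one-feature midpoint scorer stands in for the Elastic Net, glmboost, and Random Forest learners, and the data are synthetic; all names are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, int]  # (single feature value, label: 1 = malignant, 0 = benign)


def k_fold_indices(n: int, k: int, seed: int) -> List[List[int]]:
    """Shuffle 0..n-1 with the given seed and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversample_minority(rows: List[Row], rng: random.Random) -> List[Row]:
    """Duplicate random minority rows until classes balance (simple stand-in for SMOTE)."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


def fit_midpoint_model(rows: List[Row]) -> Callable[[float], float]:
    """Hypothetical learner: score = feature minus the midpoint of the class means."""
    midpoint = (mean(x for x, y in rows if y == 1) + mean(x for x, y in rows if y == 0)) / 2
    return lambda x: x - midpoint


def repeated_cv_scores(data: Sequence[Row], k: int = 5,
                       seeds: Sequence[int] = (0, 1, 2, 3, 4)) -> List[float]:
    """Out-of-fold scores per case, averaged over one k-fold run per seed."""
    runs = []
    for seed in seeds:
        rng = random.Random(seed)
        scores = [0.0] * len(data)
        for fold in k_fold_indices(len(data), k, seed):
            held_out = set(fold)
            train = [data[i] for i in range(len(data)) if i not in held_out]
            # Oversampling sees only the training part; held-out cases stay untouched.
            model = fit_midpoint_model(oversample_minority(train, rng))
            for i in fold:
                scores[i] = model(data[i][0])
        runs.append(scores)
    return [mean(per_case) for per_case in zip(*runs)]


# Synthetic, imbalanced toy data: 15 benign cases followed by 5 malignant cases.
data: List[Row] = [(float(i), 0) for i in range(15)] + [(float(i), 1) for i in range(20, 25)]
averaged_scores = repeated_cv_scores(data)
balanced = oversample_minority(list(data), random.Random(0))
```

If the oversampling step were applied before the folds were formed, copies (or, with SMOTE, interpolations) of a held-out case could end up in its own training data and inflate the validation metrics. Applying it inside the loop avoids that.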

Discussion

Improving the diagnostic efficacy of abdominal or transvaginal ultrasound in the context of ovarian cancer is of substantial clinical importance, given its essential role in the multi-modal preoperative diagnostic process [ 24 , 25 ]. In this study, a new model was created and tested using transvaginal ultrasound, along with clinical factors, O-RADS scores, and DL radiomics features, to evaluate its accuracy in diagnosing ovarian cancer before surgery. Our results indicate that the ensemble model shows better diagnostic accuracy than the clinical and clinical O-RADS models for detecting ovarian cancer, and significantly improves the diagnostic proficiency of sonographers. Additionally, the ensemble model exhibited AUCs of 0.96 and 0.97 in two distinct external validation subgroups (normal CA125 subgroup and tumor diameter <10 cm subgroup), underscoring its promising utility in preoperative ovarian cancer diagnosis.

Recently, the field of radiomics has seen exponential growth, showing great potential in tumor diagnosis [ 26 ]. Deep learning, specifically deep CNNs, can automatically learn and analyze complex features for prediction, and has emerged as a recent development in radiomics [ 27 ]. Xiang et al. [ 28 ] designed a deep learning model for differentiating ovarian tumors that achieved expert-level performance and significantly improved the diagnostic efficiency of sonographers. However, that study used retrospective data from two prominent oncology hospitals, wherein over half of the training set comprised malignant lesions, potentially introducing significant bias and constraining the model's generalizability. Moreover, there is insufficient conclusive proof to indicate that deep learning is inherently superior to radiomics. Thus, we included data from two comprehensive hospitals and developed three models (radiomics model, deep learning model, and DL radiomics model) based on LASSO regression.
Our results show that the DL radiomics model outperformed both standalone deep learning and radiomics models. This suggests that the integration of deep learning and radiomics can capture complementary tumor information, thereby enhancing the overall performance of the ensemble model [ 29 ]. Furthermore, a prior study conducted at a single center corroborated our results, showing that an ultrasound-based radiomics nomogram using deep learning successfully distinguished between benign and malignant ovarian tumors. Our study aimed to enhance robustness through standardized feature selection, model integration, and repeated random shuffling to reduce bias. We also validated the model using an independent cohort from a different institution.

We found that combining CA125, lesion diameter, menstrual status, and patient age provided valuable diagnostic information for ovarian tumors [ 30 ]. However, sensitivity and specificity were insufficient, and performance varied across different datasets. Sequentially adding O-RADS and DL radiomics model predictions significantly improved the diagnostic capability for ovarian cancer (NRI and IDI both showed positive improvements, p < 0.05). Notably, we used the widely applied O-RADS scoring system for sonographers’ evaluations to ensure accuracy and reproducibility [ 31 ]. Previous studies indicate that O-RADS is successful in categorizing the likelihood of malignancy in adnexal masses, achieving an AUC between 0.90 and 0.94 [ 31 , 32 ]. In our study, sonographers achieved good performance, with average AUCs of 0.84 and 0.89 in internal and external test datasets, respectively. The ensemble model greatly improved diagnostic accuracy, with an average 11 % increase in AUCs and a 14.4 % reduction in misdiagnosis rates in the internal validation group. Similarly, in the external validation set, average AUCs increased by 7.7 % and misdiagnosis rates decreased by 12.7 %.
Furthermore, diagnostic consistency among sonographers showed a notable improvement.

The ensemble model also exhibited strong diagnostic performance in subgroups with normal CA125 levels and tumor diameters <10 cm, which are often considered challenging scenarios. This result aligns with several previous studies reporting that multi-modality approaches, especially those leveraging imaging-based features, can capture subtle morphological and textural cues even when conventional clinical markers are unremarkable [ 33 , 34 ]. Such synergy between radiological and clinical data may explain why our model maintains robust performance in these atypical presentations. Nevertheless, whether this performance remains consistent across different demographic factors (e.g., patient age, ethnic background) was not fully investigated here. Additional stratified analyses in diverse populations could help clarify the underlying mechanisms and validate the model's generalizability.

Our research also sought to improve the interpretability of the ensemble model. By utilizing SHAP methods, we were able to elucidate the outputs of the model, enabling clinicians to gain a deeper understanding and effectively utilize the prediction results [ 35 ]. Shapley values at the local and global levels provided insight into the relative contribution of each factor within the ensemble model, both for individual patients and for the cohort as a whole [ 36 ]. Our Shapley analysis indicates that clinical variables, such as CA125, patient age, lesion diameter, and menstrual status, show comparatively lower importance in model decision-making. We attribute this to two main factors. First, imaging-based predictors, including deep learning radiomics (DL radiomics) features and O-RADS scores, provide robust discriminatory power for differentiating benign from malignant ovarian tumors, thereby dominating the ensemble model.
Second, benign and malignant lesions can overlap in certain clinical parameters, such as lesion diameter, and the heterogeneous distribution of multi-center data may further diminish the statistical weight of these features. Moreover, previous studies [ 30 , 37 , 38 ] have reported that CA125, despite its clinical significance, demonstrates relatively low sensitivity and specificity as a standalone diagnostic marker. Its levels may be elevated due to various physiological or pathological factors, including menstruation, pregnancy, endometriosis, and peritoneal inflammatory diseases. However, when integrated with imaging data, the diagnostic performance significantly improves, highlighting the complementary role of multi-modal approaches in enhancing accuracy. It is important to note that we performed checks for multicollinearity (e.g., VIF) before model building and did not exclude clinical variables during cross-validation, ensuring that they were not improperly suppressed. Hence, the lower Shapley values reflect the strong predictive capacity of imaging features within this dataset, rather than an outright dismissal of clinical variables. Notably, we observed that junior sonographers exhibited the largest improvement in diagnostic performance when using our ensemble model, consistent with previous studies demonstrating that AI assistance can bridge the performance gap between less experienced and senior practitioners [ 39 ]. Recent evidence also highlights the potential of AI-integrated tools in enabling novice users to achieve diagnostic accuracy comparable to experts, particularly in ultrasound-based assessments [ 40 ]. While our findings suggest the potential of AI to enhance diagnostic proficiency, tailored training interventions, such as short orientation sessions, structured tutorials, or user-friendly interfaces, could further assist junior sonographers in fully leveraging the model's capabilities [ 41 ]. 
Future studies should explore the feasibility, effectiveness, and cost-efficiency of these strategies in diverse clinical settings. Despite the encouraging results, our study has some limitations. First, although we used two independent patient cohorts for validation, these datasets came from specific medical institutions and may not fully represent broader patient populations. Greater efforts to harmonize imaging protocols and incorporate data from multiple centers could further refine the model's performance and strengthen its generalizability. Second, exploring the integration of multi-modal imaging, such as combining CT and MRI data, might enhance diagnostic accuracy and provide a more comprehensive assessment. Ongoing advances in computational power and the potential inclusion of multiomics data may also allow more personalized risk stratification in complex cases like high-grade serous ovarian cancer. Third, we have not yet conducted a formal cost-effectiveness analysis to compare our ensemble model with existing diagnostic workflows. Collecting detailed data on resource utilization and clinical outcomes in different healthcare settings would help determine the economic feasibility of widespread implementation. Finally, our model requires further optimization for practical clinical application, including real-time DL radiomics solutions that are both user-friendly and integrable into routine practice. Addressing these aspects in future studies could illuminate how best to scale up the technology and ensure that both experienced and less experienced sonographers can harness its full potential. In conclusion, the ensemble multi-modal model offers a more efficient, cost-effective, and interpretable diagnosis of ovarian cancer, significantly enhancing the diagnostic capabilities and consistency of sonographers.

Introduction

Ovarian tumors exhibit a wide spectrum of biological behaviors, ranging from benign and borderline lesions to malignant growths with diverse morphological features [ 1 , 2 ]. Among female malignancies, ovarian cancer is particularly lethal, accounting for 324,398 new diagnoses and 206,839 deaths worldwide in 2022 [ 3 ]. Most patients are diagnosed at an advanced stage, resulting in a 5-year survival rate of only 20–40 % [ 4 ]. Nevertheless, early detection and timely intervention can significantly raise this rate to 80–95 % [ 5 , 6 ]. Moreover, an estimated 2 million women undergo exploratory surgery each year for suspicious adnexal masses, with approximately 1.7 million found to be benign [ 7 ]. Notably, such benign lesions often require less invasive management and can be treated with simple surgical removal or close observation, aiming to preserve ovarian function and fertility [ 8 ]. These findings underscore the critical necessity for the advancement of more precise non-invasive diagnostic techniques.

Transvaginal ultrasound (TVUS) is currently the first-line imaging modality for evaluating adnexal masses due to its lack of contraindications, clarity of images, and cost-effectiveness, outperforming CT or MRI in many routine clinical scenarios. Nonetheless, variability in sonographers' interpretations may result in inconsistent findings [ 9 , 10 ]. To address these limitations, the O-RADS scoring system provides a standardized lexicon and risk stratification approach, improving both diagnostic accuracy and reproducibility [ 11 ].

As cancer treatment paradigms evolve with advances in immunotherapy, targeted therapies, and precision medicine, accurate lesion characterization and outcome prediction have become pivotal for tailoring individualized treatment strategies [ 12 , 13 , 14 , 15 ].
Nonetheless, a subset of patients may exhibit limited response or encounter substantial toxicity, revealing the need for more effective diagnostic and prognostic tools [ 16 ]. Radiomics, which allows for high-throughput extraction of imaging features, combined with deep learning (DL) for automated feature learning, offers a new avenue for early detection and risk stratification in ovarian cancer [ 17 , 18 ]. Recent studies have explored artificial intelligence to improve clinical decision-making; for instance, Jan et al. [ 19 ] utilized CT images to build a model comparable to experienced radiologists for tumor differentiation, while Du et al. [ 20 ] developed a deep learning radiomics nomogram from ultrasound data, achieving predictive performance similar to the O-RADS system. However, these approaches often lack external validation, limiting their generalizability to broader populations. In addition, although the O-RADS system has shown promise in clinical practice, it has yet to be fully integrated with deep learning radiomics and clinical data within a unified framework; most current work merely compares these methods [ 19 , 20 , 21 ]. Furthermore, few studies have investigated a holistic model that integrates clinical factors, O-RADS scores, and deep learning radiomics to precisely stratify the risk of ovarian cancer.

Therefore, this study aims to develop a comprehensive model that leverages clinical variables, O-RADS scoring, and deep learning radiomics to improve the preoperative diagnosis of ovarian cancer. Moreover, we will evaluate the impact of this integrated model on enhancing sonographers' diagnostic proficiency and consistency.

Conflict of interest statement

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
