Results
We enrolled a total of 183 patients with 37 (20.22%) patients in the early stage and 146 (79.78%) patients in the advanced stage. The baseline features for all cohorts are detailed in Table 1. More baseline features for all cohorts are detailed in Tables S1–S6.

Table 1 Clinical characteristics of patients in early and advanced stage

| Characteristics | Early stage | Advanced stage | P value # |
|---|---|---|---|
| BMI | 24.14 (21.28–25.50) (31) | 22.86 (20.81–25.16) (120) | 0.2294 |
| Weight (kg) | 60.00 (55.00–67.75) (31) | 57.00 (51.00–65.00) (120) | 0.1439 |
| Height (m) | 1.60 (1.56–1.62) (31) | 1.60 (1.55–1.61) (121) | 0.7034 |
| Age at diagnosis (year) | 53.00 (48.0–58.0) (37) | 62.00 (54.0–67.0) (146) | 0.0002 |
| CA125 | 117.05 (53.78–413.25) (36) | 661.50 (206.2–1737.8) (144) | < 0.0010 |
| HE4 | 114.50 (67.6–276.65) (32) | 404.00 (179.0–795.0) (117) | < 0.0010 |
| AFP | 2.92 (1.79–3.60) (36) | 2.55 (1.87–3.67) (141) | 0.8749 |
| CEA | 2.13 (1.06–3.86) (36) | 1.14 (0.77–1.76) (142) | 0.0003 |
| CA199 | 17.00 (8.71–173.5) (35) | 10.59 (5.77–20.72) (142) | 0.0074 |
| A/G | 1.53 (1.29–1.65) (35) | 1.28 (1.09–1.52) (145) | 0.0260 |
| TA | 72.10 (67.65–75.5) (35) | 69.40 (63.1–73.7) (145) | 0.0308 |
| ALB | 42.20 (37.98–45.25) (36) | 38.80 (35.0–41.7) (145) | 0.0038 |
| GLB | 28.40 (26.25–34.5) (35) | 29.40 (25.5–33.2) (145) | 0.8583 |
| PAB | 201.60 (149.9–233.1) (35) | 146.80 (113.0–191.4) (145) | 0.0011 |
| ALT | 13.00 (10.0–16.25) (36) | 13.00 (9.0–19.0) (145) | 0.6510 |
| AST | 16.00 (13.0–25.25) (36) | 21.00 (16.0–29.0) (145) | 0.0305 |
| GGT | 19.55 (13.0–28.5) (36) | 18.70 (12.2–33.0) (145) | 0.6332 |
| ALP | 78.00 (65.75–93.5) (36) | 76.00 (63.0–93.0) (145) | 0.6484 |
| LDH | 206.00 (173.5–260.5) (35) | 223.00 (176.0–283.0) (145) | 0.6639 |
| AFU | 19.70 (13.75–21.9) (35) | 15.80 (12.9–18.8) (145) | 0.0768 |
| BUN | 4.72 (3.98–5.43) (36) | 4.40 (3.5–5.48) (145) | 0.2358 |
| Cr | 55.00 (51.68–62.0) (36) | 57.00 (49.0–67.0) (145) | 0.5467 |
| UA | 272.00 (240.5–314.5) (35) | 274.00 (240.0–332.0) (145) | 0.8579 |
| PT | 11.20 (10.8–11.7) (36) | 11.50 (10.9–12.1) (145) | 0.1839 |
| INR | 0.98 (0.94–1.03) (36) | 1.00 (0.95–1.06) (145) | 0.1783 |
| FIB | 3.25 (2.79–4.41) (36) | 4.12 (3.24–4.78) (145) | 0.0649 |
| APTT | 27.85 (25.88–28.6) (36) | 27.80 (25.8–29.5) (145) | 0.7276 |
| TT | 17.05 (16.08–18.03) (36) | 16.80 (15.5–17.6) (145) | 0.0464 |
| WBC | 6.77 (5.68–7.91) (35) | 6.48 (5.28–7.78) (145) | 0.5476 |
| NEUT% | 69.80 (64.35–76.15) (35) | 72.60 (66.1–77.5) (145) | 0.4296 |
| LYMPH% | 21.20 (15.4–28.35) (35) | 18.50 (13.3–24.4) (145) | 0.1459 |
| MONO% | 6.00 (5.55–7.40) (35) | 7.30 (6.0–8.8) (145) | 0.0017 |
| EO% | 0.90 (0.45–1.80) (35) | 0.90 (0.4–1.7) (145) | 0.9587 |
| BASO% | 0.40 (0.2–0.5) (35) | 0.30 (0.2–0.5) (145) | 0.6613 |
| NEUT | 4.69 (4.04–5.62) (35) | 4.71 (3.37–5.8) (145) | 0.5129 |
| LYMPH | 1.46 (1.03–1.96) (35) | 1.23 (0.92–1.48) (145) | 0.0363 |
| MONO | 0.43 (0.34–0.13) (35) | 0.47 (0.38–0.63) (145) | 0.1483 |
| EO | 0.06 (0.035–0.13) (35) | 0.06 (0.03–0.11) (145) | 0.5243 |
| BASO | 0.02 (0.01–0.035) (35) | 0.02 (0.01–0.04) (145) | 0.9692 |
| RBC | 4.34 (3.99–4.50) (35) | 4.16 (3.82–4.46) (145) | 0.2567 |
| Hb | 123.00 (112.5–133.5) (35) | 120.00 (111.0–130.0) (145) | 0.4441 |
| HCT | 0.40 (0.36–37.35) (35) | 0.39 (0.35–0.43) (145) | 0.1887 |
| MCV | 90.40 (86.20–92.9) (35) | 89.30 (86.4–92.9) (145) | 0.7713 |
| MCH | 29.60 (27.75–30.45) (35) | 29.10 (27.8–30.1) (145) | 0.6408 |
| MCHC | 326.00 (319.5–332.5) (35) | 324.00 (319.0–332.0) (145) | 0.6180 |
| RDW-CV | 12.60 (12.0–13.5) (35) | 12.70 (12.2–13.4) (145) | 0.6971 |
| RDW-SD | 42.35 (40.4–44.03) (30) | 41.60 (39.8–42.8) (137) | 0.2585 |
| PLT | 274.00 (202.0–332.0) (35) | 310.00 (239.0–389.0) (145) | 0.0494 |
| PCT | 2.85 (2.3–3.5) (35) | 3.30 (2.5–4.0) (145) | 0.0583 |
| MPV | 10.20 (9.8–10.8) (35) | 10.40 (9.7–11.0) (145) | 0.7968 |
| PDW | 11.70 (10.6–13.6) (29) | 11.50 (10.5–12.83) (29) | 0.7433 |
| P-LCR | 26.45 (22.275–32.95) (30) | 26.70 (21.65–32.25) (135) | 0.7839 |
| SII | 531.90 (202.5–1241.80) (34) | 907.64 (431.6–1866.02) (145) | 0.0126 |
| NLR | 3.20 (2.25–5.02) (34) | 3.96 (2.64–5.66) (145) | 0.2158 |
| PLR | 121.62 (3.91–201.5) (34) | 208.47 (96.64–344.08) (145) | 0.0011 |
| Histological type | | | |
| Serous carcinoma | 19 | 126 | |
| Mucinous carcinoma | 4 | 7 | |
| Endometrioid carcinoma | 8 | 8 | |
| Clear cell carcinoma | 6 | 5 | |

# Brunner-Munzel test
In this study, we acquired 120 PET metabolic features and extracted the same set of radiomics features from both the PET scans (1051 features) and the CT scans (1051 features). More details are provided in the Supplemental tables.
We encountered missing data for certain clinical features, as depicted in Fig. 1C. To address this, we used the MICE (multiple imputation by chained equations) algorithm to fill in the missing data before building the model.
In our analysis, we selected 18 clinical features, 9 PET metabolic features, 37 PET radiomics features, and 73 CT radiomics features for subsequent modeling and analysis. Specifically, the statistical distributions, illustrated using box plots, are displayed in Fig. 1B. Detailed statistical outcomes for the entire analysis can be found in Tables S1–S6. In particular, the statistical results for the imputed clinical features are provided in Table S6.
The average ROC curves derived from the training dataset are depicted in Fig. 3A. The experimental results for the 11 prediction models are listed in Table 2. The confidence intervals for each key performance metric were calculated using the bootstrap method with 1000 resamples (Efthimiou et al. 2024; Kang et al. 2024).

Fig. 3 ROC curves for different models. A Average ROC curves for 11 prediction models in the training dataset. B ROC curves for 11 prediction models in the validation dataset (183 cases). C ROC curves for different data types in the validation dataset (183 cases)

Table 2 Model performance evaluation results for different data sources

| Data type | AUC | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| Clinical | 0.771 (0.678, 0.848) | 0.792 (0.743, 0.842) | 0.851 (0.820, 0.882) | 0.897 (0.849, 0.945) | 0.873 (0.841, 0.903) |
| SUV | 0.697 (0.601, 0.776) | 0.781 (0.760, 0.798) | 0.794 (0.790, 0.798) | 0.979 (0.952, 1.000) | 0.877 (0.863, 0.888) |
| CT | 0.752 (0.654, 0.839) | 0.814 (0.765, 0.863) | 0.854 (0.824, 0.890) | 0.925 (0.877, 0.966) | 0.888 (0.858, 0.917) |
| PET | 0.710 (0.608, 0.804) | 0.825 (0.787, 0.863) | 0.839 (0.814, 0.870) | 0.966 (0.932, 0.993) | 0.898 (0.875, 0.920) |
| Clinical + SUV | 0.793 (0.705, 0.869) | 0.798 (0.743, 0.842) | 0.852 (0.821, 0.883) | 0.904 (0.856, 0.945) | 0.877 (0.845, 0.904) |
| Clinical + PET | 0.813 (0.726, 0.890) | 0.803 (0.749, 0.847) | 0.853 (0.820, 0.886) | 0.911 (0.863, 0.952) | 0.881 (0.848, 0.908) |
| Clinical + CT | 0.803 (0.708, 0.883) | 0.820 (0.770, 0.863) | 0.855 (0.824, 0.890) | 0.932 (0.890, 0.973) | 0.892 (0.862, 0.919) |
| Clinical + SUV + CT + PET | 0.819 (0.730, 0.896) | 0.825 (0.776, 0.874) | 0.870 (0.838, 0.906) | 0.918 (0.870, 0.959) | 0.893 (0.861, 0.923) |

The confidence intervals are calculated based on the bootstrap method with a 95% confidence level
The bold fonts highlighted the best results in each experiment
The underlined values indicated the second-ranking results
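The bootstrap confidence intervals in Table 2 can be illustrated with a minimal Python sketch. The source states only that 1000 resamples and a 95% level were used, so the percentile rule, the resampling of whole cases with replacement, the seed, and the names `bootstrap_ci` and `accuracy` are illustrative assumptions rather than the authors' implementation.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` (assumed variant).

    Cases are resampled with replacement, the metric is recomputed on each
    resample, and the alpha/2 and 1 - alpha/2 empirical quantiles are returned.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower_idx = max(0, int((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lower_idx], stats[upper_idx]
```

The same function can be reused for precision, recall, or F1 by passing a different `metric` callable; a metric such as AUC would need predicted probabilities instead of labels.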
In Fig. 3B, the AUCs of the SVM, NN, ElasticNet, and adaptive ensemble models were higher than 0.80. Notably, our adaptive ensemble prediction model outperformed the others, displaying superior classification results across all four data types. Among the first four rows of Table 2 (Clinical, SUV, CT, and PET), the clinical data presented the best performance, with an AUC of 0.771. To assess the effectiveness of the new features extracted from radiomics, additional experiments were conducted; their results are presented in the latter four rows of Table 2. Compared with the clinical data alone, adding a new data type (SUV, CT, or PET) improved prediction performance, with AUCs of 0.793, 0.803, and 0.813, respectively. In particular, adding PET features raised the AUC by 5.4% relative to the clinical data alone (from 0.771 to 0.813), a larger gain than adding CT or SUV features. The fusion of all four data types yielded the best prediction performance in terms of AUC (0.819). Notably, employing only CT data also demonstrated robust prediction performance in terms of accuracy, precision, recall, and F1 score.
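One way to read the 5.4% figure is as the relative AUC gain of Clinical + PET over the clinical data alone. The short sketch below recomputes that gain for each combination from the AUCs in Table 2; the relative-gain interpretation and the helper name `relative_gain` are assumptions, since the source does not state how the percentage was derived.

```python
# AUC values copied from Table 2.
auc = {
    "Clinical": 0.771,
    "Clinical + SUV": 0.793,
    "Clinical + CT": 0.803,
    "Clinical + PET": 0.813,
    "Clinical + SUV + CT + PET": 0.819,
}


def relative_gain(new_value, baseline):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new_value - baseline) / baseline


baseline_auc = auc["Clinical"]
gains = {
    name: round(relative_gain(value, baseline_auc), 1)
    for name, value in auc.items()
    if name != "Clinical"
}

for name, gain in gains.items():
    print(f"{name}: +{gain}% AUC relative to clinical data alone")
```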
In this study, we screened and obtained 137 features from the four groups of data. Figure 4 displays the top 40 of the 137 features, ranked by feature importance in the LGBM algorithm. The top 40 comprised 28 PET/CT radiomics features, 2 PET metabolic features, and 10 clinical features.

Fig. 4 The feature importance scores in four prediction models
The average ROC curves in the training dataset are depicted in Fig. 5A. It is important to note that, under this extreme scenario, no OCCC or MCOC cases were seen in the training dataset. Nevertheless, as indicated in Fig. 5B and C, the average and best AUCs in the test dataset (39 cases, including 22 cases of OCCC and MCOC) were 0.764 and 0.808, respectively. The accuracy, precision, recall, and F1 score of our adaptive ensemble prediction model were 0.769, 0.828, 0.857, and 0.842, respectively, for the validation dataset (183 cases). Figure 5C shows that our proposed model demonstrated the best prediction performance among the machine learning algorithms compared.

Fig. 5 ROC curves for different methods. A Average ROC curves for 11 models in the training dataset. B ROC curves for 11 models in the validation dataset (183 cases). C ROC curves for 11 models in the validation dataset (6th fold) (the best model based on the Adaptive Ensemble model)
Materials
We retrospectively collected the data of OC patients who underwent PET/CT scans between January 2012 and March 2021 at Renji Hospital, affiliated with Shanghai Jiao Tong University School of Medicine (hereafter Renji Hospital). This retrospective study was approved by the ethics committee, and informed consent was waived. The inclusion criterion was a diagnosis of ovarian cancer confirmed by histopathologic examination. The exclusion criteria were as follows: (1) patients without a complete FIGO stage; (2) patients with poor PET/CT image quality; (3) patients who had received anti-tumor treatment before the PET/CT examination. The clinical features and blood test indicators were collected. We also collected the systemic immune-inflammation index (SII), neutrophil to lymphocyte ratio (NLR), and platelet to lymphocyte ratio (PLR), which are inflammation features calculated from lymphocyte, neutrophil, and platelet counts.
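The three inflammation indices can be computed directly from blood counts. The sketch below uses the commonly cited definitions (NLR = neutrophils / lymphocytes, PLR = platelets / lymphocytes, SII = platelets × neutrophils / lymphocytes); the source does not spell these formulas out, so the definitions, the function name `inflammation_indices`, and the assumption that all counts share the same unit (for example 10^9/L) are illustrative.

```python
def inflammation_indices(neutrophils, lymphocytes, platelets):
    """Return NLR, PLR, and SII from absolute blood counts.

    All three counts are assumed to be in the same unit (e.g. 10^9 cells/L).
    """
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return {
        "NLR": neutrophils / lymphocytes,
        "PLR": platelets / lymphocytes,
        "SII": platelets * neutrophils / lymphocytes,
    }


# Hypothetical patient: neutrophils 4.0, lymphocytes 2.0, platelets 300 (10^9/L).
example = inflammation_indices(neutrophils=4.0, lymphocytes=2.0, platelets=300.0)
```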
All examinations were conducted before any treatment. Whole-body PET/CT imaging was performed at the Nuclear Medicine Department of Renji Hospital using a Biograph 64 PET/CT system (Siemens Medical Systems). Every patient fasted for about 6–8 h, and the blood glucose level was confirmed to be lower than 140 mg/dl (7.7 mmol/L). All patients received an intravenous injection of 3.7–5.5 MBq/kg 18F-FDG (radiochemical purity > 95%) 60 min before imaging. The CT scan was conducted with a tube voltage of 120 kV, a tube current of 140 mA, and a section thickness of 5 mm. The PET scan was performed after the CT scan, in 5–6 bed positions (3 min per bed position). PET and CT images were acquired for all patients. After the scan, the body CT was used for attenuation correction to reconstruct the PET images. Complete fusion images were obtained on the Siemens post-processing workstation. Experienced nuclear medicine physicians evaluated the whole-body images and issued a formal report for every patient.
Four researchers who had received professional training in reading PET/CT images manually delineated the volume of interest (VOI) using the 3D-Slicer software ( http://www.slicer.org ). The researchers delineated the primary lesions, suspected peritoneal metastases, and lymph node metastases individually. In addition, the pretreatment PET features included petN, petP, and petM, which were defined according to the PET images and the official PET/CT reports. petN1 denoted suspicious lymph node metastasis, petP1 denoted suspicious peritoneal metastasis, and petM1 denoted suspicious distant visceral metastasis outside the abdominal cavity.
The overall pipeline for building the prediction model is described in Fig. 1A. The whole process comprised three steps: (1) feature collection; (2) feature selection; and (3) prediction model building.

Fig. 1 The whole pipeline for building the prediction model. A The overall pipeline for modeling early stage and advanced stage. B The PET metabolic features selected distribution. C The distribution of missing data for clinical features
We calculated the PET metabolic features according to the specific VOI. For the PET/CT radiomics features, we used the radiomics extraction function of the Pyradiomics package (version 3.0.1) to extract 1051 features from the CT data and 1051 features from the PET data. The feature names and their corresponding abbreviations are provided in Table S7.
Different metabolic features were calculated for distinct VOIs, such as SUV40_to, TLG40_to, MTV40_to, and others. SUV was calculated as (decay-corrected activity [kBq] per milliliter of tissue volume)/(injected 18F-FDG activity [kBq] per gram of body mass). SUV, MTV, and TLG were calculated as follows (Lee et al. 2014; Kim et al. 1994):

$$\text{SUV} = \frac{H_{\text{special volume}}\,(\text{Bq}/\text{g})}{H_{\text{total}} / \text{BMI}\,(\text{kg}/\text{m}^{2})} / k \tag{1}$$

$$\text{MTV} = N_{\text{pixel}} \times d \tag{2}$$

$$\text{TLG} = \text{SUV} \times \text{MTV} \tag{3}$$

$$k = \exp(t \times \ln 2 / T_{1/2}) \tag{4}$$

where $H_{\text{special volume}}$, $H_{\text{total}}$, and BMI denote the radioactivity in the VOI, the total radioactivity, and the body mass index, respectively. $N_{\text{pixel}}$ denotes the number of pixels in the specified VOI, and $d$ denotes the physical volume per voxel. $T_{1/2}$ (default: 6588.0 s) denotes the half-life of the nuclide, and $t$ is the specific time interval. Both $T_{1/2}$ and $t$ can be extracted directly from the patient's DICOM file. We calculated different metabolic variables according to the indicator used to determine the region of interest (i.e., SUV40_to, TLG40_to, MTV40_to, and so on). Specifically, each feature was calculated by setting a threshold value for SUV (i.e., 40% SUVmax, 50% SUVmax, 2.5, or 3.5). The statistical results are listed in Table S1.
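The decay factor, MTV, and TLG formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the input is taken to be a flat list of per-voxel SUV values inside a VOI, the threshold is a fraction of SUVmax (the 40% case), TLG uses the mean SUV of the thresholded voxels, and the names `decay_factor` and `metabolic_features` are hypothetical.

```python
import math

# Default half-life quoted in the text, assumed to be in seconds.
F18_HALF_LIFE_S = 6588.0


def decay_factor(t_seconds, half_life_seconds=F18_HALF_LIFE_S):
    """Decay factor k = exp(t * ln2 / T_half)."""
    return math.exp(t_seconds * math.log(2) / half_life_seconds)


def metabolic_features(voxel_suv, voxel_volume_ml, fraction_of_max=0.4):
    """Compute SUVmax, SUVmean, MTV, and TLG for one VOI.

    voxel_suv:       SUV value of every voxel inside the VOI
    voxel_volume_ml: physical volume of a single voxel (d in the text)
    fraction_of_max: relative threshold, e.g. 0.4 for 40% of SUVmax
    """
    suv_max = max(voxel_suv)
    threshold = fraction_of_max * suv_max
    selected = [v for v in voxel_suv if v >= threshold]
    mtv = len(selected) * voxel_volume_ml    # MTV = N_pixel * d
    suv_mean = sum(selected) / len(selected)
    tlg = suv_mean * mtv                     # TLG = SUV * MTV (SUVmean assumed)
    return {"SUVmax": suv_max, "SUVmean": suv_mean, "MTV": mtv, "TLG": tlg}


# Hypothetical VOI with four voxels of 0.5 mL each.
features = metabolic_features([1.0, 2.0, 5.0, 10.0], voxel_volume_ml=0.5)
```

A fixed threshold such as SUV 2.5 or 3.5 would replace the `fraction_of_max * suv_max` line with the absolute cutoff.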
To investigate the different roles of metabolic and other features in staging prediction, in this article, PET/CT radiomics features refer to radiomics features other than metabolic features.
In this work, we extracted four types of features from three modalities (clinical data, CT, and PET). Three statistical methods were employed to analyze and select continuous and discrete variables. Specifically, the Brunner-Munzel test was used to analyze the continuous features because they violated the normal distribution assumption (Brunner and Munzel 2000). The Fisher exact test was applied to the discrete features. In addition, Pearson correlation was used to analyze the correlations between features. As a result, we obtained 18 clinical variables, 9 PET metabolic variables, 37 PET radiomics variables, and 73 CT radiomics variables.
We divided the enrolled patients into two cohorts according to their FIGO staging results: the early-stage cohort (FIGO I and II) and the advanced-stage cohort (FIGO III and IV). Before building the prediction model, we randomly divided the total dataset into a training dataset and a test dataset, and we used the test dataset to validate the effectiveness of the model. We performed tenfold cross-validation (CV) for model training and validation. The prediction results from all ten folds were stacked to evaluate the model prediction performance.
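The idea of stacking the predictions from all ten folds can be sketched as follows. This is a minimal illustration, assuming a plain random split without stratification; `fit` and `predict` are hypothetical callables standing in for any of the models, and the prevalence "model" is only a placeholder so that the sketch runs on its own.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def stacked_out_of_fold(X, y, fit, predict, n_folds=10, seed=0):
    """Train on nine folds, predict the held-out fold, and stack all predictions.

    Every sample receives exactly one prediction, made by a model that never
    saw that sample during training.
    """
    stacked = [None] * len(y)
    for held_out in kfold_indices(len(y), n_folds, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(y)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            stacked[i] = predict(model, X[i])
    return stacked


def fit_prevalence(X_train, y_train):
    """Placeholder model: remembers the positive-class rate of the training folds."""
    return sum(y_train) / len(y_train)


def predict_prevalence(model, x):
    """Placeholder prediction: returns the remembered rate for every sample."""
    return model
```

Metrics such as AUC or accuracy would then be computed once on the stacked predictions rather than averaged per fold.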
To demonstrate the stage prediction performance in OCCC and MCOC, we redesigned the experiments for splitting the training, validation, and test datasets. As before, we employed tenfold CV. Unlike in the experiment above, the 22 OCCC and MCOC cases (10 advanced-stage and 12 early-stage samples) were assigned to the test dataset manually. To retain the same proportions across the three datasets, we randomly assigned a further 17 cases to the test dataset. The detailed process of splitting the dataset is shown in Fig. 2.

Fig. 2 Data collection and splitting dataset pipeline
We built the prediction models with ten classic machine learning algorithms, namely RF, ElasticNet, SVM, LR, GBM, NN, DT, XGBoost, LGBM, and AutoML (Erickson et al. 2020), and with an adaptive ensemble model. The former ten methods were implemented with the corresponding Python packages. To demonstrate the effectiveness of the latter three types of data, we conducted eight groups of experiments.
We designed the adaptive ensemble model to combine the individual advantages of the former methods and improve performance. Specifically, there were three steps. First, we defined a “bad case” as a sample that was difficult to distinguish: if the prediction probability for a sample fell between 0.4 and 0.6, we defined the sample as a bad case for that prediction model. Second, we used a voting mechanism, counting the votes for the positive and negative classes to determine the prediction category. Finally, we took the maximum (or minimum) prediction probability within the winning prediction category as the final result.
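The three steps can be read in more than one way, so the sketch below is only one possible interpretation, not the authors' implementation. It assumes that base-model probabilities inside [0.4, 0.6] are excluded from the vote, that the majority of the remaining votes sets the class, that the final probability is the maximum among positive voters (or the minimum among negative voters), and that ties or all-uncertain samples fall back to the mean probability. The function name `adaptive_ensemble` and the fallback rule are hypothetical.

```python
def adaptive_ensemble(probs, low=0.4, high=0.6, threshold=0.5):
    """Combine the positive-class probabilities of several base models.

    Returns (predicted_label, final_probability) for a single sample.
    """
    # Step 1: drop "bad case" predictions that fall inside the uncertain band.
    confident = [p for p in probs if not (low <= p <= high)]
    if not confident:
        # Assumed fallback: every model is uncertain, so average them all.
        mean_p = sum(probs) / len(probs)
        return int(mean_p >= threshold), mean_p

    # Step 2: majority vote among the confident models.
    positive = [p for p in confident if p >= threshold]
    negative = [p for p in confident if p < threshold]

    # Step 3: take the extreme probability within the winning category.
    if len(positive) > len(negative):
        return 1, max(positive)
    if len(negative) > len(positive):
        return 0, min(negative)

    # Assumed fallback for a tied vote: average the confident models.
    mean_p = sum(confident) / len(confident)
    return int(mean_p >= threshold), mean_p
```

For example, with base-model probabilities 0.9, 0.7, 0.45, and 0.2, the 0.45 prediction is ignored, two of the three remaining models vote positive, and the sample is labeled positive with probability 0.9.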
Our method used PyTorch (RRID: SCR_018110) to accelerate the modeling process, and all experiments were conducted on an NVIDIA GeForce RTX 3090 GPU. The implementation was written in Python 3.7.12 (RRID: SCR_008394). Pyradiomics 3.0.1 (RRID: SCR_017489) was employed to extract the radiomics features, and miceforest 5.6.3 was used to fill in the missing data. We also performed extensive experiments to verify the effectiveness of each model for different data sources. The evaluation metrics included AUC (area under the ROC curve), accuracy, precision, recall, and F1 score (Saito and Rehmsmeier 2015):

$$\text{accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}},$$

$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}},$$

$$\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}},$$

$$\text{F1 score} = \frac{2\,(\text{precision} \times \text{recall})}{\text{precision} + \text{recall}},$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
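As a worked illustration, the four threshold-based metrics can be computed from confusion-matrix counts. The sketch uses the standard definitions, with accuracy taken as (TP + TN) divided by the total number of cases; the function name `classification_metrics` and the zero-division fallbacks are illustrative choices.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Fall back to 0.0 when a denominator is empty (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 6 true positives, 2 true negatives, 2 false positives, 2 false negatives.
metrics = classification_metrics(tp=6, tn=2, fp=2, fn=2)
```

With an imbalanced cohort such as 146 advanced-stage versus 37 early-stage patients, recall and precision are computed for the positive class only, which is why they can stay high even when the minority class is predicted poorly.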
To assess the significance of different features, we determined the feature importances or coefficients of the different models and ranked all features by the feature importance scores of LGBM. Specifically, we first took the absolute value (modulus) of each feature's importance before normalization. Next, we averaged the feature importances over the tenfold CV. Finally, we derived the relative importance score of each feature by dividing by the maximum value.
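The ranking procedure can be sketched as below, assuming that the first step takes the absolute value of each importance or coefficient and that normalization consists only of the final division by the maximum. The function name `relative_importance` and the toy numbers are hypothetical.

```python
def relative_importance(fold_importances):
    """Average absolute importances over CV folds and scale so the top feature is 1.

    fold_importances: one list per fold, each with one raw importance
    (or coefficient) per feature, in the same feature order.
    """
    n_folds = len(fold_importances)
    n_features = len(fold_importances[0])
    mean_abs = [
        sum(abs(fold[j]) for fold in fold_importances) / n_folds
        for j in range(n_features)
    ]
    top = max(mean_abs)
    if top == 0:
        return [0.0] * n_features
    return [value / top for value in mean_abs]


# Hypothetical example: two folds, three features.
scores = relative_importance([[4.0, -2.0, 0.0], [2.0, -2.0, 0.0]])
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```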
Conclusion
18F-FDG PET/CT plays an important role in predicting the FIGO stage of OC before surgery. With AI algorithms, PET/CT features combined with clinical features can improve the accuracy of staging prediction compared with traditional methods. Our model also performed acceptably in OCCC and MCOC patients.
Discussion
The prognosis of ovarian cancer patients depends greatly on early diagnosis and individualized treatment (Wentzensen et al. 2016). The FIGO stage is important for risk stratification, which is essential for making subsequent treatment decisions and evaluating survival. Cytoreductive surgery cannot be replaced in OC treatment, and studies have shown that a maximum residual tumor diameter of less than 1 cm is an important factor affecting prognosis (Polterauer et al. 2012; Bois et al. 2009; Reuss et al. 2019). In early-stage patients, the tumor is more localized and can be removed completely by surgery. In advanced-stage patients, however, the tumor may have spread to the peritoneum, omentum, pelvic and abdominal lymph nodes, liver, and even lung, so complete resection is more difficult or even impossible. In addition, advanced-stage patients may have poor underlying conditions and thus cannot tolerate cytoreductive surgery. Therefore, neoadjuvant chemotherapy can be considered for these patients before cytoreductive surgery. After neoadjuvant chemotherapy, the tumor lesions are significantly reduced, which lowers the difficulty of cytoreductive surgery and the incidence of postoperative complications. For these reasons, we combined clinical features with PET/CT metabolic and radiomics features using machine learning approaches to build a model for predicting the OC stage.
Patients predicted by the model to be in the early stage can generally have their tumor lesions removed completely by surgery, so surgery should be arranged for them as soon as possible. For patients predicted to be in the advanced stage, however, cytoreductive surgery may not remove the lesions completely. These patients can consider neoadjuvant chemotherapy before cytoreductive surgery to obtain better surgical outcomes. In addition, patients in the advanced stage have a worse prognosis than those in the early stage. In the past, treatment decisions were made by doctors with higher professional titles and qualifications on the basis of clinical information and laboratory tests, and patients with complex conditions sometimes even required multidisciplinary discussion to develop treatment options. Our model can integrate the patients' examination indicators, predict the stage more accurately and directly, and present the staging result intuitively. In lower-level medical institutions, this can help doctors assess a patient's condition, move to follow-up care more quickly, and decide whether to arrange a referral. In high-level medical institutions, although the model cannot completely replace the assessment of senior doctors, it can provide more intuitive staging information to support their decisions. In addition, with the emphasis on a co-participation doctor-patient relationship, our prediction model helps patients and their families better understand their conditions and adjust their psychological expectations accordingly, which to a certain extent avoids some doctor-patient conflicts. Finally, deployment of the AI prediction model is crucial in clinical applications. Deploying the model on a low-cost NVIDIA AI development board is one good option.
In recent years, AI algorithms have been widely used in medicine because of their ability to process high-throughput data and discover potential connections within the data. In the clinic, they can help radiologists make the right decisions in a shorter time. We used ten classic machine learning algorithms to build prediction models and then used an adaptive ensemble to obtain the best-performing model. Kawakami et al. used 32 peripheral blood markers with an RF classifier to predict the FIGO stage; the highest predictive accuracy and AUC were 69.0% and 0.760, respectively. In the clear cell and mucinous histotypes, their model reached AUCs of 0.65 and 0.785, respectively (Kawakami et al. 2019). Our model achieved a higher AUC than these previous models. Serum tumor markers and PET/CT metabolic features have been studied as biomarkers for the prediction and evaluation of cancer. Glickman et al. found that whole-body MTV and TLG in advanced epithelial ovarian cancer were significantly correlated with serum CA125 and HE4, which is meaningful for cancer stratification (Glickman et al. 2022). Ye et al. found that TLG60 was an independent negative predictor of overall survival in ovarian clear cell carcinoma (Ye et al. 2019). However, few studies have explored the relationship between radiomics features and OC. In our results, more than half of the top 40 of the 137 features ranked by LGBM importance were PET/CT radiomics features. Moreover, when we added the radiomics features to model building, the AUC increased from 0.771 (clinical data alone) to 0.819 (all four data types). Radiomics provides information on the morphology, texture, and intensity of tumors. The shape and texture features offer important information for staging because they reflect tumor spatial information and tumor heterogeneity (Li et al. 2022). These findings demonstrate that introducing PET/CT radiomics features can improve the ability to predict the stage.
The petP and petM features were defined and evaluated by reading the PET/CT images. Patients with more severe metastasis in the peritoneum and in distant viscera outside the pelvic cavity tended to have higher stages and worse prognostic outcomes. CA125 was not so prominent in our research, which may be related to its relatively low specificity: it may be elevated in some physiological conditions and benign gynecological diseases, such as menstruation and endometriosis. However, it is still an important and popular biomarker in OC evaluation (Dochez et al. 2019). Recent studies have found that various immune and inflammatory cells play an important role in the progression of cancer, for example by inducing tumor cell death and inhibiting tumor proliferation and migration. Whether inflammatory markers have predictive value in cancer has also attracted researchers' attention. However, in our study, inflammatory markers such as SII, NLR, and PLR were not significant in predicting the FIGO stage (Jung et al. 2010; Kovács et al. 2023).
MCOC and OCCC are special histotypes of OC that are less sensitive to platinum-based chemotherapy than serous and endometrioid adenocarcinoma. It is therefore all the more important to predict the stage accurately, so that surgery can reach a zero-residual state and achieve a better prognosis. In a retrospective study, 18.6% of OCCC patients were wrongly diagnosed with endometriosis (Pozzati et al. 2018). FDG is a glucose analog that can be transported into tissues by glucose transport proteins (GLUTs) but cannot be further metabolized. Malignant tumors have enhanced cell proliferation and glucose metabolism, which is why they accumulate more FDG than normal tissues (Konishi et al. 2014). GLUTs and intracellular phosphorylation by hexokinase are key factors in FDG accumulation (Avril 2004). Different histotypes differ in their glucose metabolism, which affects the PET metabolic features. The low FDG uptake in OCCC and MCOC may be related to lower GLUT expression and lower proliferation rates (Kurokawa et al. 2004; Matsuura et al. 2021; Sato et al. 2017). MCOC has low tumor cellularity and more mucin, which also causes low FDG accumulation (Konishi et al. 2014). With the best-performing adaptive ensemble algorithm, the AUC of our model for predicting the stage of OCCC and MCOC was 0.808 in the validation dataset. Previous studies have provided little information about predicting the stage of this group of patients. We tried to build a model that could accurately predict the stage of this particular group of patients, and our AUC could reach 0.792. However, because of the rarity of the two histological types, more cases are still needed to verify the predictive performance of the model.
Limitation
Though our model showed excellent performance, there are still some limitations to our research. Next, we will discuss these limitations from three aspects: external validation, group imbalance and manual segmentation.
The number of patients enrolled in our study was limited, and external validation was lacking. In future work, we aim to increase the number of cases from different hospitals to demonstrate the generalization ability and reliability of our model. Specifically, we will collect about 100 cases from Shanghai First Maternity and Infant Hospital. This dataset will be used separately as a test set to validate the model's predictive performance with the same metrics. We will continue to actively seek cooperation with other hospitals to validate the model in multiple health care settings.
Group imbalance (binary class imbalance and multiclass imbalance) may lead to several issues. First, an AI model trained on imbalanced data tends to develop a bias toward predicting the majority class. Specifically, the model learns features of the majority class (e.g., advanced stages) and neglects the minority class (e.g., early stages), which lowers its ability to generalize when predicting the minority class. Second, a single performance metric is not sufficient to assess the predictive model accurately. For example, high accuracy does not necessarily indicate good model performance (Ghosh et al. 2024; Haixiang et al. 2017; Johnson and Khoshgoftaar 2019).
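The point about accuracy can be illustrated with the cohort sizes reported in the Results (37 early-stage and 146 advanced-stage patients). The following minimal sketch evaluates a hypothetical degenerate classifier that always predicts "advanced"; this classifier is an assumption made for illustration only and is not the model proposed in this study.

```python
# Cohort sizes reported in the Results: 37 early-stage, 146 advanced-stage.
n_early, n_advanced = 37, 146

# Hypothetical degenerate classifier (illustration only): always predicts "advanced".
y_true = ["early"] * n_early + ["advanced"] * n_advanced
y_pred = ["advanced"] * len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_early = recall(y_true, y_pred, "early")
recall_advanced = recall(y_true, y_pred, "advanced")
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_early + recall_advanced) / 2

print(f"accuracy          = {accuracy:.4f}")           # 0.7978
print(f"recall (early)    = {recall_early:.4f}")       # 0.0000
print(f"balanced accuracy = {balanced_accuracy:.4f}")  # 0.5000
```

Accuracy alone looks acceptable here (about 0.80) even though no early-stage patient is identified, which is why per-class recall or balanced accuracy should be reported alongside it.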
To address these issues, several types of approaches could be explored to improve the trained model: data-level techniques, algorithmic-level techniques, and hybrid techniques (Niaz et al. 2022). First, data-level techniques include undersampling, oversampling, and stratified sampling. Oversampling increases the number of samples in the minority class by duplicating existing instances. Stratified sampling ensures that each batch or fold in the training process contains a proportional representation of both classes, helping the model learn to detect both early- and advanced-stage patients more effectively. Second, in algorithmic-level techniques, a hyperparameter can be introduced to modify the loss function so that the model is penalized more for misclassifying minority class samples than for misclassifying majority class samples. The penalty (or weight) is inversely proportional to the class frequency. Third, hybrid techniques combine several of these approaches (Martinez-Velasco et al. 2024). In future work, we will adopt a combination of data-level and algorithmic-level techniques to overcome this limitation.
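The three ideas above (inverse-frequency class weights, oversampling by duplication, and stratified folds) can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation planned for future work. The function names are hypothetical, and the weight normalisation n_samples / (n_classes * class_count) is one common convention that we assume here.

```python
import random
from collections import Counter


def group_by_class(labels):
    """Map each class to the list of sample indices that carry it."""
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    return by_class


def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class frequency (assumed normalisation)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def random_oversample(labels, seed=0):
    """Indices in which minority classes are duplicated up to the majority count."""
    rng = random.Random(seed)
    by_class = group_by_class(labels)
    target = max(len(idx) for idx in by_class.values())
    indices = []
    for idx in by_class.values():
        indices.extend(idx)
        indices.extend(rng.choices(idx, k=target - len(idx)))
    return indices


def stratified_folds(labels, n_folds=5, seed=0):
    """Split indices into folds that keep the class proportions roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for idx in group_by_class(labels).values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds


# Cohort sizes from the Results: 37 early-stage, 146 advanced-stage.
labels = ["early"] * 37 + ["advanced"] * 146

weights = inverse_frequency_weights(labels)
print({c: round(w, 3) for c, w in weights.items()})  # {'early': 2.473, 'advanced': 0.627}

resampled = Counter(labels[i] for i in random_oversample(labels))
print(dict(resampled))  # both classes now have 146 samples

folds = stratified_folds(labels)
print([sum(labels[i] == "early" for i in fold) for fold in folds])  # [8, 8, 7, 7, 7]
```

With these weights, each class contributes the same total weight to the loss (37 × 2.473 ≈ 146 × 0.627), which is the sense in which the penalty is inversely proportional to class frequency. Oversampling should be applied only to the training folds so that duplicated patients do not leak into validation data.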
Manual segmentation of the VOI was time-consuming, and the results may have varied with the operators’ experience. This approach will be a limitation in large, multicenter studies. The reproducibility of manual segmentation and of the extracted features is also an important issue. Conventional wisdom holds that manual segmentation is less likely to miss small lesions and avoids overlapping VOIs. Firouzian et al. found that automated segmentation methods show high variation for some small lesions (Firouzian et al. 2014). Automated results therefore need manual correction, and there is no unified method for evaluating segmentation quality. This problem may limit the wider adoption of PET/CT features. Automatic segmentation is still imperfect, but the field is developing rapidly, and automatic methods may be able to completely replace manual segmentation in the near future. In recent years, artificial intelligence has also been applied to medical imaging research. Automated methods, such as those based on convolutional neural networks (CNNs), can significantly reduce time and labor costs and produce reproducible segmentation results (Constantino et al. 2023). One study showed no statistical differences in the PET/CT metabolic features extracted with manual, semi-automatic, and automatic segmentation methods (Driessen et al. 2022). In addition, Sadeghi et al. found that the PET/CT features of ovarian cancer extracted by automatic segmentation based on 3D networks had high stability (Sadeghi et al. 2024). Jin et al. proposed automatic segmentation algorithms based on U-Net models and used them to process ultrasound images. They found that the segmentation was accurate and that the extracted radiomic features showed good reproducibility and reliability (Jin et al. 2020).
To implement automated segmentation, a popular approach is to use large models such as SAM (Segment Anything Model) (Kirillov et al. 2023; Mazurowski et al. 2023). By leveraging such models, for example MedSAM (Ma et al. 2024), we expect to improve the efficiency of the segmentation process. However, a completely automated segmentation method may also introduce new problems, such as coarse labels and missing semantic information. Therefore, in future work we will explore a human-in-the-loop method that integrates human domain knowledge to obtain refined segmentation labels (Mosqueira-Rey et al. 2023; Wu et al. 2022b). Specifically, we will train a segmentation model on the PET/CT dataset with labels from experienced doctors. The trained model will then predict labels for a new dataset, and the experienced doctors will refine these labels. This process will continue until the predictions meet the requirements of the experienced doctors. We expect this method to significantly reduce the dependence on manual segmentation while improving scalability and reproducibility in clinical applications.
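The human-in-the-loop procedure described above can be summarised as the following loop. This is only a schematic sketch: the callables `train`, `predict`, `expert_refine`, and `acceptable` are hypothetical placeholders for a real segmentation network (such as a SAM-style model), the doctors' correction step, and their acceptance criterion. The toy stand-ins in the demo (a single intensity threshold on a 1D "image") are assumptions chosen purely so that the example runs.

```python
def human_in_the_loop(train, predict, expert_refine, acceptable,
                      labelled, unlabelled_batches):
    """Iteratively train, predict, and let experts refine the predicted masks.

    labelled: list of (image, mask) pairs annotated by experienced doctors.
    unlabelled_batches: iterable of lists of new images.
    Returns the final model, the enlarged labelled set, and the rounds used.
    """
    model = train(labelled)
    rounds = 0
    for batch in unlabelled_batches:
        rounds += 1
        proposals = [predict(model, img) for img in batch]
        refined = [expert_refine(img, m) for img, m in zip(batch, proposals)]
        labelled = labelled + list(zip(batch, refined))
        if acceptable(proposals, refined):
            break  # predictions already meet the experts' requirements
        model = train(labelled)  # retrain on the refined labels
    return model, labelled, rounds


# ---- Toy stand-ins (hypothetical, for demonstration only) ----
TRUE_CUTOFF = 2.5  # what the "expert" considers lesion in this toy example


def train(labelled):
    # "Model" = lowest intensity that has ever been labelled as lesion.
    return min(v for img, mask in labelled for v, m in zip(img, mask) if m)


def predict(model, img):
    return [int(v >= model) for v in img]


def expert_refine(img, proposed_mask):
    # The toy expert simply returns the correct mask.
    return [int(v >= TRUE_CUTOFF) for v in img]


def acceptable(proposals, refined):
    return proposals == refined


initial = [([1.0, 4.0, 5.0], [0, 1, 1])]
batches = [[[3.0, 2.0, 6.0]], [[3.5, 1.0, 2.0]]]
model, labelled, rounds = human_in_the_loop(
    train, predict, expert_refine, acceptable, initial, batches)
print(model, len(labelled), rounds)  # 3.0 3 2
```

In the toy run, the first model misses a voxel, the expert correction is added to the training set, and the retrained model's predictions on the second batch are accepted without changes, so the loop stops after two rounds.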
Introduction
Ovarian cancer (OC) is the deadliest gynecological cancer, with a five-year survival rate of less than 50% (Babic et al. 2020). The International Federation of Gynecology and Obstetrics (FIGO) surgical-pathological staging system is critical for guiding treatment and predicting the prognosis of OC (Gomes Ferreira et al. 2018). However, the FIGO stage can only be determined after surgery, and this delay may cause patients to miss optimal treatment.
18F-2-fluoro-2-deoxy-D-glucose positron emission tomography/computed tomography (18F-FDG PET/CT), which reflects the metabolic activity of lesions, has been used in the diagnosis and prognostic evaluation of tumors. Some invasive procedures used to assess tumor stage and type may contribute to tumor spread. In contrast, PET/CT is a non-invasive, relatively safe examination that reflects the metabolic status of the tumor, and it could be an effective tool for predicting stage. However, its accuracy still needs to be improved. Some retrospective studies have reported that the accuracy of PET/CT images in predicting OC stage was about 0.7 to 0.8 and that PET/CT could detect unexpected extra-abdominal lymph node metastasis (Nam et al. 2010; Castellucci et al. 2007).
OC is a complex gynecologic neoplasm with different histological types that have different biological features. PET/CT may underestimate the tumor burden of some special histological types, such as ovarian clear cell carcinoma (OCCC) and mucinous ovarian cancer (MCOC), because they have lower FDG uptake than high-grade serous carcinoma (HGSC) (Bowtell et al. 2015).
Machine learning is a branch of artificial intelligence (AI) (Boobier et al. 2017). It can process high-throughput data at once and improve the accuracy of calculation. Machine learning classifiers such as Gradient Boosting Machine (GBM), Random Forest (RF), and Elastic Net have been used for preoperative diagnostic and prognostic prediction (Kawakami et al. 2019; Wu et al. 2022a; Zhang et al. 2019). Radiomics is a new high-throughput approach to image analysis that provides information on statistical, morphological, and textural features. Radiomics features can be extracted with AI methods. PET/CT metabolic features, which are obtained by calculating the uptake of the radiotracer in different tissues, are traditional radiomics features that directly reflect the metabolic activity of tissues. Common PET/CT metabolic parameters include the standardized uptake value (SUV), total lesion glycolysis (TLG), and metabolic tumor volume (MTV). Previous studies have found that PET/CT metabolic features are beneficial for tumor diagnosis, staging, prognosis evaluation, and recurrence monitoring. With improvements in technology, more PET/CT radiomics parameters can now be extracted, providing more useful information for cancer treatment. These parameters may help to improve the diagnostic accuracy of PET/CT imaging (Guglielmo et al. 2021).
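As a brief illustration of how the metabolic parameters named above relate to each other, the sketch below computes SUVmax, SUVmean, MTV, and TLG from the SUV values of the voxels in a VOI, using the standard relation TLG = SUVmean × MTV. The function name, the toy voxel values, and the relative threshold of 40% of SUVmax used to delineate the metabolic volume are assumptions for illustration; this section does not state the threshold used in the study.

```python
def pet_metabolic_features(suv_voxels, voxel_volume_ml, rel_threshold=0.4):
    """Compute basic PET metabolic features for one VOI.

    suv_voxels: SUV of every voxel inside the VOI.
    voxel_volume_ml: volume of a single voxel in millilitres.
    rel_threshold: fraction of SUVmax used as the cutoff (assumed: 40%).
    """
    suv_max = max(suv_voxels)
    cutoff = rel_threshold * suv_max
    # Voxels at or above the cutoff form the metabolic tumor volume.
    inside = [v for v in suv_voxels if v >= cutoff]
    mtv = len(inside) * voxel_volume_ml   # metabolic tumor volume (mL)
    suv_mean = sum(inside) / len(inside)  # mean SUV within the MTV
    tlg = suv_mean * mtv                  # total lesion glycolysis
    return {"SUVmax": suv_max, "SUVmean": suv_mean, "MTV": mtv, "TLG": tlg}


# Toy VOI with five voxels of 0.5 mL each (made-up values).
features = pet_metabolic_features([1.0, 2.0, 5.0, 8.0, 10.0], voxel_volume_ml=0.5)
print({k: round(v, 3) for k, v in features.items()})
# {'SUVmax': 10.0, 'SUVmean': 7.667, 'MTV': 1.5, 'TLG': 11.5}
```

In this toy VOI the cutoff is 4.0, so three voxels (5.0, 8.0, and 10.0) define the metabolic volume.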
We therefore propose a novel mixture-of-experts model that uses machine learning algorithms to predict the FIGO stage of OC. The model integrates patients’ pretreatment clinical features with 18F-FDG PET/CT metabolic and radiomics features, with the aim of guiding better treatment. In addition, we analyzed the relative importance of the features for tumor stage prediction in the proposed model, which could be used to guide clinical diagnosis. We also empirically identified the difficulties of predicting the stage of OCCC and MCOC cases.
Supplementary Material
Below is the link to the electronic supplementary material. Supplementary file1 (DOCX 15 KB) Supplementary file2 (DOCX 16 KB) Supplementary file3 (DOCX 12 KB) Supplementary file4 (DOCX 25 KB) Supplementary file5 (DOCX 292 KB) Supplementary file6 (DOCX 285 KB) Supplementary file7 (DOCX 14 KB) Supplementary file8 (DOCX 15 KB)