Methods
This was a multicenter, retrospective observational study. We consecutively enrolled adult patients (≥ 18 years) with pathologically confirmed GBC who underwent curative-intent R0 resection (microscopically negative margins) at four medical centers between January 2015 and December 2022. Of 693 patients initially screened at the four centers, 172 were excluded, leaving 521 patients in the final analysis. Exclusion criteria were: (1) benign disease on postoperative pathology; (2) receipt of neoadjuvant therapy; (3) palliative surgery or R1/R2 resection; (4) perioperative death (within 30 days after surgery); and (5) incomplete clinicopathological or follow-up data. The study protocol was approved by the ethics committees of all participating centers, and the requirement for informed consent was waived. Preoperative laboratory tests were performed within two weeks before surgery. Recurrence was defined as the appearance of new local or distant metastatic lesions on postoperative imaging (contrast-enhanced CT or MRI), and recurrence time as the interval from the date of surgery to the date of first imaging-confirmed recurrence. The follow-up protocol was uniform across all centers: every 3–6 months for the first 2 years after surgery and annually thereafter. Recurrence status and date were ascertained by comprehensive review of medical records at the participating centers. When follow-up imaging had been performed at non-participating hospitals, dedicated research staff contacted the patients or their families by telephone to obtain and verify the relevant examination reports and diagnoses. Only recurrences confirmed by contrast-enhanced CT or MRI reports were recorded.
Patients who were lost to follow-up or whose recurrence status could not be verified were excluded from the analysis in accordance with the exclusion criteria. Cases in which postoperative imaging abnormalities were clearly identified as surgical complications (e.g., effusion, abscess) were excluded.
Data extracted from electronic medical records included: (a) demographic data: age and sex; (b) preoperative laboratory tests: total bilirubin, indirect bilirubin, CEA, CA19-9, AFP, etc.; (c) postoperative pathological results: T, N, and M stage (according to the AJCC 8th edition TNM staging system), tumor differentiation grade (well, moderately, or poorly differentiated), vascular invasion (yes/no), perineural invasion (yes/no), etc.; and (d) follow-up information: recurrence status (yes/no) and recurrence-free survival time.
The missing rates of all key variables are listed in Supplementary Material Table S1. Continuous variables with a missing rate < 10% were imputed with the training-set median; those with a missing rate ≥ 10% were handled by multiple imputation using the mice R package with 10 iterations, with all analysis variables included in the imputation model. Categorical variables were imputed with the mode. To rigorously evaluate the model's generalizability to new patient populations and avoid data leakage from intra-center data sharing, we used a center-based split for external validation: all cases from one entire center formed the training set, and cases from the remaining three centers were pooled as the validation set. This split yielded a training-to-validation ratio of approximately 7:3, a common choice in machine learning that balances the needs of model training against an adequately sized validation set. Baseline characteristics (Table S2) did not differ significantly between the two sets. Before model training, variance inflation factors (VIFs) were calculated for all continuous variables, and no severe multicollinearity was found (all VIF < 5). All collected features were retained for model training to preserve all potentially predictive information.
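The leakage-avoiding order described above (split by center first, then learn fill values from the training set only and apply them unchanged to the validation set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's R pipeline: records are plain dictionaries, the field names (`center`, `cea`, `t_stage`) and center labels are hypothetical, missing values are `None`, and only the median/mode rule is shown; the mice-based multiple imputation used for variables with ≥ 10% missingness is not reproduced.

```python
from statistics import median, mode


def center_split(records, train_center):
    """Center-based split: one whole center trains, all other centers validate."""
    train = [r for r in records if r["center"] == train_center]
    valid = [r for r in records if r["center"] != train_center]
    return train, valid


def fit_fill_values(train, continuous, categorical):
    """Learn fill values from the training set only, so no validation data leaks in."""
    fill = {}
    for col in continuous:
        fill[col] = median([r[col] for r in train if r[col] is not None])
    for col in categorical:
        fill[col] = mode([r[col] for r in train if r[col] is not None])
    return fill


def impute(records, fill):
    """Return copies of the records with missing (None) values replaced."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]


# Hypothetical toy data: center "A" trains, centers "B" and "C" validate.
records = [
    {"center": "A", "cea": 2.0, "t_stage": "T2"},
    {"center": "A", "cea": None, "t_stage": "T2"},
    {"center": "A", "cea": 6.0, "t_stage": None},
    {"center": "A", "cea": 4.0, "t_stage": "T3"},
    {"center": "B", "cea": None, "t_stage": None},
    {"center": "C", "cea": 100.0, "t_stage": "T1"},
]
train, valid = center_split(records, "A")
fill = fit_fill_values(train, continuous=["cea"], categorical=["t_stage"])
train_imputed = impute(train, fill)
valid_imputed = impute(valid, fill)
print(fill)
```

Because the fill values come only from center "A", the extreme CEA value at center "C" has no influence on what is imputed anywhere.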
The Cox proportional hazards regression model served as the traditional statistical benchmark. To compare it with the ML models on the same criterion, the 2-year recurrence probability predicted by the Cox model was used to calculate its AUC. The support vector machine (SVM) used a radial basis function (RBF) kernel, with the penalty parameter C and the kernel parameter γ optimized by grid search. For random forest (RF), out-of-bag (OOB) error was used to assess feature importance, and hyperparameters such as the number of trees (n_estimators) and maximum depth (max_depth) were optimized by randomized search with cross-validation.
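Under the proportional hazards assumption, a Cox model's predicted probability of recurrence by time t is 1 - S0(t)^exp(lp), where S0(t) is the baseline survival at t and lp is the patient's linear predictor; that probability can then be scored with an AUC like any classifier output. The sketch below assumes a hypothetical baseline 2-year survival of 0.7 and made-up linear predictors, and computes the AUC by counting correctly ordered recurrence/non-recurrence pairs. It treats recurrence as a binary label and does not account for censoring before 2 years; time-dependent AUC estimators address that case.

```python
import math


def cox_risk_at(baseline_surv_t, linear_predictor):
    """P(recurrence by t) = 1 - S0(t) ** exp(lp) under proportional hazards."""
    return 1.0 - baseline_surv_t ** math.exp(linear_predictor)


def auc(scores, labels):
    """Rank-based AUC: share of (recurrence, non-recurrence) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical inputs: baseline 2-year survival and per-patient linear predictors.
S0_2Y = 0.7
linear_predictors = [-1.0, 0.4, 0.0, 1.2]
recurred = [0, 0, 1, 1]
risks = [cox_risk_at(S0_2Y, lp) for lp in linear_predictors]
print(f"AUC = {auc(risks, recurred):.2f}")
```

Since the transform is monotone in lp, the AUC depends only on the ranking of linear predictors; the 2-year conversion matters for calibration and decision-curve analysis rather than for discrimination.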
Extreme Gradient Boosting (XGBoost) hyperparameters, including the learning rate, max_depth, and subsample ratio, were tuned by Bayesian optimization with the aim of limiting overfitting. The optimal parameters were learning_rate = 0.1, max_depth = 5, and subsample = 0.9 for XGBoost; n_estimators = 100 and max_features = 6 for RF; and C = 1 and gamma = 0.1 for SVM. For all models, hyperparameter optimization used 5-fold cross-validation on the training set.
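The 5-fold cross-validated search described above can be sketched as an exhaustive grid search, as used for the SVM; randomized search (RF) and Bayesian optimization (XGBoost) differ mainly in how candidate settings are proposed, not in how each candidate is scored. In this sketch `fit_and_score` is a hypothetical callback standing in for fitting a model on the training folds and returning a validation-fold score such as AUC; the toy scorer in the usage example ignores the data and simply peaks at C = 1, gamma = 0.1.

```python
import itertools
import random


def kfold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 once and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search_cv(grid, n, fit_and_score, k=5, seed=0):
    """Return (best_params, best_mean_score) over every combination in grid.

    fit_and_score(params, train_idx, valid_idx) -> score (higher is better).
    """
    folds = kfold_indices(n, k, seed)
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = []
        for i, valid_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(params, train_idx, valid_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, valid_idx):
    """Hypothetical scorer: ignores the folds and peaks at C = 1, gamma = 0.1."""
    return -(params["C"] - 1) ** 2 - (params["gamma"] - 0.1) ** 2


grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
best_params, best_score = grid_search_cv(grid, n=50, fit_and_score=toy_score)
print(best_params, best_score)
```

Every candidate is evaluated on the same folds (fixed seed), so differences in mean score reflect the hyperparameters rather than a different data split.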
All statistical analyses were performed in R (version 4.3.6). Models were built mainly with the following packages: xgboost (version 1.7.7) for XGBoost, randomForest (version 4.7-1.1) for RF, e1071 (version 1.7-14) for SVM, and survival (version 3.5-8) for the Cox model. SHAP analysis was implemented with the shapviz package (version 0.9.3). Normally distributed continuous variables are presented as mean ± standard deviation and compared with the independent-samples t-test; non-normally distributed variables are presented as median (interquartile range) and compared with the Mann-Whitney U test. Categorical variables are presented as frequency (percentage) and compared with the chi-square test or Fisher's exact test (when expected counts were < 5). All P-values were two-tailed, and P < 0.05 was considered statistically significant (Table 1).
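For 2×2 comparisons, the chi-square P-values in Table 1 appear consistent with Pearson's test plus Yates' continuity correction, which R's `chisq.test()` applies to 2×2 tables by default. The stdlib sketch below assumes the Table 1 jaundice counts (27 of 343 non-recurrence vs. 26 of 178 recurrence patients) and uses the fact that, with 1 degree of freedom, the upper-tail probability of a chi-square statistic x is erfc(√(x/2)); it yields P ≈ 0.024, matching the table.

```python
import math


def chi2_2x2_yates(a, b, c, d):
    """Pearson chi-square with Yates' continuity correction for [[a, b], [c, d]].

    Returns (statistic, two-sided p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            # Shrink |O - E| by 0.5 (never below zero), then square.
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            stat += diff * diff / expected
    # For 1 df, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Jaundice (Table 1): rows = non-recurrence / recurrence, cols = yes / no.
stat, p = chi2_2x2_yates(27, 343 - 27, 26, 178 - 26)
print(f"chi2 = {stat:.2f}, P = {p:.3f}")
```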
Table 1 Baseline characteristics of the study population
Characteristic | Non-recurrence | Recurrence | P
n | 343 | 178 |
Male (%) c | 131 (38.5) | 48 (27.0) | 0.011
Age, mean (SD) a | 64.19 (10.90) | 64.14 (9.68) | 0.959
Jaundice (%) c | 27 (7.9) | 26 (14.6) | 0.024
Indirect bilirubin, mean (SD) b | 14.41 (25.77) | 14.70 (23.09) | 0.899
Lesion ultrasound echogenicity (%) d | | | 0.018
  Hypoechoic | 135 (39.4) | 81 (45.5) |
  Mixed echogenicity | 8 (2.3) | 3 (1.7) |
  Isoechoic | 131 (38.2) | 43 (24.2) |
  No lesion detected | 68 (19.8) | 50 (28.1) |
  Hyperechoic | 1 (0.3) | 1 (0.6) |
Diabetes mellitus (%) d | 10 (3.0) | 5 (3.0) | 1
Smoking (%) c | 47 (13.7) | 21 (11.8) | 0.823
Number of masses (%) d | | | 0.041
  1 | 185 (54.6) | 87 (48.9) |
  2 | 56 (16.5) | 21 (11.8) |
  3 | 98 (28.9) | 70 (39.3) |
Gallbladder adenomyomatosis (%) d | 1 (0.3) | 0 (0.0) | 1
Gallstones/cholelithiasis (%) c | 186 (54.2) | 113 (63.5) | 0.053
Biliary tract infection (%) d | 5 (1.5) | 0 (0.0) | 0.252
Alpha-fetoprotein (AFP), mean (SD) b | 39.41 (484.15) | 13.98 (128.38) | 0.492
Carcinoembryonic antigen (CEA), mean (SD) b | 21.75 (238.58) | 16.64 (109.35) | 0.787
T stage (%) c | | | < 0.001
  0 | 18 (5.2) | 2 (1.1) |
  1 | 69 (20.1) | 16 (9.0) |
  2 | 167 (48.7) | 79 (44.4) |
  3 | 68 (19.8) | 66 (37.1) |
  4 | 21 (6.1) | 15 (8.4) |
N stage (%) c | | | < 0.001
  0 | 229 (66.8) | 74 (41.6) |
  1 | 52 (15.2) | 56 (31.5) |
  2 | 23 (6.7) | 26 (14.6) |
  3 | 39 (11.4) | 22 (12.4) |
Differentiation (%) c | | | < 0.001
  Poorly differentiated | 55 (16.0) | 49 (27.5) |
  Moderately to poorly differentiated | 84 (24.5) | 69 (38.8) |
  Moderately differentiated | 123 (35.9) | 47 (26.4) |
  Well differentiated | 41 (12.0) | 6 (3.4) |
  Well to moderately differentiated | 26 (7.6) | 6 (3.4) |
  Undifferentiated | 14 (4.1) | 1 (0.6) |
Vascular invasion (%) c | 47 (13.7) | 46 (25.8) | 0.001
Perineural invasion (%) c | 59 (17.2) | 59 (33.1) | < 0.001
Metastasis, excluding lymph node metastasis (%) d | | | 0.065
  Intrahepatic metastasis | 27 (8.0) | 19 (10.7) |
  Extrahepatic metastasis | 0 (0.0) | 3 (1.7) |
  Intrahepatic and extrahepatic | 4 (1.2) | 3 (1.7) |
  No metastasis | 308 (90.9) | 153 (86.0) |
a Independent samples t-test, data presented as mean ± SD; b Mann-Whitney U test, data presented as median (IQR); c Chi-square test; d Fisher's exact test
Results
The study ultimately included 521 patients: 343 without recurrence and 178 with recurrence. The proportion of females was 27.0% in the recurrence group compared with 38.5% in the non-recurrence group. Age did not differ significantly between the two groups (mean 64.19 years in the non-recurrence group vs. 64.14 years in the recurrence group). Jaundice was present in 14.6% of patients in the recurrence group, significantly more often than in the non-recurrence group (7.9%; P = 0.024), whereas indirect bilirubin levels did not differ significantly (P = 0.899). Lesion echogenicity on ultrasound was associated with recurrence (P = 0.018). No significant differences were found between the groups for diabetes mellitus or gallbladder adenomyomatosis. Notably, 39.3% (70/178) of patients in the recurrence group had multiple (≥ 3) masses, compared with only 28.9% (98/339) in the non-recurrence group (P = 0.041). Although gallstones were more common in the recurrence group (63.5%, 113/178) than in the non-recurrence group (54.2%, 186/343), the difference did not reach statistical significance (P = 0.053). Alpha-fetoprotein (AFP) and carcinoembryonic antigen (CEA) levels did not differ significantly between the groups (P = 0.492 and P = 0.787, respectively). Crucially, both vascular invasion (25.8% [46/178] vs. 13.7% [47/343]; P = 0.001) and perineural invasion (33.1% [59/178] vs. 17.2% [59/343]; P < 0.001) were significantly more frequent in the recurrence group, indicating strong associations with recurrence risk.
The performance of the four models was evaluated in three dimensions: discrimination, calibration, and clinical utility. As shown in Fig. 1, on the training set the XGBoost model (red curve) achieved the highest area under the curve (AUC), 0.969 (95% CI: 0.953–0.971), followed by the random forest (RF) model (AUC = 0.941, 95% CI: 0.925–0.963). The support vector machine (SVM) and Cox regression models showed weaker discrimination (AUC = 0.711 and 0.676, respectively). AUC values on the validation set were close to those on the training set, indicating good generalizability of the best-performing model. Calibration curves, which assess the accuracy of predicted probabilities, showed that the XGBoost and RF curves lay closest to the ideal calibration line, indicating the most reliable probability estimates, whereas the SVM and Cox regression curves deviated considerably, suggesting systematic prediction errors. Decision curve analysis (DCA) was used to evaluate the clinical net benefit of using each model to guide clinical decisions (e.g., intervention or follow-up) across a range of decision thresholds, relative to the “treat all” and “treat none” strategies. XGBoost provided the highest net benefit across a wide range of threshold probabilities, indicating the greatest clinical utility. Overall, considering discrimination, calibration, and clinical usefulness together, XGBoost performed best for predicting GBC recurrence, with RF a viable second option; the SVM and Cox regression models were not competitive in this study. The optimal cut-off for the XGBoost model, determined by Youden's index, was 58.5, corresponding to a sensitivity of 90.2%, specificity of 66.2%, positive predictive value of 62.5%, and negative predictive value of 91.8%.
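The threshold-based quantities reported above (Youden's index, sensitivity, specificity, PPV, NPV) and the net benefit plotted in a decision curve can be sketched as follows. This is an illustrative sketch on eight made-up predictions, not the study's data; it predicts recurrence when the probability is at or above the threshold and uses the standard decision-curve formula NB = TP/n - FP/n × pt/(1 - pt), compared with the “treat all” strategy (the “treat none” strategy always has NB = 0).

```python
def confusion(probs, labels, threshold):
    """Return (tp, fp, tn, fn), predicting recurrence when prob >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        if p >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def youden_cutoff(probs, labels):
    """Observed score maximising J = sensitivity + specificity - 1 (lowest wins ties)."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(probs)):
        tp, fp, tn, fn = confusion(probs, labels, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def metrics_at(probs, labels, t):
    """Sensitivity, specificity, PPV and NPV at threshold t."""
    tp, fp, tn, fn = confusion(probs, labels, t)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }


def net_benefit(probs, labels, pt):
    """Decision-curve net benefit of intervening when prob >= pt."""
    tp, fp, _, _ = confusion(probs, labels, pt)
    n = len(labels)
    return tp / n - fp / n * pt / (1 - pt)


def net_benefit_treat_all(labels, pt):
    """Net benefit of intervening in every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * pt / (1 - pt)


# Hypothetical predicted probabilities and observed recurrence (1) / none (0).
probs = [0.1, 0.2, 0.35, 0.4, 0.55, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cut, j = youden_cutoff(probs, labels)
print(cut, j, metrics_at(probs, labels, cut))
print(net_benefit(probs, labels, 0.5), net_benefit_treat_all(labels, 0.5))
```

Candidate thresholds are the observed scores, and because they are scanned in ascending order with a strict comparison, ties in J keep the lowest threshold, which favors sensitivity.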
Fig. 1 Comprehensive performance comparison of GBC recurrence prediction models: A Multi-Dimensional Evaluation Based on XGBoost, Random Forest, SVM, and Cox Regression
SHAP interpretability analysis clarified the model's decision-making and revealed the main drivers of postoperative GBC recurrence (Fig. 2). TNM stage (particularly T and N stage) and tumor differentiation grade were the core prognostic indicators, while a jaundice-related indicator (indirect bilirubin) and tumor markers (CEA, AFP) provided supplementary prognostic information. Figure 2A shows the direction and magnitude of the impact of each clinicopathological feature on the model output. T stage had the greatest influence on the model: a high T stage markedly increased the predicted risk of recurrence, whereas a low T stage acted as a protective factor. Differentiation grade showed a similar pattern. Elevated indirect bilirubin and CEA levels (shown in red) also increased recurrence risk, but to a lesser extent. Figure 2B quantifies the mean absolute contribution of each feature to the model output. T stage had the highest mean absolute SHAP value, confirming it as the most important predictor of recurrence; differentiation grade, N stage, CEA, and indirect bilirubin ranked second to fifth, respectively, consistent with the summary plot. Features such as vascular invasion and smoking contributed less. Figure 2C displays individualized decision paths for specific samples and reveals heterogeneity in feature interactions: the same feature (e.g., T stage) could contribute differently in different samples, depending on its combination with other features.
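SHAP values are Shapley values of the model output, so for each sample the feature contributions sum to the difference between that sample's prediction and the baseline, and the global importance in Fig. 2B is the mean absolute value per feature. The brute-force sketch below makes these two properties concrete for a hypothetical two-feature log-odds model with an interaction term. It fills "absent" features from a single reference point and enumerates all coalitions, which is exponential in the number of features; it is not how XGBoost or shapviz compute SHAP values in practice (tree models use the TreeSHAP algorithm).

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, reference):
    """Exact Shapley values of f at x; features outside a coalition take reference values."""
    m = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else reference[i] for i in range(m)]
        return f(z)

    phi = [0.0] * m
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            # Shapley weight |S|! (m - |S| - 1)! / m! for coalitions S without i.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shap(shap_rows):
    """Global importance: mean |SHAP| per feature across samples."""
    n_features = len(shap_rows[0])
    return [sum(abs(row[k]) for row in shap_rows) / len(shap_rows) for k in range(n_features)]


# Hypothetical log-odds model with two features and an interaction term.
def toy_model(z):
    return -1.0 + 0.8 * z[0] + 0.5 * z[1] + 0.3 * z[0] * z[1]


x, reference = [3.0, 2.0], [1.0, 1.0]
phi = shapley_values(toy_model, x, reference)
print(phi, sum(phi), toy_model(x) - toy_model(reference))
```

The interaction term is split between both features, which is one way the same feature can contribute differently depending on the values of the others, as in the decision paths of Fig. 2C.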
Fig. 2 Visualization of feature importance and decision mechanism for the postoperative GBC recurrence risk prediction model based on the SHAP method. A Feature Contribution Summary Plot: Each dot represents a sample. The horizontal position of a dot indicates the direction of the contribution of that sample's specific feature value to the model's final prediction (right: increases recurrence risk; left: decreases recurrence risk). The color of the dot represents the magnitude of the feature value (red: high value; blue: low value). B Feature Importance Ranking Plot: The bar length represents the mean absolute SHAP value for that feature, which is the average magnitude of the feature's impact on the prediction outcome across all samples, used to measure global feature importance. C Sample Decision Path Plot: This displays the prediction breakdown for multiple samples. Each vertical polyline represents one sample. Its final prediction value (far right end) starts from the baseline (the average predicted risk for all samples) and is incrementally shaped by the SHAP values (line segments) of individual features. The clustering of same-colored segments illustrates the collective influence pattern of a specific feature.
SHAP waterfall plots were used to illustrate the model's decision paths for high-, intermediate-, and low-risk samples (Fig. 3). Risk stratification was based on the model-predicted recurrence probability: using the decision curve analysis (DCA) results, we chose cut-off probabilities of 0.3 and 0.6 to classify patients as low risk (probability < 0.3), intermediate risk (0.3–0.6), or high risk (probability > 0.6). Each subplot starts from the baseline value (the average predicted level for all patients) and shows how each clinical feature pushes the final prediction toward a particular risk level through its SHAP value (contribution), revealing the key features driving the prediction for patients at different risk levels. High risk arises from the accumulation of multiple risk factors, intermediate risk from the interplay of opposing factors, and low risk from the combined effect of multiple protective factors. Figure 3A shows a high-risk sample with model output f(x) = 4.814, corresponding to a very high recurrence risk; the prediction is driven mainly by strong positive contributions from risk factors such as jaundice (SHAP = +1.71), elevated CEA and AFP levels, and a high T stage. Figure 3B shows an intermediate-risk sample with f(x) = 0.031, indicating a risk close to the baseline. Its feature contributions show a “tug-of-war” equilibrium: for instance, age and CEA contribute with similar magnitude but in opposite directions, cancelling each other out and yielding an average prediction. Figure 3C shows a low-risk sample with f(x) = -3.312, corresponding to a very low recurrence probability. This prediction is dominated by negative (protective) contributions from a series of factors, with a favorable differentiation grade (SHAP = -1.49) being the strongest protective factor, and younger age and normal CEA levels further reducing the risk.
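If the f(x) values in the waterfall plots are XGBoost raw margins on the log-odds scale (an assumption, though it is the usual output scale of SHAP values for a binary:logistic XGBoost model), the three example outputs can be mapped to probabilities with the logistic function and assigned to the 0.3/0.6 strata. Treating the cut-offs themselves as intermediate is also an assumption, since the boundary handling is not stated.

```python
import math


def margin_to_prob(margin):
    """Logistic link: log-odds margin -> probability (assumes binary:logistic scale)."""
    return 1.0 / (1.0 + math.exp(-margin))


def risk_group(prob, low_cut=0.3, high_cut=0.6):
    """Three-tier stratification; the cut-offs themselves count as intermediate (assumed)."""
    if prob < low_cut:
        return "low"
    if prob > high_cut:
        return "high"
    return "intermediate"


# Model outputs f(x) of the three samples in Fig. 3 (A, B, C).
for fx in (4.814, 0.031, -3.312):
    p = margin_to_prob(fx)
    print(f"f(x) = {fx:+.3f} -> p = {p:.3f} -> {risk_group(p)} risk")
```

Under this assumption the three samples land in the high, intermediate, and low strata, respectively, matching their labels in Fig. 3.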
Fig. 3 SHAP-based Waterfall Plots for Postoperative GBC Recurrence Risk Stratification. The horizontal axis (SHAP value) represents the magnitude and direction of each feature's contribution to the final prediction for a specific sample. The origin (0) represents the average predicted level (baseline) for all samples. Bars extending to the right (positive values) indicate that the feature increases the sample's recurrence risk, while bars extending to the left (negative values) indicate a decrease in risk; the length of the bar represents the magnitude of the contribution. The vertical axis (feature list) displays the features that most affect the prediction for that sample, ranked from top to bottom in descending order of the absolute value of their SHAP value (i.e., their impact on the prediction outcome). Red bars indicate a high original value for that feature, while blue bars indicate a low original value.
Discussion
Gallbladder carcinoma (GBC) is a malignancy with an extremely poor prognosis. Radical surgery is considered the cornerstone of GBC treatment, yet fewer than a quarter of patients are suitable candidates for surgery [ 29 ]. Even among those who undergo resection, the incidence of postoperative local or distant recurrence remains high, ranging from 25% to 65% [ 30 , 31 ]. Against this background, this study developed and validated a machine learning model based on the eXtreme Gradient Boosting (XGBoost) algorithm to predict postoperative recurrence risk in GBC. The model performed well across multiple metrics, achieving an area under the curve (AUC) of 0.969 on the training set and outperforming random forest (AUC = 0.941), support vector machine (SVM; AUC = 0.711), and the Cox proportional hazards model (AUC = 0.676), highlighting its superior discrimination for recurrence prediction.
These results can be compared with recent studies applying machine learning to GBC. For instance, one study that used an XGBoost model incorporating ultrasonographic features, clinical characteristics, and serological markers for GBC risk assessment reported AUCs of 0.934 and 0.916 in the training and validation sets, respectively, supporting the value of XGBoost in GBC risk evaluation [ 32 ]. Another study using XGBoost to predict early postoperative recurrence in GBC reported an AUC of 0.74; although lower than the 0.969 in our study, this still demonstrates the potential of XGBoost for handling complex non-linear relationships [ 33 ]. Furthermore, in a study predicting distant metastasis in GBC, an XGBoost model achieved an AUC of 0.885, further supporting its effectiveness in GBC prognostication [ 27 ]. In our study, the calibration curves and decision curve analysis (DCA) indicated good predictive accuracy and clinical utility for the XGBoost model, which showed favorable net benefit across a wide range of threshold probabilities, suggesting that it could support clinical decision-making.
The SHAP interpretability analysis revealed that T stage was the strongest driver of recurrence prediction, which is highly consistent with numerous previous studies. T stage reflects the depth of tumor invasion and is a direct indicator of local aggressiveness and metastatic potential [ 34 , 35 ]. In the AJCC 8th edition staging system, T stage enables more precise pathological staging of GBC, thereby improving the accuracy of prognostic assessment [ 36 , 37 ]. Research indicates that T stage can effectively distinguish survival among GBC patients at different stages and can be combined with other prognostic factors to build more comprehensive prognostic models [ 38 , 39 ], findings that further support our conclusion.
The next most important features included N stage, tumor differentiation grade, CEA, and indirect bilirubin, which also aligns with previous research [ 34 , 36 , 40 ]. Studies have shown that lymph node metastasis is significantly associated with survival in GBC, with higher N stages correlating with lower survival rates [ 40 , 41 ]. The number and location of lymph node metastases also influence prognosis, particularly in T2 GBC, where lymph node metastasis significantly reduces survival [ 35 ]. Patients with poorly differentiated tumors typically have a worse prognosis, and elevated CEA and indirect bilirubin levels are also associated with poorer survival [ 41 , 42 ]. This ranking not only confirms the central role of the TNM staging system in GBC prognosis but also highlights the additional prognostic value of tumor biological behavior (differentiation grade), systemic inflammatory/metabolic status (bilirubin), and tumor burden (CEA). Although the association between gallstones and recurrence was only borderline (P = 0.053), the higher proportion of gallstones in the recurrence group suggests that gallstones, as a chronic irritant, might be linked to the long-term progression and postoperative recurrence of GBC through mechanisms such as persistent inflammation and abnormal cell proliferation [ 43 – 45 ]. This finding warrants validation in larger prospective cohorts. Because the variable “Number of Masses” relies on postoperative pathology, its usefulness for immediate postoperative decisions about adjuvant therapy is limited; its main value lies in enriching the understanding of tumor biology.
Notably, the significant differences in several clinical characteristics between the recurrence and non-recurrence groups further underscore the importance of identifying high-risk patients early. The differences in vascular and perineural invasion, in particular, point to their potential role in prognostic assessment and offer directions for future research. Furthermore, the significant difference in the presence of jaundice (P = 0.024), contrasted with the lack of difference in indirect bilirubin levels (P = 0.899), may have several explanations. First, preoperative jaundice may be an important indicator of poor prognosis in GBC; research suggests that it is closely related to survival and to the risk of postoperative complications in GBC patients [ 46 ]. Jaundice often reflects more aggressive tumor behavior, possibly involving invasion of the bile ducts or liver, which increases the risk of postoperative recurrence. Second, indirect bilirubin may show no significant effect because total bilirubin better reflects overall liver function and tumor aggressiveness, whereas indirect bilirubin primarily reflects the breakdown products of red blood cells and is less directly related to tumor burden [ 47 ]. These findings point to potentially modifiable factors whose impact on recurrence deserves closer attention in future work. In clinical practice, such interpretability analyses can help clinicians understand individual patients' recurrence risks more clearly and thereby support individualized management.
Despite these advances, this study has several limitations. First, as a retrospective study, it may be subject to selection bias due to missing follow-up data. Although baseline characteristics did not differ significantly between included patients and those excluded because of loss to follow-up, this could still affect the generalizability of the results. Second, despite the multicenter design, the sample size may be insufficient for some subgroup analyses (e.g., specific TNM stages). Third, the model relies solely on conventional clinicopathological data; future integration of radiomic features (e.g., extracted from preoperative CT or MRI) and genomic information (e.g., TP53 and KRAS mutations) could yield more powerful predictive tools. Finally, the model's performance requires further validation in larger, more diverse prospective cohorts.
In conclusion, the XGBoost model developed in this study effectively predicts postoperative recurrence in GBC with good performance. Its interpretability analysis, while validating known clinical risk factors, provides support for individual risk assessment. This model shows promise in assisting clinicians to identify high-risk patients early after surgery, thereby offering decision support for formulating individualized adjuvant treatment strategies (such as more actively recommending adjuvant chemotherapy for high-risk patients) and differentiated follow-up plans.
Introduction
Gallbladder cancer (GBC) is the most common malignancy of the biliary tract, ranking fifth to sixth among gastrointestinal malignancies, and its incidence is increasing globally [ 1 – 3 ]. GBC is highly aggressive, and because early symptoms are typically inconspicuous, most patients are diagnosed at an advanced stage [ 4 , 5 ]. The 5-year survival rate for GBC is generally below 20%, and even after intended radical resection the postoperative recurrence rate is as high as 30% to 50% [ 6 , 7 ]. Recurrence patterns are complex, with early and frequent distant recurrence being the main cause of surgical failure [ 8 ]. Although surgical resection is the only potentially curative treatment for GBC, most patients are ineligible for resection at diagnosis, and recurrence remains common even after radical resection [ 9 – 11 ]. Recurrence is a primary cause of treatment failure and patient death. Postoperative adjuvant chemotherapy can prolong survival to some extent, particularly in node-positive patients [ 9 , 10 ]. Therefore, accurately identifying patients at high risk of postoperative recurrence, so that they can receive intensified adjuvant therapy and close follow-up, is key to improving prognosis.
Currently, prognosis assessment for GBC relies primarily on traditional clinicopathological indicators, such as TNM stage, tumor differentiation grade, lymph node metastasis, vascular invasion, and perineural invasion. These indicators are widely used in clinical practice and are considered important for assessing the prognosis of GBC patients [ 11 – 16 ]. Additionally, serum tumor markers like CA19-9 and CEA are also widely used for prognosis assessment [ 17 – 19 ]. However, the predictive power of these factors is limited when used individually, and it is difficult for clinicians to integrate these non-linear, interacting variables to precisely quantify the recurrence risk for individual patients. Therefore, there is an urgent need to develop an objective, accurate, and integrated recurrence prediction tool incorporating multi-dimensional information.
In recent years, artificial intelligence (AI) and machine learning (ML) technologies have shown great potential in medical prognosis prediction. In GBC research, investigators have begun exploring applications of ML and radiomics [ 20 , 21 ]. For instance, radiomics models based on ultrasound or contrast-enhanced CT have shown good discriminatory value in differentiating benign from malignant gallbladder polyps [ 22 ]. Interpretable diagnostic frameworks developed through multimodal data fusion have further improved the accuracy and credibility of early diagnosis [ 23 ]. Moreover, studies have used ML to predict lymph node metastasis or to construct postoperative survival prediction models, demonstrating potential value in preoperative assessment and prognosis prediction [ 24 ]. However, most current studies are single-center retrospective analyses that lack sufficient validation of model generalizability across different equipment and operators. Many existing studies also suffer from insufficient model comparison, a lack of rigorous clinical utility validation, and poor interpretability, which limits their translation into clinical practice [ 25 – 28 ]. Although ML applications in GBC have been explored, the present study is the first, within a multicenter framework, to systematically compare the performance of XGBoost, RF, SVM, and Cox models in predicting postoperative GBC recurrence. Furthermore, it integrates the SHAP framework to provide individualized, visual explanations of model decisions, going beyond conventional models that provide only a risk score.
Supplementary Material
Supplementary Material 1.