Calibration Drift Under Cross-Institutional Deployment: An External Validation Framework for ICU Mortality Prediction Across MIMIC-IV and eICU

preprint OA: closed CC-BY-4.0
Research Article (Research Square preprint)

Krutarth Patel, Phanindra Beedala

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9602675/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1).

Abstract

Background
Machine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment, a gap that undermines clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is a prerequisite for threshold-based decisions yet remains underreported.
Methods
We conducted a retrospective cohort study using MIMIC-IV (v2.2; n = 52,028 ICU stays) for model development and eICU (n = 114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines.

Results
The recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832–0.860) and external AUROC 0.819 (95% CI: 0.815–0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept −0.678) attributable to prevalence-driven label shift. An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p < 0.001).

Conclusions
ICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.

Subject areas: Critical Care & Emergency Medicine; Medical Informatics; Artificial Intelligence and Machine Learning; Biostatistics; Epidemiology
Keywords: ICU mortality prediction; Model calibration; External validation; Dataset shift; Clinical decision support; Probability calibration

1. INTRODUCTION

Machine learning models for intensive care unit (ICU) mortality prediction routinely achieve strong internal discrimination yet fail to meet calibration standards when transferred across institutions, a gap with direct and underappreciated consequences for clinical deployment [1, 2]. Reliable risk stratification can guide triage, inform resource allocation, and support high-stakes bedside decisions [3, 4], but a model that correctly ranks patients by relative risk may simultaneously overestimate or underestimate absolute mortality probability by a clinically significant margin [5]. This dissociation between preserved discrimination and degraded calibration is the central deployment hazard this work addresses.

The widespread adoption of electronic health records has enabled data-driven outcome prediction [6], and machine learning techniques have demonstrated the ability to model complex nonlinear relationships across physiological, laboratory, and demographic variables [7, 8]. Large critical care databases, including MIMIC-IV and the eICU Collaborative Research Database, have facilitated the development and benchmarking of ICU mortality models [9, 10], with logistic regression, random forest, and gradient boosting approaches consistently reporting strong discriminative performance [11, 12]. Yet single-dataset performance does not guarantee reliable generalization: differences in patient populations, clinical workflows, and data collection practices can substantially alter model behavior in external cohorts [13], and systematic reviews have documented that external validation remains inconsistently reported across the clinical prediction model literature [14]. Rigorous cross-institutional evaluation is therefore essential for assessing deployment readiness [15].

The field faces additional concerns regarding reproducibility and methodological rigor [16].
Variability in cohort definitions and preprocessing pipelines produces non-comparable results, and temporal data leakage can generate overly optimistic performance estimates [17]. Most critically, while AUROC is widely reported, calibration (the agreement between predicted probabilities and observed event rates) remains systematically under-evaluated despite its direct importance for threshold-based clinical decisions [18, 19]. Emerging evidence indicates that distributional shifts between institutions disproportionately affect absolute probability estimates while leaving discriminative ranking relatively intact [18], a dissociation with serious implications for any deployment context where absolute risk estimates drive clinical action.

To address these gaps, this study presents a reproducible, calibration-aware benchmarking framework for ICU hospital mortality prediction, operationalized across two large, independent, publicly available critical care databases. The framework spans cohort construction, feature extraction, preprocessing, model development, calibration-aware evaluation, post hoc recalibration, and demographic fairness auditing. All code, cohort definitions, and model artifacts are publicly available as an immediately adoptable evaluation template for clinical informatics researchers.

2. METHODS

2.1 Study Design and Data Sources

We conducted a retrospective cohort study using two publicly available, de-identified critical care databases. MIMIC-IV (version 2.2) [9] comprises longitudinal EHR data from a single quaternary academic medical center (Beth Israel Deaconess Medical Center, Boston, MA, USA; 2008–2022) and served as the model development and internal validation source. The eICU Collaborative Research Database [10] aggregates ICU data from 335 units across 208 U.S. hospitals (2014–2015) and constituted the independent external validation cohort.
All procedures adhered to established standards for clinical prediction model development and external validation [13]. The primary outcome was in-hospital mortality; secondary outcomes were ICU mortality and prolonged ICU length of stay (≥ 7 days). Eligibility criteria were applied identically across both datasets: adult patients (aged ≥ 18 years) with a unique first ICU stay per hospitalization. Exclusion criteria were: ICU stays < 4 hours, missing outcome data, and age < 18 years. As all data were fully de-identified and publicly available under approved data-use agreements, IRB review was not required.

Baseline cohort characteristics are reported in Table 1. Standardised mean differences (SMDs) were computed to characterise distributional similarity between the development and external validation cohorts: |µ₁ − µ₂| / pooled SD for continuous, normally distributed variables; the proportion-based formula for binary variables; and, for non-normally distributed variables reported as median [IQR], the same calculation applied to raw individual-level data before aggregation.

2.2 Feature Extraction and Preprocessing

Predictor variables were restricted to routinely collected clinical data from the first 24 hours of ICU admission to reflect realistic deployment constraints and prevent temporal data leakage [15, 16, 17]. Features spanned four domains: demographics, vital signs, laboratory measurements, and clinical treatment indicators including fluid balance. Thirty-six candidate features were initially extracted; after screening, the final modeling set comprised 38 predictors inclusive of binary missingness indicators (added for variables with > 10% missing values). Continuous variables were summarized using minimum, maximum, or mean within the 24-hour window (detailed in Supplementary Table S1).
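As an illustration of the SMD definitions given in Section 2.1, the following minimal sketch recomputes two SMDs from the rounded summary statistics in Table 1. It assumes the common equal-weight pooled standard deviation, sqrt((SD₁² + SD₂²) / 2), which the text does not specify; the function names are hypothetical and this is not the authors' code. Values computed from rounded summaries can differ slightly from the tabulated SMDs.

```python
import math


def smd_continuous(mean_1, sd_1, mean_2, sd_2):
    """|mean_1 - mean_2| / pooled SD, pooling the two variances equally (assumption)."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return abs(mean_1 - mean_2) / pooled_sd


def smd_binary(p_1, p_2):
    """Proportion-based SMD for a binary variable (inputs are fractions, not %)."""
    pooled_sd = math.sqrt((p_1 * (1.0 - p_1) + p_2 * (1.0 - p_2)) / 2.0)
    return abs(p_1 - p_2) / pooled_sd


# Summary statistics taken from Table 1 (MIMIC-IV vs. eICU).
smd_map = smd_continuous(77.61, 12.20, 81.72, 13.48)  # mean arterial pressure
smd_mortality = smd_binary(0.105, 0.0869)             # hospital mortality
print(f"MAP SMD ~ {smd_map:.3f}; hospital mortality SMD ~ {smd_mortality:.3f}")
```

Both results land close to the tabulated values (0.322 and 0.06), which suggests the equal-weight pooling assumption is at least compatible with the reported numbers.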
All preprocessing parameters (imputation values, encoding mappings, and scaling transformations) were derived exclusively from the MIMIC-IV training partition and applied without modification to all validation and external cohorts [13]. Features with > 45% missingness were excluded; remaining missing values were imputed using training-set medians [13]. Categorical variables were one-hot encoded; min–max normalization was applied to logistic regression only.

2.3 Model Development

MIMIC-IV was partitioned into training (70%), internal validation (15%), and held-out test (15%) sets using stratified random sampling (random seed = 42). The eICU dataset was withheld entirely for external validation. Three model classes were evaluated: logistic regression, random forest [20], and gradient boosting (XGBoost). Class imbalance (~10% mortality rate) was addressed through class-weighted loss functions.

Given that calibration is a critical and under-evaluated dimension of clinical model validity [18, 19], model selection was based on a study-specific composite criterion:

Score = 0.40 × AUROC + 0.25 × AUPRC − 0.20 × Brier − 0.10 × ECE − 0.03 × |slope − 1| − 0.02 × |intercept|,

where ECE denotes expected calibration error [18, 19]. Higher scores indicate better overall performance, with weights selected to prioritize discrimination while explicitly penalizing miscalibration and probabilistic error. Hyperparameters were tuned via five-fold stratified cross-validation; the binary threshold was set using the Youden index. The XGBoost model with post hoc logistic recalibration achieved the highest composite validation score and was designated the primary model.

2.4 Validation and Evaluation

Internal validation was performed on the held-out MIMIC-IV test set.
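The composite selection criterion defined in Section 2.3 can be written as a short function. The sketch below uses the weights stated in the text; the two candidate models and their metric values are hypothetical and serve only to show how miscalibration lowers the score when discrimination is held equal.

```python
def composite_score(auroc, auprc, brier, ece, slope, intercept):
    """Composite criterion from Section 2.3; higher is better."""
    return (
        0.40 * auroc
        + 0.25 * auprc
        - 0.20 * brier
        - 0.10 * ece
        - 0.03 * abs(slope - 1.0)
        - 0.02 * abs(intercept)
    )


# Hypothetical validation metrics (illustrative only, not taken from the study).
candidates = {
    "model_a": dict(auroc=0.84, auprc=0.43, brier=0.076, ece=0.012, slope=0.98, intercept=0.01),
    "model_b": dict(auroc=0.84, auprc=0.43, brier=0.076, ece=0.050, slope=0.80, intercept=-0.60),
}
scores = {name: composite_score(**metrics) for name, metrics in candidates.items()}
best = max(scores, key=scores.get)
print(best, {name: round(value, 4) for name, value in scores.items()})
```

Because the calibration terms carry small weights (0.10, 0.03, and 0.02), two models with identical discrimination are separated only by a few hundredths, so the criterion acts mainly as a tie-breaker in favour of the better-calibrated model.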
Metrics comprised: discrimination (AUROC, AUPRC [21, 22]); overall probabilistic accuracy (Brier score [23]); calibration (slope, intercept, ECE [18, 19]); and binary classification performance (sensitivity, specificity, PPV, NPV, F1). Bootstrap confidence intervals were estimated using 500 iterations for primary model-performance analyses, 300 iterations for subgroup analyses, and 1,000 iterations for paired benchmark comparisons. Iteration counts were selected to balance computational efficiency with estimation stability across analyses of varying complexity.

External validation was conducted on the full eICU cohort without retraining or parameter updates, applying all preprocessing exactly as defined on MIMIC-IV training data [15]. Post hoc Platt scaling was applied to external predictions using internal validation outputs [24, 25]; a label-shift intercept-only adjustment was evaluated as a sensitivity analysis. Calibration was assessed graphically (loess-smoothed calibration curves) and quantitatively. Clinical utility was evaluated using decision curve analysis (DCA), quantifying net benefit across clinically plausible threshold probabilities relative to treat-all and treat-none strategies [26, 27], with additional benchmark comparison against APACHE scores in a matched eICU subset.

2.5 Subgroup, Sensitivity, and Interpretability Analyses

Subgroup analyses were performed across sex and race/ethnicity groups available in eICU. AUROC and calibration metrics were computed per subgroup with bootstrap confidence intervals; performance disparities were quantified as absolute between-group differences following established algorithmic fairness frameworks [28, 29]. Five pre-specified sensitivity analyses assessed robustness to: single-stay-per-patient restriction; restriction to ICU stays ≥ 48 hours; exclusion of laboratory variables; exclusion of arterial blood gas features; and exclusion of race/ethnicity variables.
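The calibration metrics named in Section 2.4 (slope, intercept, and ECE) can be illustrated with a minimal, standard-library sketch. It makes several assumptions the text does not confirm: slope and intercept are fitted jointly by regressing the outcome on the logit of the predicted probability, ECE uses ten equal-width bins, and the data are simulated so that the true log-odds sit 0.7 below the predicted log-odds. All function names are hypothetical.

```python
import math
import random


def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    # Numerically stable for large negative z.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def calibration_slope_intercept(y_true, y_prob, n_iter=15):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson; return (slope b, intercept a)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b, a


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: bin-size-weighted mean |observed rate - mean prediction|."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of p, sum of y
    for yi, pi in zip(y_true, y_prob):
        k = min(int(pi * n_bins), n_bins - 1)
        bins[k][0] += 1
        bins[k][1] += pi
        bins[k][2] += yi
    n = len(y_true)
    return sum(abs(sum_y - sum_p) / n for count, sum_p, sum_y in bins if count > 0)


rng = random.Random(42)
probs = [rng.uniform(0.01, 0.60) for _ in range(20000)]
# Simulated outcomes: true log-odds are 0.7 lower than predicted (overestimation).
labels = [1 if rng.random() < sigmoid(logit(p) - 0.7) else 0 for p in probs]
slope, intercept = calibration_slope_intercept(labels, probs)
ece = expected_calibration_error(labels, probs)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, ECE={ece:.3f}")
```

In this simulation the fitted slope stays near 1 while the intercept is clearly negative, which mirrors the qualitative pattern the study reports for external validation: ranking is preserved while absolute risk is overestimated.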
Model interpretability was assessed using SHAP values (TreeExplainer) [30] and permutation importance, with rank concordance evaluated by Spearman correlation.

2.6 Reproducibility and Reporting

All analyses were implemented in Python (random seed = 42). The complete pipeline is publicly available (https://github.com/Krutarth007/icu-mortality-prediction-ml). This study adheres to TRIPOD reporting guidelines [31] and incorporates PROBAST risk-of-bias assessment [32], consistent with standards for rigorous clinical machine learning research [16].

3. RESULTS

3.1 Cohort Characteristics

After applying eligibility criteria, the development cohort comprised 52,028 adult ICU stays from MIMIC-IV (training n = 36,328; validation n = 7,933; held-out test n = 7,767) and the external validation cohort comprised 114,060 ICU stays from eICU. In-hospital mortality was 10.5% in MIMIC-IV and 8.7% in eICU. Baseline characteristics are presented in Table 1. Standardised mean differences (SMDs) were < 0.10 for most variables, indicating broadly comparable distributions; the largest shifts were observed for 24-hour urine output (median 2,160 vs. 1,335 mL; SMD = 0.431) and mean arterial pressure (77.6 vs. 81.7 mmHg; SMD = 0.322), likely reflecting institutional differences in fluid management protocols and documentation practices.
Table 1. Baseline demographic and clinical characteristics of the MIMIC-IV and eICU study cohorts

| Variable | MIMIC-IV (n = 52,028) | eICU (n = 114,060) | SMD |
|---|---|---|---|
| Demographics | | | |
| Age, mean (SD), years | 63.61 (16.56) | 63.69 (16.77) | 0.005 |
| Female, % | 43.69 | 46.01 | 0.046 |
| Vital signs | | | |
| Heart rate, mean (SD), bpm | 85.00 (15.76) | 84.99 (16.38) | < 0.01 |
| MAP, mean (SD), mmHg | 77.61 (12.20) | 81.72 (13.48) | 0.322 |
| Respiratory rate, mean (SD) | 19.15 (3.76) | 19.66 (4.74) | 0.12 |
| SpO₂, mean (SD), % | 96.95 (1.99) | 96.85 (2.21) | 0.048 |
| Laboratory values | | | |
| BUN, median [IQR], mg/dL | 20.00 [13.00–32.00] | 20.00 [13.00–34.00] | 0.037 |
| Creatinine, median [IQR], mg/dL | 1.00 [0.70–1.50] | 1.03 [0.76–1.63] | 0.074 |
| Lactate, median [IQR], mmol/L | 2.10 [1.40–3.10] | 1.80 [1.20–3.10] | 0.183 |
| Hemoglobin, median [IQR], g/dL | 10.00 [8.50–11.70] | 10.70 [9.00–12.40] | 0.157 |
| Platelets, median [IQR], ×10³/µL | 177 [125–240] | 185 [135–243] | 0.065 |
| Clinical indicators and outcomes | | | |
| Urine output 24 h, median [IQR], mL | 2160 [1300–3250] | 1335 [730–2120] | 0.431 |
| Hospital mortality, % | 10.5 | 8.69 | 0.06 |
| ICU mortality, % | 6.71 | 5.55 | 0.047 |
| Prolonged ICU LOS ≥ 7 days, % | 13.19 | 10.91 | 0.072 |

Values are mean (SD), median [IQR], or percentage as appropriate. SMD = standardised mean difference; SMD < 0.10 indicates negligible distributional difference. Race/ethnicity was not available in harmonised form for MIMIC-IV and is reported for eICU only in Section 3.5. MAP = mean arterial pressure; BUN = blood urea nitrogen; LOS = length of stay. The large urine output SMD (0.431) likely reflects institutional variation in fluid protocols and documentation practices rather than a fundamental cohort difference.

3.2 Model Performance

Internal and external performance metrics are summarised in Table 2. On the held-out MIMIC-IV test set, the primary model (XGBoost + logistic recalibration) achieved AUROC 0.847 (95% CI 0.832–0.860), AUPRC 0.441 (95% CI 0.402–0.475), and Brier score 0.075 (95% CI 0.071–0.079).
In external validation on eICU, the model maintained strong discrimination (AUROC 0.819, 95% CI 0.815–0.823; AUPRC 0.355; Brier 0.072), with an absolute AUROC reduction of 0.028, consistent with expected attenuation under cross-institutional distributional shift. ROC curves for all models on the internal test set are shown in Fig. 2.

Table 2. Discrimination, probabilistic accuracy, and external calibration characteristics of all prediction models. Internal validation: MIMIC-IV test set (n = 7,767). External validation: eICU (n = 114,060).

| Model | Internal AUROC (95% CI) | Internal AUPRC (95% CI) | Internal Brier (95% CI) | External AUROC (95% CI) | External AUPRC (95% CI) | External Brier (95% CI) | External calibration slope / intercept |
|---|---|---|---|---|---|---|---|
| Logistic regression | 0.797 (0.782–0.812) | 0.380 (0.345–0.412) | 0.080 (0.076–0.084) | 0.769 (0.764–0.773) | 0.295 (0.286–0.305) | 0.081 (0.080–0.082) | 0.844 / −1.045 |
| Random forest | 0.835 (0.820–0.847) | 0.398 (0.363–0.432) | 0.089 (0.086–0.092) | 0.807 (0.803–0.811) | 0.329 (0.320–0.339) | 0.097 (0.096–0.097) | 1.436 / −1.067 |
| XGBoost (base) | 0.847 (0.832–0.860) | 0.441 (0.402–0.475) | 0.075 (0.071–0.079) | 0.819 (0.815–0.823) | 0.355 (0.346–0.366) | 0.072 (0.071–0.073) | 0.998 / −0.691 |
| XGBoost + logistic recalibration (primary model) | 0.847 (0.832–0.860) | 0.441 (0.402–0.475) | 0.075 (0.071–0.079) | 0.819 (0.815–0.823) | 0.355 (0.346–0.366) | 0.072 (0.071–0.073) | 0.980 / −0.678 |

AUROC = area under the ROC curve; AUPRC = area under the precision–recall curve; Brier = Brier score (lower = better). 95% confidence intervals were derived using stratified bootstrap resampling (500 iterations for primary model-performance estimates and 1,000 iterations for paired comparisons). Calibration slope ≈ 1.0 and intercept ≈ 0 indicate ideal calibration; a negative intercept indicates systematic risk overestimation. Logistic recalibration preserves rank ordering; therefore AUROC/AUPRC are identical between XGBoost (base) and the recalibrated model.
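The note that logistic recalibration preserves rank ordering can be checked with a small standard-library sketch. It computes a rank-based AUROC (Mann-Whitney formulation) before and after a Platt-style map p → sigmoid(a + b · logit(p)) with b > 0. The data are simulated and the coefficients a and b are hypothetical values chosen only for illustration, not the study's fitted parameters.

```python
import math
import random


def auroc(y_true, scores):
    """Rank-based AUROC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def logistic_recalibrate(p, a, b):
    """Platt-style map p -> sigmoid(a + b * logit(p)); strictly increasing when b > 0."""
    z = a + b * math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(42)
raw = [rng.uniform(0.01, 0.99) for _ in range(2000)]
y = [1 if rng.random() < p else 0 for p in raw]
# Hypothetical recalibration coefficients, for illustration only.
recal = [logistic_recalibrate(p, a=-0.5, b=1.2) for p in raw]
print(round(auroc(y, raw), 6), round(auroc(y, recal), 6))
```

Because the map is strictly increasing, every pairwise ordering of predictions is unchanged, so any rank-based metric is identical before and after recalibration. The Brier score and calibration metrics, in contrast, depend on the probability values themselves and can change.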
Figure 2. ROC curves for logistic regression (AUROC 0.797), random forest (0.835), XGBoost (0.847), and recalibrated XGBoost (0.847) on the internal test set. The XGBoost and recalibrated XGBoost curves are superimposed because recalibration preserves rank-based discrimination; the two models differ in probability estimates and calibration characteristics.

3.3 Calibration and Recalibration

Calibration plots are shown in Fig. 3. Internally, the primary model was near-ideally calibrated (slope 0.982, 95% CI 0.919–1.046; intercept 0.001, 95% CI −0.141 to 0.144; ECE = 0.010). Externally, the calibration slope remained near-ideal (0.980, 95% CI 0.964–0.998), confirming preservation of relative risk ordering across institutions. However, the calibration intercept was substantially negative (−0.678, 95% CI −0.712 to −0.649), indicating systematic overestimation of absolute mortality risk attributable to the 1.81-percentage-point lower event rate in eICU, a pattern consistent with prevalence-driven label shift rather than covariate shift. ECE increased fivefold (0.010 internally to 0.053 externally). A post hoc intercept-only label-shift correction reduced ECE to 0.039 (intercept −0.501), a 26% relative improvement, demonstrating that targeted recalibration without retraining can substantially restore the clinical reliability of probability estimates.

Figure 3. Points = mean predicted probability vs observed event rate per decile; dashed orange diagonal = perfect calibration. Internal: slope 0.982, intercept 0.001. External: slope 0.980, intercept −0.678. The near-unit slope externally confirms preserved relative risk ordering; the negative intercept reflects systematic absolute risk overestimation attributable to lower event-rate prevalence in eICU vs MIMIC-IV.

3.4 Clinical Utility and APACHE Benchmark

Decision curve analysis demonstrated positive net benefit over treat-all and treat-none strategies across threshold probabilities of approximately 2–40% in the external cohort (Fig. 4). In the matched eICU subset with available APACHE scores (n = 98,788), the primary model outperformed APACHE in discrimination (AUROC 0.817 vs. 0.795; DeLong p < 0.001) and probabilistic accuracy (Brier 0.074 vs. 0.075). AUPRC was marginally lower (0.364 vs. 0.382), likely reflecting APACHE's weighting towards high-acuity patients. APACHE exhibited markedly poor absolute calibration (slope 0.591, intercept −1.159), indicating systematic risk overestimation. Full results are in Table 3.

Figure 4. Dashed purple = treat-all; dotted brown = treat-none. All ML models exceeded treat-none across the full range and exceeded treat-all above ~5%. The recalibrated XGBoost model yielded the highest net benefit across the clinically relevant 2–40% range. XGBoost and recalibrated XGBoost curves coincide, confirming recalibration does not alter decision-analytic utility.

Table 3. Benchmark comparison with APACHE and subgroup performance of the primary model (external eICU cohort)

| Category / Group | n | AUROC (95% CI) | AUPRC | Brier | Cal. slope / int. |
|---|---|---|---|---|---|
| APACHE benchmark (matched subset, n = 98,788) | | | | | |
| APACHE | 98,788 | 0.795 | 0.382 | 0.075 | 0.591 / −1.159 |
| XGBoost + recalibration (primary model) | 98,788 | 0.817 (0.815–0.819) | 0.364 | 0.074 | 0.975 / −0.651 |
| Sex (external eICU cohort) | | | | | |
| Male | 61,548 | 0.823 (0.818–0.828) | 0.363 | 0.07 | 1.002 / −0.569 |
| Female | 52,474 | 0.814 (0.808–0.820) | 0.349 | 0.074 | 0.965 / −0.783 |
| Race/ethnicity (external eICU cohort) | | | | | |
| Caucasian | 87,619 | 0.817 (0.812–0.822) | 0.354 | 0.072 | 0.981 / −0.669 |
| African American | 13,170 | 0.824 (0.810–0.834) | 0.355 | 0.067 | 0.975 / −0.721 |
| Hispanic | 4,226 | 0.823 (0.802–0.841) | 0.357 | 0.077 | 0.944 / −0.735 |
| Asian | 1,920 | 0.840 (0.814–0.867) | 0.424 | 0.073 | 1.135 / −0.400 |
| Native American | 699 | 0.822 (0.764–0.868) | 0.429 | 0.073 | 0.952 / −0.485 |
| Other/Unknown | 5,166 | 0.838 (0.820–0.854) | 0.367 | 0.069 | 1.027 / −0.539 |

APACHE comparison restricted to eICU patients with available APACHE hospital mortality predictions (n = 98,788; event rate 9.0%). DeLong test p < 0.001 for XGBoost vs APACHE AUROC.
95% confidence intervals were estimated using stratified bootstrap resampling, with iteration counts varying by analysis type (500 for primary model-performance estimates, 300 for subgroup analyses, and 1,000 for paired benchmark comparisons). Cal. int. = calibration intercept. Race/ethnicity reported for eICU only (not available in harmonised form for MIMIC-IV). Native American subgroup (n = 699; 67 events) has correspondingly wider CIs. The APACHE calibration slope of 0.591 (well below 1) indicates that its predicted risks are too extreme relative to observed outcomes.

3.5 Subgroup, Sensitivity, and Secondary Outcomes

Discriminative performance was consistent across sex (AUROC gap 0.009) and racial/ethnic groups (AUROC range 0.817–0.840; maximum gap 0.044), with overlapping confidence intervals for most pairwise comparisons (Table 3). Calibration intercepts varied more substantially by subgroup (range −0.400 [Asian] to −0.783 [Female]), indicating unevenly distributed absolute risk overestimation that may require subgroup-specific recalibration prior to deployment. Exclusion of race/ethnicity variables produced negligible change in discrimination (ΔAUROC = +0.001).

Across five sensitivity analyses (Supplementary Table S2), discrimination was broadly stable. A routine-predictor model using only 19 features (excluding arterial blood gas variables) achieved external AUROC 0.794 (ΔAUROC = −0.025), supporting feasibility in resource-limited settings. Restricting to ICU stays ≥ 48 hours produced the largest attenuation (ΔAUROC = −0.059), consistent with survivor selection bias. For ICU mortality, external AUROC was 0.836 (95% CI 0.830–0.840); for prolonged LOS (≥ 7 days), external AUROC was 0.720 (95% CI 0.715–0.725).

3.6 Model Interpretability

SHAP analysis identified clinically coherent predictors (Fig. 5).
The five highest-ranked features by mean |SHAP| were 24-hour urine output (0.310), age (0.293), maximum BUN (0.291), ventilation flag (0.218), and mean respiratory rate (0.201), all established markers of organ dysfunction and haemodynamic compromise. Permutation importance yielded broadly consistent rankings (top three: urine output, lactate, age), supporting the robustness of the interpretability findings. The race variable appeared in the top five by permutation importance (ΔAUROC = 0.0015) but showed modest SHAP contribution, likely reflecting correlation with physiological predictors rather than independent signal. Permutation importance results are in Supplementary Figure S1.

Figure 5. Each point represents one ICU stay; x-axis = SHAP value (impact on log-odds of mortality); colour = feature value (red = high, blue = low). Features ranked by descending mean |SHAP|. Feature labels rendered with clinical nomenclature. SHAP values computed using TreeExplainer. Permutation importance is provided in Supplementary Figure S1.

4. DISCUSSION

4.1 Principal Findings

This study develops and externally validates a reproducible, calibration-aware machine learning framework for ICU hospital mortality prediction and demonstrates that external validation practices relying solely on discrimination metrics may systematically misrepresent model readiness for clinical deployment. Three principal findings emerge.

First, gradient boosting with logistic recalibration achieved transportable discrimination (internal AUROC 0.847, external AUROC 0.819; absolute reduction 0.028), consistent with prior benchmarking studies on MIMIC-derived data reporting AUROC values of 0.82–0.86 using machine learning approaches [3], including deep learning and XGBoost-based methods [8], multitask recurrent architectures [11], and tree-based ensemble models across different critical care outcomes [12].
The present results extend prior work through systematic external validation on a fully independent multi-site cohort of 114,060 ICU admissions, a form of validation that remains inconsistently reported across the clinical prediction model literature [13].

Second, the primary model outperformed APACHE in discrimination (AUROC 0.817 vs. 0.795; DeLong p < 0.001) and probabilistic accuracy (Brier 0.074 vs. 0.075) in the matched external subset. APACHE exhibited markedly poor absolute calibration (slope 0.591, intercept −1.159), reflecting a recognized limitation of conventional severity scoring in mixed-acuity ICU populations [3, 13]. This reinforces that well-validated machine learning can offer performance advantages over legacy scores provided calibration is explicitly evaluated prior to deployment [15].

Third, and most critically, discriminative generalizability did not imply calibration generalizability. Despite near-ideal internal calibration (slope 0.982, intercept 0.001), the external calibration intercept was substantially negative (−0.678, 95% CI −0.712 to −0.649), attributable to prevalence-driven label shift, a form of distributional change entirely undetectable by AUROC-based validation. A simple post hoc intercept update, applied without retraining, reduced ECE by 26% (0.053 to 0.039), demonstrating that targeted recalibration can restore clinical reliability at new deployment sites with minimal infrastructure overhead.

4.2 Calibration–Discrimination Dissociation and Clinical Utility

The dissociation between preserved discrimination and degraded calibration under cross-institutional shift is mechanistically attributable to the lower in-hospital mortality prevalence in eICU versus MIMIC-IV.
When event-rate prevalence differs between development and deployment settings, predicted absolute probabilities diverge from local observed rates even when patient risk rankings are preserved, consistent with emerging evidence that distributional shifts disproportionately affect probability estimates while leaving discriminative ranking relatively intact [15]. Most prior ICU mortality prediction studies report AUROC as the primary or sole metric and do not assess calibration slope and intercept under true external validation [16, 17], leaving a critical gap in deployment readiness assessment [18, 19]. A model with AUROC 0.82 but a calibration intercept of −0.68 may correctly rank the sickest patients while systematically overstating their absolute mortality risk, driving over-intervention near the decision threshold or producing misleading prognostic communications.

Decision curve analysis confirmed positive net benefit across threshold probabilities of approximately 2–40% in the external cohort (Fig. 4) [26, 27], encompassing the range most directly relevant to ICU triage, early intervention activation, and resource prioritization. This net benefit, however, is recoverable at new sites only after local intercept recalibration adjusts predicted probabilities to reflect site-specific mortality prevalence. Calibration verification and site-specific recalibration should therefore be treated as prerequisites to deployment rather than optional post hoc steps.

The fully reproducible end-to-end pipeline (harmonised cohort construction, strict temporal leakage controls, calibration-aware model selection, and public code release) directly addresses the reproducibility concerns identified as structural weaknesses of clinical machine learning research [18, 19].
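The prevalence argument in Section 4.2 can be made concrete with a minimal sketch of a standard prior-shift (label-shift) correction: if only the outcome prevalence changes between sites, predicted log-odds move by logit(site prevalence) − logit(development prevalence). This is an assumed textbook form; the text does not describe the exact adjustment the authors applied, and the function names are hypothetical. The prevalences are the hospital mortality rates from Table 1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prevalence_offset(dev_prevalence, site_prevalence):
    """Logit-scale shift implied by a change in outcome prevalence alone."""
    return logit(site_prevalence) - logit(dev_prevalence)


def intercept_only_update(p, offset):
    """Shift a predicted probability on the logit scale; ranking is unchanged."""
    return sigmoid(logit(p) + offset)


# In-hospital mortality from Table 1: 10.5% (MIMIC-IV) vs. 8.69% (eICU).
offset = prevalence_offset(0.105, 0.0869)
adjusted = [intercept_only_update(p, offset) for p in (0.05, 0.105, 0.30)]
print(f"offset={offset:.3f}", [round(p, 3) for p in adjusted])
```

Under this sketch the offset is roughly −0.21 on the logit scale, and a prediction equal to the development prevalence (10.5%) maps to the site prevalence (8.69%). Because the update adds a constant to every log-odds value, it changes absolute probabilities without altering patient ranking.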
4.3 Equity and Interpretability

Subgroup analyses revealed that while discriminative performance was consistent across racial/ethnic groups (AUROC range 0.817–0.840; maximum gap 0.044) and sex (gap 0.009), calibration intercepts varied substantially by subgroup (range −0.400 [Asian] to −0.783 [Female]) [28, 29]. Absolute risk overestimation was unevenly distributed, indicating that a single global intercept adjustment may not restore equitable probability estimation across all subpopulations. Subgroup-specific calibration monitoring or stratified recalibration protocols should be considered prior to threshold-based deployment [28, 29]. Excluding race/ethnicity variables produced negligible discrimination change (ΔAUROC = +0.001), supporting race-excluded model variants in settings where demographic variables may encode structural inequities rather than independent clinical risk.

SHAP values and permutation importance yielded strongly concordant feature rankings [30], with urine output, age, maximum BUN, ventilation flag, and respiratory rate identified as the five most influential predictors, all established markers of organ dysfunction and haemodynamic compromise. Feature contributions reflect associative rather than causal relationships; clinical interpretation should be made accordingly [30].
4.4 Strengths, Limitations, and Future Directions

Key strengths include: two large, diverse public critical care databases (n = 166,088 combined); a comprehensive evaluation framework incorporating discrimination, calibration, decision curve analysis, APACHE benchmarking, fairness analysis, and five pre-specified sensitivity analyses; a fully reproducible pipeline with explicit leakage controls supporting transparent replication [31]; a resource-limited variant (19-predictor model, external AUROC 0.794) demonstrating feasibility in community hospital settings [33]; and public release of all code, model artifacts, and outputs consistent with reproducibility standards for trustworthy clinical machine learning [34]. Decision curve analyses were conducted and reported in accordance with established interpretive guidelines [35].

Limitations include: exclusive reliance on U.S. datasets limiting international generalizability; exclusion of temperature features due to near-complete missingness in MIMIC-IV; use of median imputation rather than multiple imputation; subgroup analyses restricted to sex and race/ethnicity available in eICU; and a retrospective design precluding conclusions about real-world clinical impact. Advanced domain adaptation strategies, including transfer learning and Bayesian updating, were not evaluated.

Future work should prioritize prospective stepped-wedge validation within active clinical decision support systems, adaptive site-specific recalibration protocols, and evaluation in non-U.S. and lower-resource healthcare settings to characterize global transportability and broaden fairness assessment.

5. CONCLUSION

Machine learning models for ICU mortality prediction can achieve transportable discrimination, but this study demonstrates that transportable discrimination does not guarantee transportable clinical utility.
The recalibrated XGBoost model maintained strong external discrimination (AUROC 0.819, 95% CI 0.815–0.823) across 114,060 independent ICU admissions and outperformed APACHE in discrimination and probabilistic accuracy. Yet despite near-ideal internal calibration (slope 0.982, intercept 0.001), the external calibration intercept shifted substantially (−0.678), reflecting systematic overestimation of absolute risk attributable to prevalence-driven label shift, a dissociation that AUROC alone cannot detect. A simple post hoc intercept-only adjustment reduced expected calibration error by 26% without retraining, indicating that calibration can be improved through pragmatically feasible recalibration strategies with direct relevance for health systems deploying predictive tools across institutional boundaries [15, 16]. Calibration must therefore be treated as a mandatory evaluation standard, not an optional reporting item [18, 19]. Equity analyses further show that calibration intercepts varied substantially across demographic subgroups (range −0.400 to −0.783) despite consistent discrimination, indicating that global recalibration alone may not restore equitable probability estimation and reinforcing the need for subgroup-specific calibration monitoring as a prerequisite to equitable deployment [28, 29]. Together, the reproducible benchmarking framework presented here, developed in accordance with TRIPOD and PROBAST standards [31, 32]; the 19-feature resource-limited variant achieving external AUROC 0.794 [33]; the publicly released code and model artifacts [34]; and DCA reporting per established guidelines [35] provide an immediately adoptable evaluation template for clinical prediction model research in critical care informatics. Realizing the clinical promise of AI-assisted critical care requires validation that extends beyond discrimination to encompass calibration, equity, and decision-analytic utility.
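The intercept-only adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the shift is chosen so that the mean recalibrated risk equals the observed event rate (with the slope fixed at 1 this is the maximum-likelihood condition for the intercept), ECE is computed with 10 equal-width bins, and the data are synthetic. All function names are hypothetical, and the sketch is not expected to reproduce the 26% reduction reported in this study.

```python
import math
import random


def logit(p, eps=1e-12):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Numerically stable inverse of the logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def shift_probabilities(p, a):
    """Intercept-only update: add a on the log-odds scale, keep the ranking."""
    return [sigmoid(logit(pi) + a) for pi in p]


def fit_intercept_shift(y, p, lo=-10.0, hi=10.0, iters=60):
    """Bisection for the shift whose mean recalibrated risk equals the event rate."""
    target = sum(y) / len(y)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_pred = sum(shift_probabilities(p, mid)) / len(p)
        if mean_pred > target:
            hi = mid  # still overestimating: shift further down
        else:
            lo = mid
    return 0.5 * (lo + hi)


def expected_calibration_error(y, p, n_bins=10):
    """ECE with equal-width bins: weighted mean |observed rate - mean prediction|."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of p, sum of y
    for yi, pi in zip(y, p):
        b = min(int(pi * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += pi
        bins[b][2] += yi
    n = len(y)
    return sum(abs(sy - sp) / n for count, sp, sy in bins if count)


# Synthetic illustration only: predictions overestimate risk by 0.7 log-odds.
rng = random.Random(0)
p_ext = [rng.uniform(0.02, 0.60) for _ in range(5000)]
y_ext = [1 if rng.random() < sigmoid(logit(pi) - 0.7) else 0 for pi in p_ext]

a_hat = fit_intercept_shift(y_ext, p_ext)
p_recal = shift_probabilities(p_ext, a_hat)
ece_before = expected_calibration_error(y_ext, p_ext)
ece_after = expected_calibration_error(y_ext, p_recal)
```

Because the shift is monotone on the log-odds scale, patient ranking and therefore AUROC are unchanged; only absolute risk moves. In practice the shift would be estimated on a calibration sample from the target site and evaluated on separate data, since fitting and scoring on the same admissions, as in this toy example, gives an optimistic ECE.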
Future work should prioritize prospective validation within clinical decision support systems, adaptive site-specific recalibration protocols, and evaluation in non-U.S. and lower-resource healthcare settings to characterize global transportability and prevent amplification of existing disparities in critical care outcomes.
Declarations
Acknowledgements: The authors acknowledge the contributors of MIMIC-IV (version 2.2) and the eICU Collaborative Research Database for making these valuable datasets publicly available for research. We also acknowledge PhysioNet for providing access to these resources and supporting reproducible research in critical care.
Ethical Considerations: This study used two publicly available, de-identified datasets: MIMIC-IV (version 2.2) and the eICU Collaborative Research Database. Both datasets have received prior institutional review board (IRB) approval, and all patient data were fully de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provisions. As the data are anonymized and publicly accessible, this study was exempt from additional ethical review and did not require informed consent. Access to the datasets was obtained following completion of the required data use agreements and credentialing procedures. All analyses were conducted in accordance with relevant data use policies and ethical guidelines.
Conflict of Interest: The authors declare they have no competing financial or non-financial interests that are directly or indirectly related to the work submitted for publication. This research was conducted independently; the affiliations listed are for identification purposes and do not imply institutional funding or endorsement of the results.
Funding Statement: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Consent to Participate: Not applicable
Consent for Publication: Not applicable
Clinical Trial Number: Not applicable
Data Availability Statement: The datasets analyzed in this study are publicly available. The MIMIC-IV (version 2.2) database and the eICU Collaborative Research Database can be accessed via PhysioNet (https://physionet.org/), subject to completion of the required credentialing, training, and data use agreements. Due to data use restrictions, the datasets cannot be redistributed by the authors. All code and analytical procedures used in this study are publicly available at: https://github.com/Krutarth007/icu-mortality-prediction-ml
Use of Generative AI: During the preparation of this manuscript, the authors used generative AI tools (Gemini, ChatGPT, Claude) to assist with language refinement and code debugging. The authors critically reviewed and edited all outputs and take full responsibility for the accuracy and integrity of the final work.
Author Contributions (CRediT Taxonomy): Krutarth Patel: Conceptualization (Lead); Methodology (Lead); Software (Lead); Formal Analysis (Lead); Investigation (Lead); Writing – Review & Editing (Lead); Project Administration (Lead). Phanindra Beedala: Validation (Lead); Data Curation (Lead); Writing – Original Draft (Lead); Methodology (Supporting); Software (Supporting).
References
1. Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA (2018) The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med 24(11):1716–1720. https://doi.org/10.1038/s41591-018-0213-5
2. Shickel B, Tighe PJ, Bihorac A, Rashidi P (2017) Deep EHR: a survey of recent advances in deep learning techniques for electronic health record analysis. J Biomed Inf 83:168–185. https://doi.org/10.1016/j.jbi.2017.04.001
3. Calvert J, Mao Q, Hoffman JL, Jay M, Desautels T, Mohamadlou H et al (2016) Using electronic health record collected clinical variables to predict medical intensive care unit mortality. Crit Care Med 44(2):e61–e67. https://doi.org/10.1097/CCM.0000000000001515
4. Rajkomar A, Dean J, Kohane I (2019) Machine learning in medicine. N Engl J Med 380(14):1347–1358. https://doi.org/10.1056/NEJMra1814259
5. Zhang Z, Ho KM, Hong Y (2019) Machine learning for the prediction of mortality in patients with sepsis: a systematic review. Ann Transl Med 7(24):832. https://doi.org/10.21037/atm.2019.11.50
6. Johnson AEW, Ghassemi MM, Nemati S, Niehaus KE, Clifton DA, Clifford GD (2016) Machine learning and decision support in critical care. Proc IEEE 104(2):444–466. https://doi.org/10.1109/JPROC.2015.2501978
7. Topol EJ (2019) High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25:44–56. https://doi.org/10.1038/s41591-018-0300-7
8. Purushotham S, Meng C, Che Z, Liu Y (2018) Benchmarking deep learning models on large healthcare datasets. J Biomed Inf 83:112–134. https://doi.org/10.1016/j.jbi.2018.04.007
9. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M et al (2023) MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10:1. https://doi.org/10.1038/s41597-022-01899-x
10. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O (2018) The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data 5:180178. https://doi.org/10.1038/sdata.2018.178
11. Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A (2019) Multitask learning and benchmarking with clinical time series data. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Dec 8–14; Vancouver, BC. Red Hook, NY: Curran Associates. Available from: https://proceedings.neurips.cc/paper_files/paper/2019/hash/4735450b461412351b12c3fef0bac8b0-Abstract.html
12. Desautels T, Calvert J, Hoffman J, Jay M, Kerem Y, Shieh L et al (2016) Prediction of early unplanned intensive care unit readmission using machine learning. Crit Care Med 44(4):e270–e278. https://doi.org/10.1097/CCM.0000000000001490
13. Steyerberg EW (2019) Clinical prediction models: a practical approach to development, validation, and updating, 2nd edn. Springer, New York. https://doi.org/10.1007/978-3-030-16399-0
14. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E et al (2020) Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. BMJ 369:m1328. https://doi.org/10.1136/bmj.m1328
15. Subbaswamy A, Saria S (2020) From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 21(2):345–352. https://doi.org/10.1093/biostatistics/kxz041
16. Roberts M, Driggs D, Thorpe M, Gilbey J, Yeung M, Ursprung S et al (2021) Common pitfalls and recommendations for using machine learning in healthcare. Nat Med 27:745–758. https://doi.org/10.1038/s41591-021-01223-2
17. Kapoor S, Narayanan A (2023) Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(8):100804. https://doi.org/10.1016/j.patter.2023.100804
18. Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW (2019) Calibration: the Achilles heel of predictive analytics. BMC Med 17(1):230. https://doi.org/10.1186/s12916-019-1466-7
19. Austin PC, Steyerberg EW (2019) The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med 38(21):4051–4065. https://doi.org/10.1002/sim.8281
20. Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324
21. Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning (ICML '06); Jun 25–29; Pittsburgh, PA. New York: ACM; pp. 233–240. https://doi.org/10.1145/1143844.1143874
22. Saito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3):e0118432. https://doi.org/10.1371/journal.pone.0118432
23. Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78(1):1–3. https://doi.org/10.1175/1520-0493(1950)078%3C0001:VOFEIT%3E2.0.CO;2
24. Niculescu-Mizil A, Caruana R (2005) Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning (ICML '05); Aug 7–11; Bonn, Germany. New York: ACM; pp. 625–632. https://doi.org/10.1145/1102351.1102430
25. Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02); Jul 23–26; Edmonton, Alberta, Canada. New York: ACM; pp. 694–699. https://doi.org/10.1145/775047.775151
26. Vickers AJ, Elkin EB (2006) Decision curve analysis: a novel method for evaluating prediction models. Med Decis Mak 26(6):565–574. https://doi.org/10.1177/0272989X06295361
27. Vickers AJ, Van Calster B, Steyerberg EW (2016) Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 352:i6. https://doi.org/10.1136/bmj.i6
28. Obermeyer Z, Powers B, Vogeli C, Mullainathan S (2019) Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464):447–453. https://doi.org/10.1126/science.aax2342
29. Chen IY, Szolovits P, Ghassemi M (2019) Can AI help reduce disparities in general medical and mental health care? AMA J Ethics 21(2):E167–E179. https://doi.org/10.1001/amajethics.2019.167
30. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Dec 4–9; Long Beach, CA. Red Hook, NY: Curran Associates. Available from: https://proceedings.neurips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
31. Collins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med 162(1):55–63. https://doi.org/10.7326/M14-0697
32. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS et al (2019) PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 170(1):51–58. https://doi.org/10.7326/M18-1376
33. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D (2019) Key challenges for delivering clinical impact with artificial intelligence. Lancet Digit Health 1(6):e312–e315. https://doi.org/10.1016/S2589-7500(19)30123-2
34. Pineau J, Vincent-Lamarre P, Sinha K, Larivière V, Beygelzimer A, d'Alché-Buc F et al (2021) Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). J Mach Learn Res 22(164):1–20. Available from: https://jmlr.org/papers/v22/20-1212.html
35. Van Calster B, Wynants L, Verbeek JFM, Verbakel JY, Christodoulou E, Vickers AJ et al (2018) Reporting and interpreting decision curve analysis: a guide for investigators. BMJ 362:k3483. https://doi.org/10.1136/bmj.k3483
Additional Declarations
The authors declare no competing interests.
Supplementary Files
SupplementaryFigureS1PermutationImportance.png
SupplementaryFigureS2ExternalROCAllModels.png
SupplementaryFigureS3ExternalPRCurves.png
SupplementaryFigureS4MissingnessHeatmap.png
SupplementaryTableS1SensitivityAnalyses.csv
SupplementaryTableS2SecondaryOutcomes.csv
SupplementaryTableS3MissingnessReport.csv
SupplementaryTableS4HyperparameterTuning.csv
SupplementaryTableS5FeatureSelection.csv
SupplementaryTableS6TRIPODPROBASTChecklist.csv
graphicalabstract.png
Post hoc Platt scaling was applied to external predictions using internal validation outputs [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]; a label-shift intercept-only adjustment was evaluated as a sensitivity analysis. Calibration was assessed graphically (loess-smoothed calibration curves) and quantitatively. Clinical utility was evaluated using decision curve analysis (DCA), quantifying net benefit across clinically plausible threshold probabilities relative to treat-all and treat-none strategies [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], with additional benchmark comparison against APACHE scores in a matched eICU subset.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Subgroup, Sensitivity, and Interpretability Analyses\u003c/h2\u003e \u003cp\u003eSubgroup analyses were performed across sex and race/ethnicity groups available in eICU. AUROC and calibration metrics were computed per subgroup with bootstrap confidence intervals; performance disparities were quantified as absolute between-group differences following established algorithmic fairness frameworks [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. 
Five pre-specified sensitivity analyses assessed robustness to: single-stay-per-patient restriction; restriction to ICU stays\u0026thinsp;\u0026ge;\u0026thinsp;48 hours; exclusion of laboratory variables; exclusion of arterial blood gas features; and exclusion of race/ethnicity variables.\u003c/p\u003e \u003cp\u003eModel interpretability was assessed using SHAP values (TreeExplainer) [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e] and permutation importance, with rank concordance evaluated by Spearman correlation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.6 Reproducibility and Reporting\u003c/h2\u003e \u003cp\u003eAll analyses were implemented in Python (random seed\u0026thinsp;=\u0026thinsp;42). The complete pipeline is publicly available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/Krutarth007/icu-mortality-prediction-ml\u003c/span\u003e\u003cspan address=\"https://github.com/Krutarth007/icu-mortality-prediction-ml\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. This study adheres to TRIPOD reporting guidelines [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] and incorporates PROBAST risk-of-bias assessment [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], consistent with standards for rigorous clinical machine learning research [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e"},{"header":"3. 
RESULTS","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Cohort Characteristics\u003c/h2\u003e \u003cp\u003eAfter applying eligibility criteria, the development cohort comprised 52,028 adult ICU stays from MIMIC-IV (training n\u0026thinsp;=\u0026thinsp;36,328; validation n\u0026thinsp;=\u0026thinsp;7,933; held-out test n\u0026thinsp;=\u0026thinsp;7,767) and the external validation cohort comprised 114,060 ICU stays from eICU. In-hospital mortality was 10.5% in MIMIC-IV and 8.7% in eICU. Baseline characteristics are presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. Standardised mean differences (SMDs) were \u0026lt;\u0026thinsp;0.10 for most variables, indicating broadly comparable distributions; the largest shifts were observed for 24-hour urine output (median 2,160 vs. 1,335 mL; SMD\u0026thinsp;=\u0026thinsp;0.431) and mean arterial pressure (77.6 vs. 81.7 mmHg; SMD\u0026thinsp;=\u0026thinsp;0.322), likely reflecting institutional differences in fluid management protocols and documentation practices.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBaseline demographic and clinical characteristics of the MIMIC-IV and eICU study cohorts\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eVariable\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIMIC-IV (n\u0026thinsp;=\u0026thinsp;52,028)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eeICU (n\u0026thinsp;=\u0026thinsp;114,060)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSMD\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003eDemographics\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAge, mean (SD), years\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e63.61 (16.56)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e63.69 (16.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.005\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFemale, %\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e43.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e46.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.046\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eVital signs\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHeart rate, mean (SD), bpm\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e85.00 (15.76)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e84.99 (16.38)\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.01\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMAP, mean (SD), mmHg\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e77.61 (12.20)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e81.72 (13.48)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.322\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRespiratory rate, mean (SD), breaths/min\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e19.15 (3.76)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e19.66 (4.74)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSpO₂, mean (SD), %\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e96.95 (1.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e96.85 (2.21)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.048\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLaboratory values\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBUN, median [IQR], mg/dL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e20.00 [13.00\u0026ndash;32.00]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e20.00 [13.00\u0026ndash;34.00]\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.037\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCreatinine, median [IQR], mg/dL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1.00 [0.70\u0026ndash;1.50]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.03 [0.76\u0026ndash;1.63]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.074\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLactate, median [IQR], mmol/L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2.10 [1.40\u0026ndash;3.10]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.80 [1.20\u0026ndash;3.10]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.183\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHemoglobin, median [IQR], g/dL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10.00 [8.50\u0026ndash;11.70]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10.70 [9.00\u0026ndash;12.40]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.157\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePlatelets, median [IQR], \u0026times;10\u0026sup3;/\u0026micro;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e177 [125\u0026ndash;240]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e185 [135\u0026ndash;243]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e0.065\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eClinical indicators and outcomes\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUrine output 24 h, median [IQR], mL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2160 [1300\u0026ndash;3250]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1335 [730\u0026ndash;2120]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.431\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHospital mortality, %\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e8.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.06\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eICU mortality, %\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5.55\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.047\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eProlonged ICU LOS\u0026thinsp;\u0026ge;\u0026thinsp;7 days, %\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e13.19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e0.072\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eValues are mean (SD), median [IQR], or percentage as appropriate. SMD\u0026thinsp;=\u0026thinsp;standardised mean difference; SMD\u0026thinsp;\u0026lt;\u0026thinsp;0.10 indicates negligible distributional difference. Race/ethnicity was not available in harmonised form for MIMIC-IV and is reported for eICU only in\u003c/em\u003e Section \u003cspan refid=\"Sec14\" class=\"InternalRef\"\u003e3.5\u003c/span\u003e. \u003cem\u003eMAP\u0026thinsp;=\u0026thinsp;mean arterial pressure; BUN\u0026thinsp;=\u0026thinsp;blood urea nitrogen; LOS\u0026thinsp;=\u0026thinsp;length of stay. The large urine output SMD (0.431) likely reflects institutional variation in fluid protocols and documentation practices rather than a fundamental cohort difference.\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Model Performance\u003c/h2\u003e \u003cp\u003eInternal and external performance metrics are summarised in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. On the held-out MIMIC-IV test set, the primary model (XGBoost\u0026thinsp;+\u0026thinsp;logistic recalibration) achieved AUROC 0.847 (95% CI 0.832\u0026ndash;0.860), AUPRC 0.441 (95% CI 0.402\u0026ndash;0.475), and Brier score 0.075 (95% CI 0.071\u0026ndash;0.079). In external validation on eICU, the model maintained strong discrimination (AUROC 0.819, 95% CI 0.815\u0026ndash;0.823; AUPRC 0.355; Brier 0.072), with an absolute AUROC reduction of 0.028 \u0026mdash; consistent with expected attenuation under cross-institutional distributional shift. 
ROC curves for all models on the internal test set are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDiscrimination, probabilistic accuracy, and external calibration characteristics of all prediction models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003eInternal validation (MIMIC-IV test, n\u0026thinsp;=\u0026thinsp;7,767)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c7\" namest=\"c5\"\u003e \u003cp\u003eExternal validation (eICU, n\u0026thinsp;=\u0026thinsp;114,060)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\" 
morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eExternal calibration slope / intercept\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUROC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAUPRC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBrier (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAUROC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eAUPRC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eBrier (95% CI)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.797 (0.782\u0026ndash;0.812)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.380 (0.345\u0026ndash;0.412)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.080 (0.076\u0026ndash;0.084)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.769 (0.764\u0026ndash;0.773)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.295 (0.286\u0026ndash;0.305)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.081 (0.080\u0026ndash;0.082)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c8\"\u003e \u003cp\u003e0.844 / \u0026minus;1.045\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom forest\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.835 (0.820\u0026ndash;0.847)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.398 (0.363\u0026ndash;0.432)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.089 (0.086\u0026ndash;0.092)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.807 (0.803\u0026ndash;0.811)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.329 (0.320\u0026ndash;0.339)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.097 (0.096\u0026ndash;0.097)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c8\"\u003e \u003cp\u003e1.436 / \u0026minus;1.067\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost (base)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.847 (0.832\u0026ndash;0.860)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.441 (0.402\u0026ndash;0.475)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.075 (0.071\u0026ndash;0.079)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.819 (0.815\u0026ndash;0.823)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.355 (0.346\u0026ndash;0.366)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.072 (0.071\u0026ndash;0.073)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c8\"\u003e \u003cp\u003e0.998 / \u0026minus;0.691\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eXGBoost\u0026thinsp;+\u0026thinsp;logistic recalibration (primary model)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.847 (0.832\u0026ndash;0.860)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.441 (0.402\u0026ndash;0.475)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.075 (0.071\u0026ndash;0.079)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.819 (0.815\u0026ndash;0.823)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.355 (0.346\u0026ndash;0.366)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.072 (0.071\u0026ndash;0.073)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e0.980 / \u0026minus;0.678\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eAUROC\u0026thinsp;=\u0026thinsp;area under the ROC curve; AUPRC\u0026thinsp;=\u0026thinsp;area under the precision\u0026ndash;recall curve; Brier\u0026thinsp;=\u0026thinsp;Brier score (lower\u0026thinsp;=\u0026thinsp;better). 
95% confidence intervals were derived using stratified bootstrap resampling (500 iterations for primary model-performance estimates and 1,000 iterations for paired comparisons).\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eA calibration slope close to 1.0 and an intercept close to 0 indicate good calibration; a negative intercept indicates systematic risk overestimation. Logistic recalibration preserves rank ordering; therefore AUROC/AUPRC are identical between XGBoost (base) and the recalibrated model.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eFig.\u0026nbsp;2. ROC curves for logistic regression (AUROC 0.797), random forest (0.835), XGBoost (0.847), and recalibrated XGBoost (0.847) on the internal test set. The XGBoost and recalibrated XGBoost curves are superimposed because recalibration preserves rank-based discrimination; the two models differ in probability estimates and calibration characteristics.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Calibration and Recalibration\u003c/h2\u003e \u003cp\u003eCalibration plots are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Internally, the primary model was near-ideally calibrated (slope 0.982, 95% CI 0.919\u0026ndash;1.046; intercept 0.001, 95% CI \u0026minus;0.141 to 0.144; ECE\u0026thinsp;=\u0026thinsp;0.010). Externally, the calibration slope remained near-ideal (0.980, 95% CI 0.964\u0026ndash;0.998), indicating that the spread of predicted risks was preserved across institutions. 
However, the calibration intercept was substantially negative (\u0026minus;0.678, 95% CI \u0026minus;0.712 to \u0026minus;0.649), indicating systematic overestimation of absolute mortality risk. This pattern is consistent with prevalence-driven label shift, given the 1.81-percentage-point lower event rate in eICU, rather than covariate shift. ECE increased approximately fivefold (from 0.010 internally to 0.053 externally). A post hoc intercept-only label-shift correction reduced ECE to 0.039 (intercept \u0026minus;0.501), a 26% relative improvement, suggesting that targeted recalibration without retraining can partially restore the reliability of probability estimates.\u003c/p\u003e \u003cp\u003e \u003cem\u003eFig.\u0026nbsp;3. Points\u0026thinsp;=\u0026thinsp;mean predicted probability vs observed event rate per decile; dashed orange diagonal\u0026thinsp;=\u0026thinsp;perfect calibration. Internal: slope 0.982, intercept 0.001. External: slope 0.980, intercept \u0026minus;0.678. The near-unit external slope indicates that the spread of predicted risks was preserved; the negative intercept reflects systematic absolute risk overestimation, consistent with the lower event rate in eICU than in MIMIC-IV.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Clinical Utility and APACHE Benchmark\u003c/h2\u003e \u003cp\u003eDecision curve analysis demonstrated positive net benefit over treat-all and treat-none strategies across threshold probabilities of approximately 2\u0026ndash;40% in the external cohort (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). In the matched eICU subset with available APACHE scores (n\u0026thinsp;=\u0026thinsp;98,788), the primary model outperformed APACHE in discrimination (AUROC 0.817 vs. 
0.795; DeLong p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and probabilistic accuracy (Brier 0.074 vs. 0.075). AUPRC was lower for the primary model (0.364 vs. 0.382), likely reflecting APACHE's weighting towards high-acuity patients. APACHE exhibited markedly poor absolute calibration (slope 0.591, intercept \u0026minus;1.159), indicating systematic risk overestimation. Full results are in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cem\u003eFig.\u0026nbsp;4. Dashed purple\u0026thinsp;=\u0026thinsp;treat-all; dotted brown\u0026thinsp;=\u0026thinsp;treat-none. All ML models exceeded treat-none across the full range and exceeded treat-all above ~\u0026thinsp;5%. The XGBoost models yielded the highest net benefit across the clinically relevant 2\u0026ndash;40% range. The base and recalibrated XGBoost curves coincide, indicating that recalibration did not materially alter net benefit in this analysis.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBenchmark comparison with APACHE and subgroup performance of the primary model (external eICU cohort)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv 
align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCategory\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGroup\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003en\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAUROC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAUPRC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eBrier\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eCal. slope / int.\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eAPACHE benchmark (matched subset, n\u0026thinsp;=\u0026thinsp;98,788)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAPACHE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e98,788\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.795\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.382\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.591 / \u0026minus;1.159\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eXGBoost\u0026thinsp;+\u0026thinsp;recalibration (primary model)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e98,788\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.817 (0.815\u0026ndash;0.819)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.364\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.074\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.975 / \u0026minus;0.651\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eSex (external eICU cohort)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e61,548\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.823 (0.818\u0026ndash;0.828)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.363\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.070\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e1.002 / \u0026minus;0.569\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e52,474\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.814 (0.808\u0026ndash;0.820)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.349\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.074\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.965 / \u0026minus;0.783\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003e\u003cb\u003eRace/ethnicity (external eICU cohort)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCaucasian\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e87,619\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.817 (0.812\u0026ndash;0.822)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.354\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.072\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.981 / \u0026minus;0.669\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAfrican American\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e13,170\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.824 (0.810\u0026ndash;0.834)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.355\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.067\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.975 / \u0026minus;0.721\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHispanic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e4,226\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.823 (0.802\u0026ndash;0.841)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.357\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.077\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.944 / \u0026minus;0.735\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAsian\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1,920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.840 (0.814\u0026ndash;0.867)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.424\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.073\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e1.135 / \u0026minus;0.400\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNative American\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e699\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.822 (0.764\u0026ndash;0.868)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.429\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.073\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e0.952 / \u0026minus;0.485\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOther/Unknown\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5,166\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.838 (0.820\u0026ndash;0.854)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.367\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.069\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c7\"\u003e \u003cp\u003e1.027 / \u0026minus;0.539\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eAPACHE comparison restricted to eICU patients with available APACHE hospital mortality predictions (n\u0026thinsp;=\u0026thinsp;98,788; event rate 9.0%). DeLong test p\u0026thinsp;\u0026lt;\u0026thinsp;0.001 for XGBoost vs APACHE AUROC. 95% confidence intervals were estimated using stratified bootstrap resampling, with iteration counts varying by analysis type (500 for primary model-performance estimates, 300 for subgroup analyses, and 1,000 for paired benchmark comparisons).\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eCal. int. = calibration intercept. Race/ethnicity reported for eICU only (not available in harmonised form for MIMIC-IV). Native American subgroup (n\u0026thinsp;=\u0026thinsp;699; 67 events) has correspondingly wider CIs. 
APACHE calibration slope of 0.591 indicates that predicted risks are too extreme relative to observed risk.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.5 Subgroup, Sensitivity, and Secondary Outcomes\u003c/h2\u003e \u003cp\u003eDiscriminative performance was consistent across sex (AUROC gap 0.009) and racial/ethnic groups (AUROC range 0.817\u0026ndash;0.840; maximum gap 0.023), with overlapping confidence intervals for most pairwise comparisons (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Calibration intercepts varied more substantially by subgroup (range\u0026thinsp;\u0026minus;\u0026thinsp;0.400 [Asian] to \u0026minus;\u0026thinsp;0.783 [Female]), indicating unevenly distributed absolute risk overestimation that may require subgroup-specific recalibration prior to deployment. Exclusion of race/ethnicity variables produced negligible change in discrimination (ΔAUROC\u0026thinsp;=\u0026thinsp;+\u0026thinsp;0.001).\u003c/p\u003e \u003cp\u003eAcross five sensitivity analyses (Supplementary Table S2), discrimination was broadly stable. A routine-predictor model using only 19 features (excluding arterial blood gas variables) achieved external AUROC 0.794 (ΔAUROC\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.025), supporting feasibility in resource-limited settings. Restricting to ICU stays\u0026thinsp;\u0026ge;\u0026thinsp;48 hours produced the largest attenuation (ΔAUROC\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.059), consistent with survivor selection bias. 
For ICU mortality, external AUROC was 0.836 (95% CI 0.830\u0026ndash;0.840); for prolonged LOS (\u0026ge;\u0026thinsp;7 days), external AUROC was 0.720 (95% CI 0.715\u0026ndash;0.725).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e3.6 Model Interpretability\u003c/h2\u003e \u003cp\u003eSHAP analysis identified clinically coherent predictors (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). The five highest-ranked features by mean |SHAP| were 24-hour urine output (0.310), age (0.293), maximum BUN (0.291), ventilation flag (0.218), and mean respiratory rate (0.201) \u0026mdash; all established markers of organ dysfunction and haemodynamic compromise. Permutation importance yielded consistent rankings (top three: urine output, lactate, age), confirming interpretability robustness. The race variable appeared in the top five by permutation importance (ΔAUROC\u0026thinsp;=\u0026thinsp;0.0015) but showed modest SHAP contribution, likely reflecting correlation with physiological predictors rather than independent signal. Permutation importance results are in \u003cb\u003eSupplementary Figure S1\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach point represents one ICU stay; x-axis\u0026thinsp;=\u0026thinsp;SHAP value (impact on log-odds of mortality); colour\u0026thinsp;=\u0026thinsp;feature value (red\u0026thinsp;=\u0026thinsp;high, blue\u0026thinsp;=\u0026thinsp;low). Features ranked by descending mean |SHAP|. Feature labels rendered with clinical nomenclature. SHAP values computed using TreeExplainer. Permutation importance is provided in Supplementary Figure S1.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. 
DISCUSSION","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Principal Findings\u003c/h2\u003e \u003cp\u003eThis study develops and externally validates a reproducible, calibration-aware machine learning framework for ICU hospital mortality prediction and demonstrates that external validation practices relying solely on discrimination metrics may systematically misrepresent model readiness for clinical deployment. Three principal findings emerge.\u003c/p\u003e \u003cp\u003eFirst, gradient boosting with logistic recalibration achieved transportable discrimination (internal AUROC 0.847, external AUROC 0.819; absolute reduction 0.028), consistent with prior benchmarking studies on MIMIC-derived data reporting AUROC values of 0.82\u0026ndash;0.86 using machine learning approaches [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], including deep learning and XGBoost-based methods [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], multitask recurrent architectures [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], and tree-based ensemble models across different critical care outcomes [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. The present results meaningfully extend prior work through systematic external validation on a fully independent multi-site cohort of 114,060 ICU admissions \u0026mdash; validation that remains inconsistently reported across the clinical prediction model literature [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eSecond, the primary model outperformed APACHE in discrimination (AUROC 0.817 vs. 0.795; DeLong p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and probabilistic accuracy (Brier 0.074 vs. 0.075) in the matched external subset. 
APACHE exhibited markedly poor absolute calibration (slope 0.591, intercept\u0026thinsp;\u0026minus;\u0026thinsp;1.159), reflecting a recognized limitation of conventional severity scoring in mixed-acuity ICU populations [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] \u0026mdash; reinforcing that well-validated machine learning can offer performance advantages over legacy scores provided calibration is explicitly evaluated prior to deployment [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThird, and most critically, discriminative generalizability did not imply calibration generalizability. Despite near-ideal internal calibration (slope 0.982, intercept 0.001), the external calibration intercept was substantially negative (\u0026minus;\u0026thinsp;0.678, 95% CI\u0026thinsp;\u0026minus;\u0026thinsp;0.712 to \u0026minus;\u0026thinsp;0.649), attributable to prevalence-driven label shift \u0026mdash; a form of distributional change entirely undetectable by AUROC-based validation. A simple post hoc intercept update, applied without retraining, reduced ECE by 26% (0.053 to 0.039), demonstrating that targeted recalibration can restore clinical reliability at new deployment sites with minimal infrastructure overhead.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Calibration\u0026ndash;Discrimination Dissociation and Clinical Utility\u003c/h2\u003e \u003cp\u003eThe dissociation between preserved discrimination and degraded calibration under cross-institutional shift is mechanistically attributable to the lower in-hospital mortality prevalence in eICU versus MIMIC-IV. 
When event-rate prevalence differs between development and deployment settings, predicted absolute probabilities diverge from local observed rates even when patient risk rankings are preserved \u0026mdash; consistent with emerging evidence that distributional shifts disproportionately affect probability estimates while leaving discriminative ranking relatively intact [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Most prior ICU mortality prediction studies report AUROC as the primary or sole metric and do not assess calibration slope and intercept under true external validation [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], leaving a critical gap in deployment readiness assessment [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. A model with AUROC 0.82 but a calibration intercept of \u0026minus;\u0026thinsp;0.68 may correctly rank the sickest patients while systematically overstating their absolute mortality risk, driving over-intervention near the decision threshold or producing misleading prognostic communications.\u003c/p\u003e \u003cp\u003eDecision curve analysis confirmed positive net benefit across threshold probabilities of approximately 2\u0026ndash;40% in the external cohort (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e) [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], encompassing the range most directly relevant to ICU triage, early intervention activation, and resource prioritization. 
This net benefit, however, is recoverable at new sites only after local intercept recalibration adjusts predicted probabilities to reflect site-specific mortality prevalence. Calibration verification and site-specific recalibration should therefore be treated as prerequisites to deployment rather than optional post hoc steps. The fully reproducible end-to-end pipeline \u0026mdash; encompassing harmonised cohort construction, strict temporal leakage controls, calibration-aware model selection, and public code release \u0026mdash; directly addresses the reproducibility concerns identified as structural weaknesses of clinical machine learning research [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Equity and Interpretability\u003c/h2\u003e \u003cp\u003eSubgroup analyses revealed that while discriminative performance was consistent across racial/ethnic groups (AUROC range 0.817\u0026ndash;0.840; maximum gap 0.023) and sex (gap 0.009), calibration intercepts varied substantially by subgroup (range\u0026thinsp;\u0026minus;\u0026thinsp;0.400 [Asian] to \u0026minus;\u0026thinsp;0.783 [Female]) [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. Absolute risk overestimation was unevenly distributed, indicating that a single global intercept adjustment may not restore equitable probability estimation across all subpopulations. Subgroup-specific calibration monitoring or stratified recalibration protocols should be considered prior to threshold-based deployment [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. 
Excluding race/ethnicity variables produced negligible discrimination change (ΔAUROC\u0026thinsp;=\u0026thinsp;+\u0026thinsp;0.001), supporting race-excluded model variants in settings where demographic variables may encode structural inequities rather than independent clinical risk.\u003c/p\u003e \u003cp\u003eSHAP values and permutation importance yielded strongly concordant feature rankings [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], with urine output, age, maximum BUN, ventilation flag, and respiratory rate identified as the five most influential predictors \u0026mdash; all established markers of organ dysfunction and haemodynamic compromise. Feature contributions reflect associative rather than causal relationships; clinical interpretation should be made accordingly [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Strengths, Limitations, and Future Directions\u003c/h2\u003e \u003cp\u003eKey strengths include: two large, diverse public critical care databases (n\u0026thinsp;=\u0026thinsp;166,088 combined); a comprehensive evaluation framework incorporating discrimination, calibration, decision curve analysis, APACHE benchmarking, fairness analysis, and five pre-specified sensitivity analyses; a fully reproducible pipeline with explicit leakage controls supporting transparent replication [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]; a resource-limited variant (19-predictor model, external AUROC 0.794) demonstrating feasibility in community hospital settings [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]; and public release of all code, model artifacts, and outputs consistent with reproducibility standards for trustworthy clinical machine learning [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. 
Decision curve analyses were conducted and reported in accordance with established interpretive guidelines [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eLimitations include: exclusive reliance on U.S. datasets, limiting international generalizability; exclusion of temperature features due to near-complete missingness in MIMIC-IV; use of median imputation rather than multiple imputation; subgroup analyses restricted to the sex and race/ethnicity variables available in eICU; and a retrospective design precluding conclusions about real-world clinical impact. Advanced domain adaptation strategies \u0026mdash; including transfer learning and Bayesian updating \u0026mdash; were not evaluated. Future work should prioritize prospective stepped-wedge validation within active clinical decision support systems, adaptive site-specific recalibration protocols, and evaluation in non-U.S. and lower-resource healthcare settings to characterize global transportability and broaden fairness assessment.\u003c/p\u003e \u003c/div\u003e"},{"header":"5. CONCLUSION","content":"\u003cp\u003eMachine learning models for ICU mortality prediction can achieve transportable discrimination, but this study demonstrates that transportable discrimination does not guarantee transportable clinical utility. The recalibrated XGBoost model maintained strong external discrimination (AUROC 0.819, 95% CI 0.815\u0026ndash;0.823) across 114,060 independent ICU admissions and outperformed APACHE in discrimination and probabilistic accuracy. Yet despite near-ideal internal calibration (slope 0.982, intercept 0.001), the external calibration intercept shifted substantially (\u0026minus;\u0026thinsp;0.678), reflecting systematic absolute risk overestimation attributable to prevalence-driven label shift \u0026mdash; a dissociation entirely undetectable by AUROC alone. 
A simple post hoc intercept-only adjustment reduced expected calibration error by 26% without retraining, establishing that deployment-ready calibration is achievable through pragmatically feasible recalibration strategies with direct relevance for health systems deploying predictive tools across institutional boundaries [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eCalibration must therefore be treated as a mandatory evaluation standard, not an optional reporting item [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. Equity analyses further demonstrate that calibration intercepts varied substantially across demographic subgroups (range\u0026thinsp;\u0026minus;\u0026thinsp;0.400 to \u0026minus;\u0026thinsp;0.783) despite consistent discrimination, indicating that global recalibration alone may not restore equitable probability estimation \u0026mdash; reinforcing the need for subgroup-specific calibration monitoring as a prerequisite to equitable deployment [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. 
The reproducible benchmarking framework presented here, developed in accordance with TRIPOD and PROBAST standards [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], a 19-feature resource-limited variant achieving external AUROC 0.794 [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e], publicly released code and model artifacts [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e], and DCA reporting per established guidelines [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e], together provide an immediately adoptable evaluation template for clinical prediction model research in critical care informatics.\u003c/p\u003e \u003cp\u003eRealizing the clinical promise of AI-assisted critical care requires validation that extends unconditionally beyond discrimination to encompass calibration, equity, and decision-analytic utility. Future work should prioritize prospective validation within clinical decision support systems, adaptive site-specific recalibration protocols, and evaluation in non-U.S. and lower-resource healthcare settings to characterize global transportability and prevent amplification of existing disparities in critical care outcomes.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e: The authors would like to acknowledge the contributors of the MIMIC-IV (version 2.2) and the eICU Collaborative Research Database for making these valuable datasets publicly available for research. 
We also acknowledge PhysioNet for providing access to these resources and supporting reproducible research in critical care.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical Considerations\u003c/strong\u003e: This study utilized publicly available, de-identified datasets, namely MIMIC-IV (version 2.2) and the eICU Collaborative Research Database. Both datasets have received prior institutional review board (IRB) approval, and all patient data were fully de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provisions.\u003c/p\u003e\n\u003cp\u003eAs the data are anonymized and publicly accessible, this study was exempt from additional ethical review and did not require informed consent. Access to the datasets was obtained following completion of the required data use agreements and credentialing procedures. All analyses were conducted in accordance with relevant data use policies and ethical guidelines.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of Interest:\u003c/strong\u003e The authors declare they have no competing financial or non-financial interests that are directly or indirectly related to the work submitted for publication. 
This research was conducted independently; the affiliations listed are for identification purposes and do not imply institutional funding or endorsement of the results.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding Statement:\u003c/strong\u003e This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent to Participate:\u003c/strong\u003e Not applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for Publication:\u003c/strong\u003e Not applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical trial number\u003c/strong\u003e: Not Applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability Statement:\u0026nbsp;\u003c/strong\u003eThe datasets analyzed in this study are publicly available. The MIMIC-IV (version 2.2) database and the eICU Collaborative Research Database can be accessed via PhysioNet (https://physionet.org/), subject to completion of the required credentialing, training, and data use agreements. Due to data use restrictions, the datasets cannot be redistributed by the authors. All code and analytical procedures used in this study are publicly available at: https://github.com/Krutarth007/icu-mortality-prediction-ml\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse of Generative AI:\u003c/strong\u003e During the preparation of this manuscript, the authors used generative AI tools (Gemini, ChatGPT, Claude) to assist with language refinement and code debugging. 
The authors critically reviewed and edited all outputs and take full responsibility for the accuracy and integrity of the final work.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions (CRediT Taxonomy):\u003c/strong\u003e\u003c/p\u003e\n\u003cul type=\"disc\"\u003e\n \u003cli\u003e\u003cstrong\u003eKrutarth Patel:\u003c/strong\u003e Conceptualization (Lead); Methodology (Lead); Software (Lead); Formal Analysis (Lead); Investigation (Lead); Writing \u0026ndash; Review \u0026amp; Editing (Lead); Project Administration (Lead).\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003ePhanindra Beedala:\u003c/strong\u003e Validation (Lead); Data Curation (Lead); Writing \u0026ndash; Original Draft (Lead); Methodology (Supporting); Software (Supporting).\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eKomorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA (2018) The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med 24(11):1716\u0026ndash;1720. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41591-018-0213-5\u003c/span\u003e\u003cspan address=\"10.1038/s41591-018-0213-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShickel B, Tighe PJ, Bihorac A, Rashidi P (2017) Deep EHR: a survey of recent advances in deep learning techniques for electronic health record analysis. J Biomed Inf 83:168\u0026ndash;185. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jbi.2017.04.001\u003c/span\u003e\u003cspan address=\"10.1016/j.jbi.2017.04.001\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCalvert J, Mao Q, Hoffman JL, Jay M, Desautels T, Mohamadlou H et al (2016) Using electronic health record collected clinical variables to predict medical intensive care unit mortality. Crit Care Med 44(2):e61\u0026ndash;e67. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1097/CCM.0000000000001515\u003c/span\u003e\u003cspan address=\"10.1097/CCM.0000000000001515\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRajkomar A, Dean J, Kohane I (2019) Machine learning in medicine. N Engl J Med 380(14):1347\u0026ndash;1358. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1056/NEJMra1814259\u003c/span\u003e\u003cspan address=\"10.1056/NEJMra1814259\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z, Ho KM, Hong Y (2019) Machine learning for the prediction of mortality in patients with sepsis: a systematic review. Ann Transl Med 7(24):832. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.21037/atm.2019.11.50\u003c/span\u003e\u003cspan address=\"10.21037/atm.2019.11.50\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJohnson AEW, Ghassemi MM, Nemati S, Niehaus KE, Clifton DA, Clifford GD (2016) Machine learning and decision support in critical care. Proc IEEE 104(2):444\u0026ndash;466. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/JPROC.2015.2501978\u003c/span\u003e\u003cspan address=\"10.1109/JPROC.2015.2501978\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTopol EJ (2019) High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25:44\u0026ndash;56. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41591-018-0300-7\u003c/span\u003e\u003cspan address=\"10.1038/s41591-018-0300-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePurushotham S, Meng C, Che Z, Liu Y (2018) Benchmarking deep learning models on large healthcare datasets. J Biomed Inf 83:112\u0026ndash;134. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jbi.2018.04.007\u003c/span\u003e\u003cspan address=\"10.1016/j.jbi.2018.04.007\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJohnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M et al (2023) MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10:1. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41597-022-01899-x\u003c/span\u003e\u003cspan address=\"10.1038/s41597-022-01899-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O (2018) The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data 5:180178. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/sdata.2018.178\u003c/span\u003e\u003cspan address=\"10.1038/sdata.2018.178\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Dec 8\u0026ndash;14;, Vancouver BC (2019) Red Hook, NY: Curran Associates; 2019. Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://proceedings.neurips.cc/paper_files/paper/2019/hash/4735450b461412351b12c3fef0bac8b0-Abstract.html\u003c/span\u003e\u003cspan address=\"https://proceedings.neurips.cc/paper_files/paper/2019/hash/4735450b461412351b12c3fef0bac8b0-Abstract.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDesautels T, Calvert J, Hoffman J, Jay M, Kerem Y, Shieh L et al (2016) Prediction of early unplanned intensive care unit readmission using machine learning. Crit Care Med 44(4):e270\u0026ndash;e278. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1097/CCM.0000000000001490\u003c/span\u003e\u003cspan address=\"10.1097/CCM.0000000000001490\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSteyerberg EW (2019) Clinical prediction models: a practical approach to development, validation, and updating, 2nd edn. Springer, New York. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/978-3-030-16399-0\u003c/span\u003e\u003cspan address=\"10.1007/978-3-030-16399-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E et al (2020) Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. BMJ 369:m1328. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmj.m1328\u003c/span\u003e\u003cspan address=\"10.1136/bmj.m1328\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSubbaswamy A, Saria S (2020) From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 21(2):345\u0026ndash;352. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/biostatistics/kxz041\u003c/span\u003e\u003cspan address=\"10.1093/biostatistics/kxz041\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoberts M, Driggs D, Thorpe M, Gilbey J, Yeung M, Ursprung S et al (2021) Common pitfalls and recommendations for using machine learning in healthcare. Nat Med 27:745\u0026ndash;758. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41591-021-01223-2\u003c/span\u003e\u003cspan address=\"10.1038/s41591-021-01223-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKapoor S, Narayanan A (2023) Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(8):100804. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.patter.2023.100804\u003c/span\u003e\u003cspan address=\"10.1016/j.patter.2023.100804\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW (2019) Calibration: the Achilles heel of predictive analytics. BMC Med 17(1):230. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s12916-019-1466-7\u003c/span\u003e\u003cspan address=\"10.1186/s12916-019-1466-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAustin PC, Steyerberg EW (2019) The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med 38(21):4051\u0026ndash;4065. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/sim.8281\u003c/span\u003e\u003cspan address=\"10.1002/sim.8281\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreiman L (2001) Random forests. Mach Learn 45(1):5\u0026ndash;32. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1023/A:1010933404324\u003c/span\u003e\u003cspan address=\"10.1023/A:1010933404324\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDavis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning (ICML '06); 2006 Jun 25\u0026ndash;29; Pittsburgh, PA. New York: ACM; pp. 233\u0026ndash;240. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/1143844.1143874\u003c/span\u003e\u003cspan address=\"10.1145/1143844.1143874\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3):e0118432. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pone.0118432\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0118432\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78(1):1\u0026ndash;3. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1175/1520-0493(1950)078%3C0001:VOFEIT%3E2.0.CO;2\u003c/span\u003e\u003cspan address=\"10.1175/1520-0493\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNiculescu-Mizil A, Caruana R (2005) Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning (ICML '05); 2005 Aug 7\u0026ndash;11; Bonn, Germany. New York: ACM; pp. 625\u0026ndash;632. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/1102351.1102430\u003c/span\u003e\u003cspan address=\"10.1145/1102351.1102430\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. 
In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02); Jul 23\u0026ndash;26; Edmonton, Alberta, Canada. New York: ACM; 2002. pp. 694\u0026ndash;699. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/775047.775151\u003c/span\u003e\u003cspan address=\"10.1145/775047.775151\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVickers AJ, Elkin EB (2006) Decision curve analysis: a novel method for evaluating prediction models. Med Decis Mak 26(6):565\u0026ndash;574. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/0272989X06295361\u003c/span\u003e\u003cspan address=\"10.1177/0272989X06295361\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVickers AJ, Van Calster B, Steyerberg EW (2016) Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 352:i6. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmj.i6\u003c/span\u003e\u003cspan address=\"10.1136/bmj.i6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eObermeyer Z, Powers B, Vogeli C, Mullainathan S (2019) Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464):447\u0026ndash;453. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1126/science.aax2342\u003c/span\u003e\u003cspan address=\"10.1126/science.aax2342\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen IY, Szolovits P, Ghassemi M (2019) Can AI help reduce disparities in general medical and mental health care? AMA J Ethics 21(2):E167\u0026ndash;E179. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1001/amajethics.2019.167\u003c/span\u003e\u003cspan address=\"10.1001/amajethics.2019.167\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Dec 4\u0026ndash;9; Long Beach, CA. Red Hook, NY: Curran Associates; 2017. Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://proceedings.neurips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html\u003c/span\u003e\u003cspan address=\"https://proceedings.neurips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCollins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med 162(1):55\u0026ndash;63. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.7326/M14-0697\u003c/span\u003e\u003cspan address=\"10.7326/M14-0697\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS et al (2019) PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 170(1):51\u0026ndash;58. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.7326/M18-1376\u003c/span\u003e\u003cspan address=\"10.7326/M18-1376\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D (2019) Key challenges for delivering clinical impact with artificial intelligence. Lancet Digit Health 1(6):e312\u0026ndash;e315. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/S2589-7500(19)30123-2\u003c/span\u003e\u003cspan address=\"10.1016/S2589-7500(19)30123-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePineau J, Vincent-Lamarre P, Sinha K, Larivi\u0026egrave;re V, Beygelzimer A, d'Alch\u0026eacute;-Buc F et al (2021) Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). \u003cem\u003eJ Mach Learn Res\u003c/em\u003e 22(164):1\u0026ndash;20. 
Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://jmlr.org/papers/v22/20-1212.html\u003c/span\u003e\u003cspan address=\"https://jmlr.org/papers/v22/20-1212.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Calster B, Wynants L, Verbeek JFM, Verbakel JY, Christodoulou E, Vickers AJ et al (2018) Reporting and interpreting decision curve analysis: a guide for investigators. BMJ 362:k3483. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmj.k3483\u003c/span\u003e\u003cspan address=\"10.1136/bmj.k3483\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Independent Researcher","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"ICU mortality prediction, Model calibration, External validation, Dataset shift, Clinical decision support, Probability calibration","lastPublishedDoi":"10.21203/rs.3.rs-9602675/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9602675/v1","license":{"name":"CC BY 
4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eMachine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment \u0026mdash; a gap undermining clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is prerequisite for threshold-based decisions yet remains underreported.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe conducted a retrospective cohort study using MIMIC-IV (v2.2; n\u0026thinsp;=\u0026thinsp;52,028 ICU stays) for model development and eICU (n\u0026thinsp;=\u0026thinsp;114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via receiver operating characteristic area (AUROC) and precision-recall area (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post-hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832\u0026ndash;0.860) and external AUROC 0.819 (95% CI: 0.815\u0026ndash;0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept\u0026thinsp;\u0026minus;\u0026thinsp;0.678) attributable to prevalence-driven label shift. 
An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.\u003c/p\u003e","manuscriptTitle":"Calibration Drift Under Cross-Institutional Deployment: An External Validation Framework for ICU Mortality Prediction Across MIMIC-IV and eICU","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-05 04:01:50","doi":"10.21203/rs.3.rs-9602675/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"b94270f3-64a1-44d0-8b53-95f3c90e4546","owner":[],"postedDate":"May 5th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":67446757,"name":"Critical Care \u0026 Emergency Medicine"},{"id":67446758,"name":"Medical Informatics"},{"id":67446759,"name":"Artificial Intelligence and Machine Learning"},{"id":67446760,"name":"Biostatistics"},{"id":67446761,"name":"Epidemiology"}],"tags":[],"updatedAt":"2026-05-05T04:01:50+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-05 
04:01:50","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9602675","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9602675","identity":"rs-9602675","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

License: CC-BY-4.0