Clinicians vs. Artificial Intelligence in Patient Outcome Prediction in the Intensive Care Unit

Corin Kuang, Camilo E. Valderrama, Henry T. Stelfox, Colin B. Josephson, and Joon Lee

This is a Research Square preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8745082/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Revision. Version 1.

Abstract

IMPORTANCE: Accurate prediction of patient outcomes in the intensive care unit (ICU) is critical for clinical decision-making. While artificial intelligence (AI) has shown potential in retrospective prediction, it remains unclear how AI compares with human clinicians, particularly in prospective real-world settings.

OBJECTIVE: To compare the performance of human clinicians and AI in predicting ICU patient outcomes both retrospectively and prospectively.
DESIGN: A mixed retrospective and prospective study comparing clinician and AI performance in ICU patient outcome prediction.

SETTING: Fifteen adult ICUs in Alberta, Canada.

PARTICIPANTS: Retrospective analysis included 990 ICU admissions randomly selected between February 2012 and December 2019, the patient outcomes of which were collectively predicted by 7 clinicians. Prospective analysis involved 238 ICU admissions from 215 adult patients between September 2020 and December 2022, with a total of 75 clinicians making at least one prediction each.

EXPOSURES: Retrospective clinician predictions were made based on patient data. Prospective clinician predictions were collected during active patient care. AI models were trained on retrospective data from 46,631 ICU admissions of 41,096 unique patients to predict the outcomes of the same patients the clinicians predicted in the retrospective and prospective settings.

MAIN OUTCOMES AND MEASURES: Primary patient outcomes were in-hospital mortality, 30-day post-discharge mortality, ICU length of stay (LOS), and hospital LOS. Secondary outcomes were the occurrence of delirium and acute kidney injury during the ICU stay. Classification and regression performance metrics of AI and clinicians were compared using Wilcoxon rank-sum tests. Inter-rater agreement among clinicians and between clinicians and AI was analyzed with Cohen’s kappa and the intraclass correlation coefficient.

RESULTS: In the retrospective setting, AI generally outperformed clinicians, but aggregated predictions from seven clinicians outperformed AI. In the prospective setting, subspecialized physicians generally outperformed AI, whereas physicians in training and nurses generally underperformed AI. Both clinicians and AI performed poorly in LOS prediction. Inter-rater agreement was poor to fair both among clinicians and between clinicians and AI.
CONCLUSIONS AND RELEVANCE: This study provides a comprehensive evaluation of clinician and AI performances in ICU outcome prediction under both retrospective and real-world prospective conditions, setting important prediction performance benchmarks.

Health sciences/Diseases; Health sciences/Health care; Health sciences/Medical research; Health sciences/Risk factors

INTRODUCTION

The intensive care unit (ICU) is a dynamic environment where clinicians make frequent, high-stakes decisions for critically ill patients. 1 Accurate prediction of patient outcomes such as mortality and length of stay (LOS) can support timely interventions, efficient resource allocation, and improved patient care. Although traditional scoring systems such as APACHE 2 and SAPS 3 are widely used, they offer limited predictive accuracy at the individual level. 4

The growing volume and complexity of ICU data exceed human cognitive limits, leading to information overload and potentially suboptimal decisions. 5 Machine learning-based artificial intelligence (AI) offers powerful tools to analyze these data, identify patterns, and generate accurate predictions. Numerous studies have demonstrated strong AI performance in predicting outcomes such as sepsis, delirium, and hospital LOS. 6 However, few studies have directly compared AI and clinician performances in ICU patient outcome prediction, and most evaluations have been limited to retrospective settings. As a result, it remains unclear whether AI prediction performance holds up in prospective, real-world clinical settings or how it compares with clinician prediction performance. 7

To address this gap, we compared the predictive performances of clinicians and AI for several ICU patient outcomes using both retrospective and prospective designs.
METHODS

Overall Study Design and Setting

Figure 1 shows the overall retrospective and prospective settings designed to compare the patient outcome prediction performances of ICU clinicians and AI. The retrospective and prospective comparisons involved 990 and 238 ICU admissions, respectively. AI models trained on 46,631 ICU admissions provided AI predictions in both settings.

Predicted Patient Outcomes

In both retrospective and prospective settings, the primary patient outcomes were in-hospital mortality, 30-day post-discharge mortality, ICU LOS, and hospital LOS. Secondary outcomes were delirium and acute kidney injury (AKI) during the ICU stay. To determine AKI outcomes, patients were required to have at least two serum creatinine measurements to allow assessment according to the Kidney Disease: Improving Global Outcomes (KDIGO) criteria, 8 defined as an increase in serum creatinine of ≥ 0.3 mg/dL (≥ 26.5 µmol/L) within 48 hours or an increase of ≥ 1.5 times the baseline value over the previous 7 days. Delirium outcomes were determined using the Intensive Care Delirium Screening Checklist (ICDSC), 9 with scores ≥ 4 indicating delirium.

AI Model Development

The AI models used in both the retrospective and prospective settings were trained by applying machine learning to the ABeICU database. 10,11 ABeICU includes 55,689 ICU admissions from 48,672 de-identified adult patients across 15 ICUs in Alberta, Canada, including tertiary-care and community ICUs caring for medical, surgical, trauma, burn, neuroscience, and vascular surgery patients, between February 2012 and December 2019. It integrates eCritical TRACER, 12 an electronic medical record (EMR)-based ICU repository containing demographics, vital signs, laboratory results, acuity scores, clinical assessments, and intervention data, linked with the Vital Statistics database from Alberta Health for out-of-hospital mortality data.
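The AKI and delirium outcome definitions given under Predicted Patient Outcomes can be illustrated with a minimal sketch. This is not the authors' labeling code: the function names, the (hours, µmol/L) input layout, and the choice to pass the baseline creatinine in as a parameter are assumptions made for illustration, and the text does not say how the baseline value was derived.

```python
from typing import List, Optional, Tuple

ABS_RISE_UMOL_L = 26.5      # absolute rise threshold (>= 0.3 mg/dL)
ABS_WINDOW_HOURS = 48.0     # window for the absolute rise
REL_RISE_FACTOR = 1.5       # relative rise threshold versus baseline
ICDSC_DELIRIUM_CUTOFF = 4   # ICDSC score >= 4 indicates delirium


def meets_aki_criteria(
    creatinine: List[Tuple[float, float]],
    baseline_umol_l: Optional[float] = None,
) -> bool:
    """Check the two creatinine-based criteria quoted in the text.

    creatinine: (hours since ICU admission, serum creatinine in umol/L).
    baseline_umol_l: hypothetical caller-supplied baseline, assumed to
    reflect the previous 7 days; skipped if not provided.
    """
    if len(creatinine) < 2:
        raise ValueError("at least two serum creatinine measurements are required")
    ordered = sorted(creatinine)
    # Criterion 1: rise of >= 26.5 umol/L between two measurements <= 48 h apart.
    for i, (t_early, c_early) in enumerate(ordered):
        for t_late, c_late in ordered[i + 1:]:
            within_window = t_late - t_early <= ABS_WINDOW_HOURS
            if within_window and c_late - c_early >= ABS_RISE_UMOL_L:
                return True
    # Criterion 2: any value >= 1.5 times the baseline.
    if baseline_umol_l is not None:
        return any(value >= REL_RISE_FACTOR * baseline_umol_l for _, value in ordered)
    return False


def has_delirium(icdsc_scores: List[int]) -> bool:
    """Delirium is flagged when any ICDSC score reaches the cutoff of 4."""
    return any(score >= ICDSC_DELIRIUM_CUTOFF for score in icdsc_scores)
```

Passing the baseline explicitly keeps the sketch neutral about how a baseline would be chosen in practice, which is a study-specific decision not described here.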
The inclusion criteria applied to ABeICU were adult patients (18 years or older) with an ICU LOS longer than 24 hours. Also, 1,000 patients from December 2018 to June 2019 were reserved for retrospective comparisons, although data linkage issues led to 990 patients being used. For each outcome, patients with a missing target label were excluded. For AKI and delirium predictions, patients who had already developed AKI or delirium, respectively, within the first 24 hours of ICU admission were excluded. The final model development cohort diagram for each predicted outcome is shown in eFigure 1.

Each model was developed to use patient data from the first 24 hours in the ICU to predict a given outcome at 24 hours after ICU admission. Five models were trained for each classification outcome (in-hospital mortality, 30-day post-discharge mortality, AKI, and delirium): logistic regression, support vector machine, XGBoost, random forest, and neural network. In addition, five regression models were trained to predict ICU LOS and hospital LOS: elastic net, support vector machine, XGBoost, random forest, and neural network. Classification model performance was evaluated using sensitivity, specificity, accuracy, balanced accuracy, F1 score, the area under the precision-recall curve (AUPRC), and the area under the receiver operating characteristic curve (AUROC). Regression performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²).

All model training followed a standardized pipeline shown in eFigure 2. Hyperparameter tuning (Bayesian optimization in Optuna) and best-model selection optimized the median AUROC for classification and the median RMSE for regression in 5-fold cross-validation. The hyperparameter space of each model is shown in eTables 1 and 2. Classification model output thresholds were optimized for AUPRC in 5-fold cross-validation.
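For readers less familiar with the performance metrics listed above, the following sketch computes the threshold-based classification metrics and the three regression metrics from their standard definitions. It uses only the Python standard library and is not taken from the study's code; the function names are illustrative, and AUROC and AUPRC are omitted because they require continuous model scores rather than dichotomized predictions.

```python
import math
from typing import Dict, Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """MAE, RMSE, and the coefficient of determination (R squared)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - _ratio(ss_res, ss_tot),
    }
```

Note that R² becomes negative whenever the predictions have a larger squared error than always predicting the mean of the observed values, which is how negative R² values for LOS prediction should be read.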
AI modeling utilized scikit-learn, XGBoost, and PyTorch in Python (version 3.9.19).

Retrospective Clinician Predictions

Three ICU physicians (two intensivists and one fellow) and four ICU nurses from the Foothills Medical Centre (FMC) in Calgary, Alberta, Canada were recruited to provide retrospective outcome predictions for the 990 ABeICU test set patients. Predictions were made based on de-identified patient data from the first 24 hours after ICU admission in secure REDCap web surveys. Patient data were presented in both tabular and visual formats, with selected variables (e.g., vital signs) displayed as time series plots to facilitate interpretation. An example REDCap survey is provided in Supplement 2.

Fifty of the 990 patient cases were randomly selected to be predicted by all seven clinicians. The remaining patients were evenly divided into five prediction sets, and each clinician was assigned to one set, containing approximately 190 patients. Each patient received predictions from one, two, or all seven clinicians. For analysis, the 990 patients were categorized into three groups based on the number of clinician predictions received (eTable 3). Clinicians were given three weeks to predict their assigned cases. To minimize fatigue, clinicians were encouraged to distribute the workload across multiple sessions.

Prospective Clinician Predictions

In the prospective setting, we collected bedside predictions from clinicians at the FMC ICU, a 28-bed tertiary unit providing multidisciplinary care for medical, surgical, trauma, burn, vascular, and neuroscience patients. We included adult patients who had been in the ICU for 16–24 hours and asked clinicians to predict the primary and secondary outcomes for the patients they were caring for. Clinicians were encouraged to use all information available in routine clinical practice.
Clinician predictions were collected with a paper-based survey similar to the one used in the retrospective setting (Supplement 3), capturing clinician type, bed number, admission timestamps, and patient identifier to enable later data linkage for AI prediction and outcome determination. In total, 465 predictions from 75 clinicians were collected for 287 patients admitted to the FMC ICU from September 1, 2020 to December 31, 2022, with each patient predicted by 1 to 4 clinicians. For analysis, clinicians were grouped based on their type: subspecialized physicians (attending physicians and fellows), physicians in training (residents and interns), and nurses (eTable 4).

Retrospective and Prospective AI Predictions

For both the retrospective and prospective settings, the same best AI models predicted all primary and secondary outcomes for the patients that clinicians predicted, based on patient data from the first 24 hours in the ICU. For the prospective setting, a data set comparable to ABeICU was extracted for the included patients to provide features (i.e., model input variables) for AI prediction and ground truth outcome labels for prediction performance evaluation. For each outcome, we excluded cases where data linkage was unsuccessful, the outcome could not be determined, or the outcome had already occurred during the first 24 hours in the ICU, leaving 238 ICU admissions from 215 patients for analysis (see eFigure 3 for a cohort diagram).

Statistical Analyses

In both the retrospective and prospective settings, two-sample Wilcoxon rank-sum tests were used to compare the prediction performance metrics between the clinicians and AI based on 1,000 bootstrapped samples of the test data. These comparisons were stratified by the number of clinicians providing predictions and clinician type in the retrospective and prospective comparisons, respectively.
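The statistical steps in this section (bootstrapping a performance metric, dichotomizing averaged Likert responses, and computing Cohen's κ) can be sketched in plain Python as follows. Several details are assumptions rather than the authors' method: the numeric coding of the Likert labels (only "most likely", "likely", and "equally likely" are named in the text, so the two lower labels are hypothetical), the rule that an averaged score at or above 4 counts as positive, and the percentile-style summary of the bootstrap distribution.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical coding; the two lowest labels are placeholders not named in the text.
LIKERT_SCORE = {
    "most unlikely": 1, "unlikely": 2, "equally likely": 3,
    "likely": 4, "most likely": 5,
}


def dichotomize(responses: Sequence[str], threshold: float = 4.0) -> int:
    """Average the Likert scores for one patient, then dichotomize.

    threshold=4.0 mirrors the primary analysis (top two options positive);
    threshold=3.0 mirrors the sensitivity analysis (middle option included).
    """
    mean_score = statistics.mean(LIKERT_SCORE[r] for r in responses)
    return int(mean_score >= threshold)


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (observed - expected) / (1 - expected)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_metric(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    n_boot: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Recompute a metric on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return values


def summarize(values: Sequence[float]) -> Tuple[float, float, float]:
    """Median with an approximate 2.5th to 97.5th percentile interval."""
    ordered = sorted(values)
    lower = ordered[int(0.025 * (len(ordered) - 1))]
    upper = ordered[int(0.975 * (len(ordered) - 1))]
    return statistics.median(ordered), lower, upper
```

Under these assumptions, running `bootstrap_metric` once with clinician predictions and once with AI predictions yields two lists of 1,000 metric values, which could then be compared with a rank-sum test such as `scipy.stats.ranksums`.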
Since the clinician predictions for the classification outcomes were recorded using a 5-point Likert scale, the top two options (i.e., “most likely” and “likely”) were combined to represent a positive prediction. When two or more clinicians predicted the same patient, their Likert-scale responses were averaged first and then dichotomized. Furthermore, the prediction-level inter-rater agreement between clinicians and AI was assessed using Cohen’s κ and the intraclass correlation coefficient (ICC) for the classification and regression predictions, respectively. The inter-rater agreement between clinicians when multiple clinicians predicted the same patient, in both the retrospective and prospective settings, was also assessed using the same methods. All analyses were performed in Python (version 3.9.19).

Sensitivity Analysis

A sensitivity analysis on the Likert scale threshold was conducted by grouping “most likely”, “likely”, and “equally likely” to form a positive prediction. Prediction performance comparisons between clinicians and AI were repeated with this new threshold for the classification outcomes.

Ethics Approval

This study was approved by the University of Calgary Conjoint Health Research Ethics Board (REB19-2107) and conducted in accordance with the Declaration of Helsinki. All research participants provided informed consent prior to participating in the study.

RESULTS

Patient Cohorts

eTable 5 summarizes the patient characteristics of the three cohorts used in this study for AI model training, retrospective comparisons, and prospective comparisons. While baseline characteristics were similar between the model training and retrospective cohorts, the prospective cohort had higher in-hospital mortality and delirium rates, as well as a longer mean hospital LOS. The severity of illness scores also indicate that the prospective cohort was more ill in general.

Best AI Model Selection

eTable 6 lists the features selected for at least one AI model across all outcomes.
eTables 7 and 8 report the 5-fold cross-validation prediction performances of all AI models for the classification and regression outcomes, respectively. For all outcomes, XGBoost was the best model based on median AUROC and RMSE for classification and regression, respectively. The performances of the best XGBoost models are comparable to previously reported performances on ABeICU. 11

Retrospective Clinician vs. AI Comparisons

Figure 2 and eTable 9 compare the retrospective prediction performances of clinicians and AI for the four classification outcomes. Figure 3 and eTable 10 show the corresponding results for the two regression outcomes.

For the classification outcomes, AI generally outperformed clinicians in the one- and two-clinician groups in terms of most performance metrics, with some exceptions such as specificity in the two-clinician group for in-hospital mortality (median: 0.925, 95% confidence interval [CI]: [0.895–0.953] vs. median: 0.896, 95% CI: [0.861–0.926]; p < 0.001) or sensitivity in the one-clinician group for AKI (median: 0.652, 95% CI: [0.564–0.738] vs. median: 0.551, 95% CI: [0.464–0.650]; p < 0.001). However, in the seven-clinician group, clinicians tended to outperform AI for all outcomes except delirium; AI outperformed clinicians by the largest margin in delirium prediction. Both clinicians and AI were more specific than sensitive for the two mortality outcomes and AKI, which exhibited low event rates. Specificity and sensitivity were more balanced for delirium for both clinicians and AI.

For the LOS outcomes, AI outperformed clinicians in general except MAE in the two-clinician group for ICU LOS (median: 4.436, 95% CI: [4.109–4.786] vs. median: 4.534, 95% CI: [4.275–4.800]; p < 0.001) and MAE in the two-clinician group for hospital LOS (median: 16.73, 95% CI: [15.46–18.08] vs. median: 17.14, 95% CI: [16.18–18.08]; p < 0.001). Both clinicians and AI exhibited poor R² scores for both LOS outcomes.
Unlike the classification outcomes, clinicians did not outperform AI in the seven-clinician group.

Prospective Clinician vs. AI Comparisons

Figure 4 and eTable 11 compare the prospective prediction performances of clinicians and AI for the classification outcomes. Figure 5 and eTable 12 show the corresponding results for the regression outcomes.

For the classification outcomes, subspecialized physicians generally outperformed AI across all outcomes except delirium, with some exceptions such as sensitivity for in-hospital mortality (median: 0.625, 95% CI: [0.384–0.857] vs. median: 0.700, 95% CI: [0.444–0.923]; p < 0.001). For delirium, there was no significant difference between subspecialized physicians and AI on any metric. Physicians in training and nurses tended to perform worse than subspecialized physicians but still outperformed AI for 30-day mortality with respect to most performance metrics. AI generally outperformed trainees and nurses in AKI and delirium predictions, with some exceptions such as specificity in nurses for delirium (median: 0.681, 95% CI: [0.500–0.850] vs. median: 0.640, 95% CI: [0.444–0.826]; p < 0.001).

For both LOS predictions, all clinician types outperformed AI in terms of MAE except for subspecialized physicians for hospital LOS (median: 42.646, 95% CI: [28.483–61.359] vs. median: 37.798, 95% CI: [25.583–54.494]; p < 0.001). AI outperformed clinicians with respect to RMSE and R² for both LOS outcomes except for RMSE (median: 9.295, 95% CI: [5.087–13.839] vs. median: 9.641, 95% CI: [6.108–14.277]; p < 0.001) and R² (median: 0.022, 95% CI: [-0.520–0.317] vs. median: -0.056, 95% CI: [-0.551–0.061]; p < 0.001) in trainees for ICU LOS.

Inter-rater Agreement Analyses

eTables 13–14 and 15–16 present the inter-rater agreements between clinicians in the retrospective and prospective settings, respectively.
In general, agreement was poor in the retrospective setting, with fair agreement observed in AKI prediction. Agreement was also poor in the prospective setting, but there was moderate agreement in the 2-clinician group for in-hospital mortality and 30-day mortality, in the 2- and 4-clinician groups for AKI, and in the 2-clinician group for hospital LOS. eTables 17–18 and 19–20 show the inter-rater agreements between clinicians and AI in the retrospective and prospective settings, respectively. Overall, the retrospective predictions in the 7-clinician group, except for AKI, showed moderate agreement, but all other results showed poor to fair agreement.

Sensitivity Analysis

eTables 21–22 report the sensitivity analysis results from adjusting the Likert scale threshold in the retrospective and prospective settings, respectively. As expected from the inherent tradeoffs between performance metrics, grouping the middle Likert scale point into the positive class improved sensitivity and NPV, as well as F1 and balanced accuracy to a lesser extent, at the expense of specificity and PPV. Overall, the 7-clinician group and subspecialized physicians outperformed AI to a lesser extent. The Likert threshold used in the primary analysis seemed to be a better choice for clinician predictions.

DISCUSSION

Principal Findings

The findings from this study showed that AI generally outperformed clinicians in retrospective patient outcome prediction, while the aggregation of predictions in the 7-clinician group enabled clinicians to outperform AI. In the prospective setting, subspecialized physicians outperformed AI except for delirium. Subspecialized physicians showed better prediction performance than physicians in training and nurses. In general, clinicians exhibited better performance in predicting mortality than AKI or delirium. Both clinicians and AI performed poorly in LOS prediction.
The performance gap between clinicians and AI was the largest in retrospective delirium prediction. Limited inter-rater agreement among clinicians and between clinicians and AI showed large prediction variability and corroborated the performance gaps between clinicians and AI.

Clinicians performed better against AI in the prospective setting than in the retrospective setting. This may be because AI is better than clinicians at analyzing complex, multi-dimensional data, whereas clinicians have access to tacit patient information that is unavailable in, or cannot be coded into, the EMR and that may carry important predictive information. Furthermore, the fact that clinicians knew the patients present in the ICU well in the prospective setting could have improved their predictions, in comparison with the retrospective setting where they knew the patients only through data.

Comparison with Previous Studies

In recent years, an increasing number of studies have compared clinicians and AI in various clinical tasks, particularly in diagnosis. The systematic reviews conducted by Takita et al., 13 Salinas et al., 14 and Shen et al. 15 reviewed studies that had compared clinicians and AI in medical diagnostics, skin cancer diagnosis, and disease diagnosis, respectively. In general, these reviews found that AI performed better than or comparably to clinicians. They also noted that clinician performance increased with experience. Notably, Goh et al. 16 conducted a randomized clinical trial to investigate whether the use of a large language model (LLM) improved diagnostic reasoning compared with conventional resources in a retrospective setting. They found no difference between using the LLM and conventional resources. In fact, the best diagnostic performance resulted from the LLM alone.
In addition to diagnosis, AI has been shown to yield superior or comparable performance to clinicians in patient question answering, in terms of both response quality and empathy, 17 symptom checking, 18 and triage. 19 Furthermore, Huang et al. 20 prospectively compared physician and nurse predictions of 28-day mortality in ICU patients. They found that physicians outperformed nurses by a substantial margin. Nurses with less than 10 years of ICU experience could not make accurate predictions.

Our findings are largely in line with the findings from these previous studies. Our novel contributions include: 1) comparison between retrospective and prospective settings; 2) focus on prognosis as opposed to diagnosis; and 3) investigation of a wide range of ICU patient outcomes.

Clinical Implications

Clinicians and AI demonstrated distinct but complementary strengths in patient outcome prediction in the ICU. Clinicians’ better performance in the prospective setting speaks to the value of contextual bedside awareness. The results for experienced intensivists and for the aggregation of predictions from 7 clinicians showed how clinician prediction can be improved through clinical experience and through multiple clinicians complementing one another. In contrast, AI excelled when only complex, high-dimensional data were available in the retrospective setting. These findings indicate that senior intensivists would be in the best position to make patient outcome predictions in real-world clinical settings. If only junior clinicians and nurses are present, aggregation of their predictions would also be effective. While AI may not be as good as senior intensivists, it can still provide more accurate predictions than junior clinicians and nurses in an automated manner.

Limitations

Several limitations of this study should be considered. First, prospective data collection was hampered by the erroneous manual entry of patient identifiers, resulting in approximately 35% of the clinician predictions failing data linkage.
Second, prospective data collection occurred during the COVID pandemic, whose unique circumstances may have affected clinical workload, decision-making, and patient characteristics. Third, the retrospective cohort included only seven clinicians, which may limit generalizability. Fourth, in the prospective cohort, only three attending physicians participated, and the loosely defined clinician type grouping may not have fully captured the differences in clinical experience, training, and role. Fifth, prospective predictions were made within 16–24 hours of ICU admission, which may have disadvantaged clinicians compared with the AI model, which had access to all available data up to 24 hours. Sixth, the AI models were trained on pre-COVID data and used to predict patients during the COVID pandemic. This may have led to decreased AI performance in the prospective setting. Lastly, the exclusion, for AKI and delirium, of patients who did not have the required measurements or had already developed AKI or delirium during the first 24 hours in the ICU reduced the sample size by approximately 40% and may have introduced selection bias.

CONCLUSIONS

Our findings indicate that clinicians and AI exhibit complementary strengths in ICU patient outcome prediction. AI excels when only complex EMR data can be used to make predictions, whereas clinicians’ access to contextual information in real-world clinical environments not captured in EMR data led to improved performance. These differences imply that AI may augment, rather than replace, clinician decision-making in real-world ICUs.

Declarations

Data and Code Availability

The ABeICU data cannot be shared without permission from the data custodians, Alberta Health Services and Alberta Health. The clinician predictions may be shared upon reasonable request without the corresponding patient data.
The source code of all data pre-processing, model training, and analyses is available on GitHub: https://github.com/data-intelligence-for-health-lab/Human_vs_AI_ICU_Patient_Outcome_Prediction.

Author Contributions

CK performed all analyses, generated results, and wrote and revised the manuscript. CEV played a major role in the collection of both retrospective and prospective clinician predictions. HTS provided access to the ABeICU database, provided clinical oversight, critically analyzed the study design, and provided some clinician predictions. CBJ provided clinical and epidemiological/biostatistical oversight. JL conceived of and designed the study, provided resources and direct supervision to the research team, and wrote and revised the manuscript. All authors proofread and approved the final manuscript.

Conflict of Interest Disclosures

JL is a co-founder and major shareholder of Symbiotic AI, Inc. All other authors have no conflict of interest to declare.

Acknowledgment

This study was funded by the Cumming School of Medicine and Libin Cardiovascular Institute, University of Calgary, as well as a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2021-02588). We would like to thank Steven Ng, Maliyat Noor, and Anika Achari for their contributions to data collection.

References

1. James, F. R., Power, N. & Laha, S. Decision-making in intensive care medicine – A review. J Intensive Care Soc 19, 247–258 (2018).
2. Zimmerman, J. E., Kramer, A. A., McNair, D. S. & Malila, F. M. Acute Physiology and Chronic Health Evaluation (APACHE) IV: Hospital mortality assessment for today’s critically ill patients. Critical Care Medicine 34, 1297–1310 (2006).
3. Legall, J. R., Lemeshow, S. & Saulnier, F. A new Simplified Acute Physiology Score (SAPS-II) based on a European North-American multicenter study. JAMA 270, 2957–2963 (1993).
4. Power, G. S. & Harrison, D. A. Why try to predict ICU outcomes? Curr Opin Crit Care 20, 544–549 (2014).
5. Ehrenfeld, J. & Cannesson, M. Monitoring Technologies in Acute Care Environments: A Comprehensive Guide to Patient Monitoring Technology. (2014). doi:10.1007/978-1-4614-8557-5.
6. Mokart, D. et al. Delayed intensive care unit admission is associated with increased mortality in patients with cancer with acute respiratory failure. Leuk Lymphoma 54, 1724–1729 (2013).
7. Lee, J. Is Artificial Intelligence Better Than Human Clinicians in Predicting Patient Outcomes? Journal of Medical Internet Research 22, e19918 (2020).
8. Okusa, M. D. & Davenport, A. Reading between the (guide)lines – the KDIGO practice guideline on acute kidney injury in the individual patient. Kidney Int 85, doi:10.1038/ki.2013.378 (2014).
9. Bergeron, N., Dubois, M. J., Dumont, M., Dial, S. & Skrobik, Y. Intensive Care Delirium Screening Checklist: evaluation of a new screening tool. Intensive Care Med 27, 859–864 (2001).
10. Lucini, F. R., Stelfox, H. T. & Lee, J. Deep Learning-Based Recurrent Delirium Prediction in Critically Ill Patients. Crit Care Med 51, 492–502 (2023).
11. Mutnuri, M. K., Stelfox, H. T., Forkert, N. D. & Lee, J. Using Domain Adaptation and Inductive Transfer Learning to Improve Patient Outcome Prediction in the Intensive Care Unit: Retrospective Observational Study. J Med Internet Res 26, e52730 (2024).
12. Brundin-Mather, R. et al. Secondary EMR data for quality improvement and research: A comparison of manual and electronic data collection from an integrated critical care electronic medical record system. Journal of Critical Care 47, 295–301 (2018).
13. Takita, H. et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit. Med. 8, 175 (2025).
14. Salinas, M. P. et al. A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. npj Digit. Med. 7, 125 (2024).
15. Shen, J. et al. Artificial Intelligence Versus Clinicians in Disease Diagnosis: Systematic Review. JMIR Medical Informatics 7, e10010 (2019).
16. Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open 7, e2440969 (2024).
17. Ayers, J. W. et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 183, 589–596 (2023).
18. Gräf, M. et al. Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy. Rheumatol Int 42, 2167–2176 (2022).
19. Baker, A. et al. A Comparison of Artificial Intelligence and Human Doctors for the Purpose of Triage and Diagnosis. Front. Artif. Intell. 3 (2020).
20. Huang, Y., Zhang, R., Deng, Y. & Meng, M. Accuracy of physician and nurse predictions for 28-day prognosis in ICU: a single center prospective study. Sci Rep 13, 22023 (2023).

Supplementary Files

Supplement1.pdf, Supplement2.pdf, Supplement3.pdf

Editorial decision: Revision requested, 06 Apr, 2026
Reviews received at journal, 02 Apr, 2026
Reviews received at journal, 23 Mar, 2026
Reviewers agreed at journal, 11 Mar, 2026
Reviewers agreed at journal, 09 Mar, 2026
Reviewers invited by journal, 11 Feb, 2026
Editor assigned by journal, 04 Feb, 2026
Submission checks completed at journal, 03 Feb, 2026
First submitted to journal, 30 Jan, 2026
We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8745082","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":591647532,"identity":"f9f54fce-cb31-4e2f-a994-ce4a8572f129","order_by":0,"name":"Corin Kuang","email":"","orcid":"","institution":"University of Calgary","correspondingAuthor":false,"prefix":"","firstName":"Corin","middleName":"","lastName":"Kuang","suffix":""},{"id":591647534,"identity":"45974a54-bcbf-4ba6-91fb-3ed14dee78dd","order_by":1,"name":"Camilo E. Valderrama","email":"","orcid":"","institution":"University of Winnipeg","correspondingAuthor":false,"prefix":"","firstName":"Camilo","middleName":"E.","lastName":"Valderrama","suffix":""},{"id":591647536,"identity":"7a9e88d2-230c-46c7-a0b5-f9cd23e0c34c","order_by":2,"name":"Henry T. Stelfox","email":"","orcid":"","institution":"University of Alberta","correspondingAuthor":false,"prefix":"","firstName":"Henry","middleName":"T.","lastName":"Stelfox","suffix":""},{"id":591647539,"identity":"9145238f-49df-4617-8aef-601b187a6a90","order_by":3,"name":"Colin B. 
Josephson","email":"","orcid":"","institution":"University of Calgary","correspondingAuthor":false,"prefix":"","firstName":"Colin","middleName":"B.","lastName":"Josephson","suffix":""},{"id":591647543,"identity":"6f0e7720-1695-4407-97d8-5adf6c7cfe25","order_by":4,"name":"Joon Lee","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA6UlEQVRIie2RsQrCMBBATwKdRNd0UT/hpFAUpN+SUKhLQUfBwUJBl+LczxAE50ChLn5ARkVw7lhE0GbTJWYUzBtyl3CPC3cAFssP0krUOQfoqng2VxDAVQkz79UoKEwVsknLM+Ck58nwUrF7MENBLpX2Y1k5RcDI82XkUb4NxzvheFSr5LFPAQt+kDFQnhFE0QYT5bna5zGpebZSCqkNFMGQxg5ldaEUR98lKyPKMBzmp5s/4skR3cLxRzpl2EyMVoug392EV1k9ltg5plepVRJ4W0drDTAodPUN/c/ro3lJvigWi8Xyd7wAz8dIY6IiWZwAAAAASUVORK5CYII=","orcid":"","institution":"University of Calgary","correspondingAuthor":true,"prefix":"","firstName":"Joon","middleName":"","lastName":"Lee","suffix":""}],"badges":[],"createdAt":"2026-01-30 22:08:45","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8745082/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8745082/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":102852106,"identity":"b3d6911d-7d2b-4c3c-a515-2c97c5e43a81","added_by":"auto","created_at":"2026-02-17 14:35:45","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":159586,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eStudy design and data flow for the retrospective and prospective comparisons of clinician and AI predictions\u003c/strong\u003e. ABeICU was the ICU dataset used for AI training and retrospective comparisons. 
The best performing AI model for each outcome was selected and compared with clinician predictions in two settings: (1) a retrospective setting (left side) where clinicians reviewed de-identified electronic medical records from the first 24 hours of 990 admissions from 990 unique patients in ABeICU to predict their outcomes, and (2) a prospective setting (right side) where clinicians provided bedside predictions during actual patient care between 16 and 24 hours after ICU admission for 238 ICU admissions from 215 unique patients. The same AI models made predictions in both the retrospective and prospective settings based on patient data from the first 24 hours in the ICU.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/9ce4b25b4486c189efbe69ab.png"},{"id":102852112,"identity":"bad9632b-e009-4d87-b83a-c18fbe7cc947","added_by":"auto","created_at":"2026-02-17 14:35:46","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":601298,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eRetrospective comparisons between clinician and AI classification performances across different numbers of clinicians providing predictions\u003c/strong\u003e. Boxplots show the distributions of sensitivity (left) and specificity (right) for clinician and AI predictions across four patient outcomes: in-hospital mortality, 30-day post-discharge mortality, acute kidney injury (AKI), and delirium. Comparisons are stratified by the number of clinicians contributing predictions: one, two, and seven clinicians. 
Statistically significant differences between clinician and AI performances were evaluated using two-sample Wilcoxon rank-sum tests, with P \u0026lt; .05 (*), P \u0026lt; .01 (**), and P \u0026lt; .001 (***) indicating increasing levels of significance; “ns” indicates no statistically significant difference.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/9f55065e7c942e2b3b8b9ee7.png"},{"id":102852111,"identity":"2d249b9d-2c9e-4f41-9c5c-267401b5e5cd","added_by":"auto","created_at":"2026-02-17 14:35:46","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":530064,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eRetrospective comparisons between clinician and AI performances for length of stay predictions across different numbers of clinicians providing predictions\u003c/strong\u003e. Boxplots show the distributions of mean absolute error (MAE; top row), root mean squared error (RMSE; middle row), and coefficient of determination (R²; bottom row) for clinician and AI predictions for intensive care unit (ICU; left) and hospital lengths of stay (right). Comparisons are stratified by the number of clinicians contributing predictions: one, two, and seven clinicians. 
Statistically significant differences between clinician and AI performances were evaluated using two-sample Wilcoxon rank-sum tests, with P \u0026lt; .05 (*), P \u0026lt; .01 (**), and P \u0026lt; .001 (***) indicating increasing levels of significance; “ns” indicates no statistically significant difference.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/d9d4b804189376a1a221b477.png"},{"id":102852113,"identity":"cfc40949-4c8a-4e86-b5ad-d302598cf45b","added_by":"auto","created_at":"2026-02-17 14:35:47","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":474280,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eProspective comparisons between clinician and AI classification performances across different clinician types\u003c/strong\u003e. Boxplots show the distributions of sensitivity (left) and specificity (right) for clinician and AI predictions across four patient outcomes: in-hospital mortality, 30-day post-discharge mortality, acute kidney injury (AKI), and delirium. Comparisons are stratified by clinician type: subspecialized physicians (intensivists and fellows), physicians in training (residents and interns), and nurses. 
Statistically significant differences between clinician and AI performances were evaluated using two-sample Wilcoxon rank-sum tests, with P \u0026lt; .05 (*), P \u0026lt; .01 (**), and P \u0026lt; .001 (***) indicating increasing levels of significance; “ns” indicates no statistically significant difference.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/ab614f6b75911d78b7bcc682.png"},{"id":102852108,"identity":"bdf0f7e8-278e-44e9-b86a-4e0beed86fdd","added_by":"auto","created_at":"2026-02-17 14:35:45","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":534723,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eProspective comparisons between clinician and AI performances for length of stay predictions across different clinician types.\u003c/strong\u003e Boxplots show the distributions of mean absolute error (MAE; top row), root mean squared error (RMSE; middle row), and coefficient of determination (R²; bottom row) for clinician and AI predictions for intensive care unit (ICU) (left) and hospital lengths of stay (right). Comparisons are stratified by clinician type: subspecialized physicians (intensivists and fellows), physicians in training (residents and interns), and nurses. 
Statistically significant differences between clinician and AI performances were evaluated using two-sample Wilcoxon rank-sum tests, with P \u0026lt; .05 (*), P \u0026lt; .01 (**), and P \u0026lt; .001 (***) indicating increasing levels of significance.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/0c206511e1de5145195507fe.png"},{"id":102965320,"identity":"a3440965-c656-4445-9d75-5027f8a9ce4f","added_by":"auto","created_at":"2026-02-19 04:31:13","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2993576,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/cc4641c7-1c0f-44ca-8f96-9311bb143fe9.pdf"},{"id":102963555,"identity":"bd27e937-0730-4160-b9c5-15bfbb8c15c4","added_by":"auto","created_at":"2026-02-19 04:19:00","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":966099,"visible":true,"origin":"","legend":"","description":"","filename":"Supplement1.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/55272569cd8c6009fc62a317.pdf"},{"id":102852107,"identity":"965f516d-ba6e-4b61-8706-6f5292c209e9","added_by":"auto","created_at":"2026-02-17 14:35:45","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":842531,"visible":true,"origin":"","legend":"","description":"","filename":"Supplement2.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/d5155bdc2bd5f70a0c89e8a0.pdf"},{"id":102852110,"identity":"763c2b9d-f58e-4692-9292-beadd371d978","added_by":"auto","created_at":"2026-02-17 
14:35:46","extension":"pdf","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":138925,"visible":true,"origin":"","legend":"","description":"","filename":"Supplement3.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8745082/v1/e508c0ec308434c8f06e063e.pdf"}],"financialInterests":"Competing interest reported. JL is a co-founder and major shareholder of Symbiotic AI, Inc. All other authors have no conflict of interest to declare.","formattedTitle":"Clinicians vs. Artificial Intelligence in Patient Outcome Prediction in the Intensive Care Unit","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eThe intensive care unit (ICU) is a dynamic environment where clinicians make frequent, high-stakes decisions for critically ill patients.\u003csup\u003e1\u003c/sup\u003e Accurate prediction of patient outcomes such as mortality and length of stay (LOS) can support timely interventions, efficient resource allocation, and improved patient care. Although traditional scoring systems such as APACHE\u003csup\u003e2\u003c/sup\u003e and SAPS\u003csup\u003e3\u003c/sup\u003e are widely used, they offer limited predictive accuracy at the individual level.\u003csup\u003e4\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eThe growing volume and complexity of ICU data exceed human cognitive limits, leading to information overload and potentially suboptimal decisions.\u003csup\u003e5\u003c/sup\u003e Machine learning-based artificial intelligence (AI) offers powerful tools to analyze these data, identify patterns, and generate accurate predictions. Numerous studies have demonstrated strong AI performance in predicting outcomes such as sepsis, delirium, and hospital LOS.\u003csup\u003e6\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eHowever, few studies have directly compared AI and clinician performances in ICU patient outcome prediction, and most evaluations have been limited to retrospective settings. 
As a result, it remains unclear whether AI prediction performance holds up in prospective, real-world clinical settings or how it compares with clinician prediction performance.\u003csup\u003e7\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eTo address this gap, we compared the predictive performances of clinicians and AI for several ICU patient outcomes using both retrospective and prospective designs.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cp\u003eOverall Study Design and Setting\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the overall retrospective and prospective settings designed to compare the patient outcome prediction performances of ICU clinicians and AI. The retrospective and prospective comparisons involved 990 and 238 ICU admissions, respectively. AI models trained on 46,631 ICU admissions provided AI predictions in both settings.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003ePredicted Patient Outcomes\u003c/p\u003e \u003cp\u003eIn both retrospective and prospective settings, the primary patient outcomes were in-hospital mortality, 30-day post-discharge mortality, ICU LOS, and hospital LOS. Secondary outcomes were delirium and acute kidney injury (AKI) during the ICU stay. To determine AKI outcomes, patients were required to have at least two serum creatinine measurements to allow assessment according to the Kidney Disease: Improving Global Outcomes (KDIGO) criteria,\u003csup\u003e8\u003c/sup\u003e defined as an increase in serum creatinine of \u0026ge;\u0026thinsp;0.3 mg/dL (\u0026ge;\u0026thinsp;26.5 \u0026micro;mol/L) within 48 hours or an increase to \u0026ge;\u0026thinsp;1.5 times the baseline value within the previous 7 days.
Delirium outcomes were determined using the Intensive Care Delirium Screening Checklist (ICDSC),\u003csup\u003e9\u003c/sup\u003e with scores\u0026thinsp;\u0026ge;\u0026thinsp;4 indicating delirium.\u003c/p\u003e \u003cp\u003eAI Model Development\u003c/p\u003e \u003cp\u003eThe AI models used in both the retrospective and prospective settings were trained by applying machine learning to the ABeICU database.\u003csup\u003e10,11\u003c/sup\u003e ABeICU includes 55,689 ICU admissions from 48,672 de-identified adult patients across 15 ICUs in Alberta, Canada, including tertiary-care and community ICUs caring for medical, surgical, trauma, burn, neuroscience, and vascular surgery patients, between February 2012 and December 2019. It integrates eCritical TRACER,\u003csup\u003e12\u003c/sup\u003e an electronic medical record (EMR)-based ICU repository containing demographics, vital signs, laboratory results, acuity scores, clinical assessments, and intervention data, linked with the Vital Statistics database from Alberta Health for out-of-hospital mortality data.\u003c/p\u003e \u003cp\u003eThe inclusion criteria applied to ABeICU were adult patients (18 years or older) with an ICU LOS longer than 24 hours. In addition, 1,000 patients from December 2018 to June 2019 were reserved for the retrospective comparisons; because of data linkage issues, 990 of them were ultimately used. For each outcome, patients with a missing target label were excluded. For AKI and delirium predictions, patients who had already developed the respective condition within the first 24 hours of ICU admission were excluded. The final model development cohort diagram for each predicted outcome is shown in eFigure 1.\u003c/p\u003e \u003cp\u003eEach model was developed to use patient data from the first 24 hours in the ICU to predict a given outcome at 24 hours after ICU admission.
Five models were trained for each classification outcome (in-hospital mortality, 30-day post-discharge mortality, AKI, and delirium): logistic regression, support vector machine, XGBoost, random forest, and neural network. In addition, five regression models were trained to predict ICU LOS and hospital LOS: elastic net, support vector machine, XGBoost, random forest, and neural network.\u003c/p\u003e \u003cp\u003eClassification model performance was evaluated using sensitivity, specificity, accuracy, balanced accuracy, F1 score, the area under the precision-recall curve (AUPRC), and the area under the receiver operating characteristic curve (AUROC). Regression performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R\u0026sup2;).\u003c/p\u003e \u003cp\u003eAll model training followed a standardized pipeline shown in eFigure 2. Hyperparameter tuning, using the Bayesian optimization in Optuna, and best model selection optimized median AUROC and RMSE in 5-fold cross-validation for classification and regression, respectively. The hyperparameter space of each model is shown in eTables 1 and 2. Classification model output thresholds were optimized for AUPRC in 5-fold cross-validation.\u003c/p\u003e \u003cp\u003eAI modeling utilized scikit-learn, XGBoost, and PyTorch in Python (version 3.9.19).\u003c/p\u003e \u003cp\u003eRetrospective Clinician Predictions\u003c/p\u003e \u003cp\u003eThree ICU physicians (two intensivists and one fellow) and four ICU nurses from the Foothills Medical Centre (FMC) in Calgary, Alberta, Canada were recruited to provide retrospective outcome predictions for the 990 ABeICU test set patients. Predictions were made based on de-identified patient data from the first 24 hours after ICU admission in secure REDCap web surveys. 
Patient data were presented in both tabular and visual formats, with selected variables (e.g., vital signs) displayed as time series plots to facilitate interpretation. An example REDCap survey is provided in Supplement 2.\u003c/p\u003e \u003cp\u003eFifty of the 990 patient cases were randomly selected to be predicted by all seven clinicians. The remaining patients were evenly divided into five prediction sets, and each clinician was assigned to one set, containing approximately 190 patients. Each patient received predictions from one, two, or all seven clinicians. For analysis, the 990 patients were categorized into three groups based on the number of clinician predictions received (eTable 3). Clinicians were given three weeks to predict their assigned cases. To minimize fatigue, clinicians were encouraged to distribute the workload across multiple sessions.\u003c/p\u003e \u003cp\u003eProspective Clinician Predictions\u003c/p\u003e \u003cp\u003eIn the prospective setting, we collected bedside predictions from clinicians at the FMC ICU, a 28-bed tertiary unit providing multidisciplinary care for medical, surgical, trauma, burn, vascular, and neuroscience patients. We included adult patients who had been in the ICU for 16\u0026ndash;24 hours and asked clinicians to predict the primary and secondary outcomes for the patients they were caring for. Clinicians were encouraged to use all information available in routine clinical practice. Clinician predictions were collected with a paper-based survey similar to the one used in the retrospective setting (Supplement 3), capturing clinician type, bed number, admission timestamps, and patient identifier to enable later data linkage for AI prediction and outcome determination.\u003c/p\u003e \u003cp\u003eIn total, 465 predictions from 75 clinicians were collected for 287 patients admitted to the FMC ICU from September 1, 2020 to December 31, 2022, with each patient predicted by 1 to 4 clinicians.
For analysis, clinicians were grouped based on their type: subspecialized physicians (attending physicians and fellows), physicians in training (residents and interns), and nurses (eTable 4).\u003c/p\u003e \u003cp\u003eRetrospective and Prospective AI Predictions\u003c/p\u003e \u003cp\u003eFor both the retrospective and prospective settings, the same best AI models predicted all primary and secondary outcomes for the patients that clinicians predicted, based on patient data from the first 24 hours in the ICU.\u003c/p\u003e \u003cp\u003eFor the prospective setting, a data set comparable to ABeICU was extracted for the included patients to provide features (i.e., model input variables) for AI prediction and ground truth outcome labels for prediction performance evaluation. For each outcome, we excluded cases where data linkage was unsuccessful, the outcome could not be determined, or the outcome had already occurred during the first 24 hours in the ICU, leaving 238 ICU admissions from 215 patients for analysis (see eFigure 3 for a cohort diagram).\u003c/p\u003e \u003cp\u003eStatistical Analyses\u003c/p\u003e \u003cp\u003eIn both the retrospective and prospective settings, two-sample Wilcoxon rank-sum tests were used to compare the prediction performance metrics between the clinicians and AI based on 1,000 bootstrapped samples of the test data. These comparisons were stratified by the number of clinicians providing predictions and clinician type in the retrospective and prospective comparisons, respectively.\u003c/p\u003e \u003cp\u003eSince the clinician predictions for the classification outcomes were recorded using a 5-point Likert scale, the top two options (i.e., \u0026ldquo;most likely\u0026rdquo; and \u0026ldquo;likely\u0026rdquo;) were combined to represent a positive prediction. 
When there were two or more clinicians predicting the same patient, their Likert-scale responses were averaged first and then dichotomized.\u003c/p\u003e \u003cp\u003eFurthermore, the prediction-level inter-rater agreement between clinicians and AI was assessed using Cohen\u0026rsquo;s κ and the intraclass correlation coefficient (ICC) for the classification and regression predictions, respectively. The inter-rater agreement between clinicians when multiple clinicians predicted, in both the retrospective and prospective settings, was also assessed using the same methods.\u003c/p\u003e \u003cp\u003eAll analyses were performed in Python (version 3.9.19).\u003c/p\u003e \u003cp\u003eSensitivity Analysis\u003c/p\u003e \u003cp\u003eA sensitivity analysis on the Likert scale threshold was conducted by grouping \u0026ldquo;most likely\u0026rdquo;, \u0026ldquo;likely\u0026rdquo;, and \u0026ldquo;equally likely\u0026rdquo; to form a positive prediction. Prediction performance comparisons between clinicians and AI were repeated with this new threshold for the classification outcomes.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eEthics Approval\u003c/strong\u003e \u003cp\u003e This study was approved by the University of Calgary Conjoint Health Research Ethics Board (REB19-2107) and conducted in accordance with the Declaration of Helsinki. All research participants provided informed consent prior to participating in the study.\u003c/p\u003e \u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003ePatient Cohorts\u003c/p\u003e \u003cp\u003eeTable 5 summarizes the patient characteristics of the three cohorts used in this study for AI model training, retrospective comparisons, and prospective comparisons. While baseline characteristics were similar between the model training and retrospective cohorts, the prospective cohort had higher in-hospital mortality and delirium rates, as well as a longer mean hospital LOS. 
The severity of illness scores also indicate that the prospective cohort was more ill in general.\u003c/p\u003e \u003cp\u003eBest AI Model Selection\u003c/p\u003e \u003cp\u003eeTable 6 lists the features selected for at least one AI model across all outcomes. eTables 7 and 8 report the 5-fold cross-validation prediction performances of all AI models for the classification and regression outcomes, respectively. For all outcomes, XGBoost was the best model based on median AUROC and RMSE for classification and regression, respectively. The performances of the best XGBoost models are comparable to previously reported performances on ABeICU.\u003csup\u003e11\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eRetrospective Clinician vs. AI Comparisons\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and eTable 9 compare the retrospective prediction performances of clinicians and AI for the four classification outcomes. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and eTable 10 show the corresponding results for the two regression outcomes.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor the classification outcomes, AI generally outperformed clinicians in the one- and two-clinician groups in terms of most performance metrics, with some exceptions such as specificity in the two-clinician group for in-hospital mortality (median: 0.925, 95% confidence interval [CI]: [0.895\u0026ndash;0.953] vs. median: 0.896, 95% CI: [0.861\u0026ndash;0.926]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) or sensitivity in the one-clinician group for AKI (median: 0.652, 95% CI: [0.564\u0026ndash;0.738] vs. median: 0.551, 95% CI: [0.464\u0026ndash;0.650]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). 
However, in the seven-clinician group, clinicians tended to outperform AI for all outcomes except delirium; AI outperformed clinicians by the largest margin in delirium prediction. Both clinicians and AI were more specific than sensitive for the two mortality outcomes and AKI that exhibited low event rates. Specificity and sensitivity were more balanced for delirium for both clinicians and AI.\u003c/p\u003e \u003cp\u003eFor the LOS outcomes, AI outperformed clinicians in general except MAE in the two-clinician group for ICU LOS (median: 4.436, 95% CI: [4.109\u0026ndash;4.786] vs. median: 4.534, 95% CI: [4.275\u0026ndash;4.800]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and MAE in the two-clinician group for hospital LOS (median: 16.73, 95% CI: [15.46\u0026ndash;18.08] vs. median: 17.14, 95% CI: [16.18\u0026ndash;18.08]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Both clinicians and AI exhibited poor R\u003csup\u003e2\u003c/sup\u003e scores for both LOS outcomes. Unlike the classification outcomes, clinicians did not outperform AI in the seven-clinician group.\u003c/p\u003e \u003cp\u003eProspective Clinician vs. AI Comparisons\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e and eTable 11 compare the prospective prediction performances of clinicians and AI for the classification outcomes. Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e and eTable 12 show the corresponding results for the regression outcomes.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor the classification outcomes, subspecialized physicians generally outperformed AI across all outcomes except delirium, with some exceptions such as sensitivity for in-hospital mortality (median: 0.625, 95% CI: [0.384\u0026ndash;0.857] vs. median: 0.700, 95% CI: [0.444\u0026ndash;0.923]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). 
For delirium, there was no significant difference between subspecialized physicians and AI for any metric. Physicians in training and nurses tended to perform worse than subspecialized physicians but still outperformed AI for 30-day mortality with respect to most performance metrics. AI generally outperformed trainees and nurses in AKI and delirium predictions, with some exceptions such as specificity in nurses for delirium (median: 0.681, 95% CI: [0.500\u0026ndash;0.850] vs. median: 0.640, 95% CI: [0.444\u0026ndash;0.826]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e \u003cp\u003eFor both LOS predictions, all clinician types outperformed AI in terms of MAE except for subspecialized physicians for hospital LOS (median: 42.646, 95% CI: [28.483\u0026ndash;61.359] vs. median: 37.798, 95% CI: [25.583\u0026ndash;54.494]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). AI outperformed clinicians with respect to RMSE and R\u003csup\u003e2\u003c/sup\u003e for both LOS outcomes except for RMSE (median: 9.295, 95% CI: [5.087\u0026ndash;13.839] vs. median: 9.641, 95% CI: [6.108\u0026ndash;14.277]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and R\u003csup\u003e2\u003c/sup\u003e (median: 0.022, 95% CI: [-0.520\u0026ndash;0.317] vs. median: -0.056, 95% CI: [-0.551\u0026ndash;0.061]; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) in trainees for ICU LOS.\u003c/p\u003e \u003cp\u003eInter-rater Agreement Analyses\u003c/p\u003e \u003cp\u003eeTables 13\u0026ndash;14 and 15\u0026ndash;16 present the inter-rater agreements between clinicians in the retrospective and prospective settings, respectively. In general, agreements were poor in the retrospective setting, with fair agreements observed in AKI prediction.
Agreements were also poor in the prospective setting but there were moderate agreements in the 2-clinician group for in-hospital mortality and 30-day mortality, in the 2- and 4-clinician groups for AKI, and in the 2-clinician group for hospital LOS.\u003c/p\u003e \u003cp\u003eeTables 17\u0026ndash;18 and 19\u0026ndash;20 show the inter-rater agreements between clinicians and AI in the retrospective and prospective settings, respectively. Overall, the retrospective predictions in the 7-clinician group, except AKI, led to moderate agreements but all other results showed poor to fair agreements.\u003c/p\u003e \u003cp\u003eSensitivity Analysis\u003c/p\u003e \u003cp\u003eeTables 21\u0026ndash;22 report the sensitivity analysis results from adjusting the Likert scale threshold in the retrospective and prospective settings, respectively. As expected from the inherent tradeoffs between performance metrics, grouping the middle Likert scale point into the positive class improved sensitivity and NPV, as well as F1 and balanced accuracy to a lesser extent, at the expense of specificity and PPV. Overall, the 7-clinician group and subspecialized physicians outperformed AI to a lesser extent. The Likert threshold used in the primary analysis seemed to be a better choice for clinician predictions.\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003ePrincipal Findings\u003c/p\u003e \u003cp\u003eThe findings from this study showed that AI generally outperformed clinicians in retrospective patient outcome prediction, while the aggregation of predictions in the 7-clinician group enabled clinicians to outperform AI. In the prospective setting, subspecialized physicians outperformed AI except for delirium. Subspecialized physicians showed better prediction performance than physicians in training and nurses. In general, clinicians exhibited better performance in predicting mortality than AKI or delirium. Both clinicians and AI performed poorly in LOS prediction. 
The performance gap between clinicians and AI was the largest in retrospective delirium prediction. Limited inter-rater agreements among clinicians and between clinicians and AI showed large prediction variability and corroborated the performance gaps between clinicians and AI.\u003c/p\u003e \u003cp\u003eClinicians performed better relative to AI in the prospective setting than in the retrospective setting. One possible explanation is that AI is better than clinicians at analyzing complex, multi-dimensional data, whereas clinicians have access to tacit patient information that is unavailable in, or cannot be coded into, the EMR and that may carry important predictive value. Furthermore, the fact that clinicians knew the patients present in the ICU well in the prospective setting could have improved their predictions, in comparison with the retrospective setting where they knew the patients only through data.\u003c/p\u003e \u003cp\u003eComparison with Previous Studies\u003c/p\u003e \u003cp\u003eIn recent years, an increasing number of studies have compared clinicians and AI in various clinical tasks, particularly in diagnosis. The systematic reviews conducted by Takita et al.,\u003csup\u003e13\u003c/sup\u003e Salinas et al.,\u003csup\u003e14\u003c/sup\u003e and Shen et al.\u003csup\u003e15\u003c/sup\u003e covered studies that had compared clinicians and AI in medical diagnostics, skin cancer diagnosis, and disease diagnosis, respectively. In general, these reviews found that AI performed better than or comparably to clinicians. They also noted that clinician performance increased with experience.\u003c/p\u003e \u003cp\u003eNotably, Goh et al.\u003csup\u003e16\u003c/sup\u003e conducted a randomized clinical trial to investigate whether the use of a large language model (LLM) improved diagnostic reasoning compared with conventional resources in a retrospective setting. They found that there was no difference between using the LLM and conventional resources.
In fact, the best diagnostic performance resulted from the LLM alone.\u003c/p\u003e \u003cp\u003eIn addition to diagnosis, AI has been shown to perform better than or comparably to clinicians in patient question answering, in terms of both response quality and empathy,\u003csup\u003e17\u003c/sup\u003e symptom checking,\u003csup\u003e18\u003c/sup\u003e and triage.\u003csup\u003e19\u003c/sup\u003e Furthermore, Huang et al.\u003csup\u003e20\u003c/sup\u003e prospectively compared physician and nurse predictions of 28-day mortality in ICU patients. They found that physicians outperformed nurses by a substantial margin. Nurses with less than 10 years of ICU experience could not make accurate predictions.\u003c/p\u003e \u003cp\u003eOur findings are largely in line with those of these previous studies. Our novel contributions include: 1) comparison between retrospective and prospective settings; 2) focus on prognosis as opposed to diagnosis; and 3) investigation of a wide range of ICU patient outcomes.\u003c/p\u003e \u003cp\u003eClinical Implications\u003c/p\u003e \u003cp\u003eClinicians and AI demonstrated distinct but complementary strengths in patient outcome prediction in the ICU. Clinicians\u0026rsquo; better performance in the prospective setting speaks to the value of contextual bedside awareness. The results for experienced intensivists and for the aggregated predictions of 7 clinicians showed how clinician prediction can be improved through clinical experience and through multiple clinicians complementing one another. In contrast, AI excelled when only complex, high-dimensional data were available in the retrospective setting.\u003c/p\u003e \u003cp\u003eThese findings suggest that senior intensivists would be in the best position to make patient outcome predictions in real-world clinical settings. If only junior clinicians and nurses are present, aggregating their predictions may also be effective. 
While AI may not be as good as senior intensivists, it can still provide more accurate predictions than junior clinicians and nurses in an automated manner.\u003c/p\u003e \u003cp\u003eLimitations\u003c/p\u003e \u003cp\u003eSeveral limitations of this study should be considered. First, prospective data collection was hampered by erroneous manual entry of patient identifiers, which caused approximately 35% of the clinician predictions to fail data linkage. Second, prospective data collection occurred during the COVID pandemic, whose unique circumstances may have affected clinical workload, decision-making, and patient characteristics. Third, the retrospective cohort included only seven clinicians, which may limit generalizability. Fourth, in the prospective cohort, only three attending physicians participated, and the loosely defined clinician type grouping may not have fully captured the differences in clinical experience, training, and role. Fifth, prospective predictions were made within 16\u0026ndash;24 hours of ICU admission, which may have disadvantaged clinicians relative to the AI model, which had access to all available data up to 24 hours. Sixth, the AI models were trained on pre-COVID data and then applied to patients admitted during the COVID pandemic, which may have decreased AI performance in the prospective setting. Lastly, for AKI and delirium, the exclusion of patients who lacked the required measurements or had already developed AKI or delirium during the first 24 hours in the ICU reduced the sample size by approximately 40% and may have introduced selection bias.\u003c/p\u003e"},{"header":"CONCLUSIONS","content":"\u003cp\u003eOur findings indicate that clinicians and AI exhibit complementary strengths in ICU patient outcome prediction. 
AI excels when only complex EMR data can be used to make predictions, whereas clinicians\u0026rsquo; access to contextual information in real-world clinical environments not captured in EMR data led to improved performance. These differences imply that AI may augment, rather than replace, clinician decision-making in real-world ICUs.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eData and Code Availability\u003c/h2\u003e\n\u003cp\u003eThe ABeICU data cannot be shared without permission from the data custodians, Alberta Health Services and Alberta Health. The clinician predictions may be shared upon reasonable request without the corresponding patient data.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe source code of all data pre-processing, model training, and analyses is available on GitHub: \u0026nbsp;https://github.com/data-intelligence-for-health-lab/Human_vs_AI_ICU_Patient_Outcome_Prediction.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003eAuthor Contributions\u003c/h2\u003e\n\u003cp\u003eCK performed all analyses, generated results, and wrote and revised the manuscript. CEV played a major role in the collection of both retrospective and prospective clinician predictions. HTS provided access to the ABeICU database, provided clinical oversight, critically analyzed the study design, and provided some clinician predictions. CBJ provided clinical and epidemiological/biostatistical oversight. JL conceived of and designed the study, provided resources and direct supervision to the research team, and wrote and revised the manuscript. All authors proofread and approved the final manuscript.\u003c/p\u003e\n\u003ch2\u003eConflict of Interest Disclosures\u003c/h2\u003e\n\u003cp\u003eJL is a co-founder and major shareholder of Symbiotic AI, Inc. 
All other authors have no conflict of interest to declare.\u003c/p\u003e\n\u003ch2\u003eAcknowledgment\u003c/h2\u003e\n\u003cp\u003eThis study was funded by the Cumming School of Medicine and Libin Cardiovascular Institute, University of Calgary, as well as a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2021-02588).\u003c/p\u003e\n\u003cp\u003eWe would like to thank Steven Ng, Maliyat Noor, and Anika Achari for their contributions to data collection.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eJames, F. R., Power, N. \u0026amp; Laha, S. Decision-making in intensive care medicine \u0026ndash; A review. \u003cem\u003eJ Intensive Care Soc\u003c/em\u003e \u003cstrong\u003e19\u003c/strong\u003e, 247\u0026ndash;258 (2018).\u003c/li\u003e\n \u003cli\u003eZimmerman, J. E., Kramer, A. A., McNair, D. S. \u0026amp; Malila, F. M. Acute Physiology and Chronic Health Evaluation (APACHE) IV: Hospital mortality assessment for today\u0026rsquo;s critically ill patients*. \u003cem\u003eCritical Care Medicine\u003c/em\u003e \u003cstrong\u003e34\u003c/strong\u003e, 1297\u0026ndash;1310 (2006).\u003c/li\u003e\n \u003cli\u003eLe Gall, J. R., Lemeshow, S. \u0026amp; Saulnier, F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. \u003cem\u003eJAMA\u003c/em\u003e \u003cstrong\u003e270\u003c/strong\u003e, 2957\u0026ndash;2963 (1993).\u003c/li\u003e\n \u003cli\u003ePower, G. S. \u0026amp; Harrison, D. A. Why try to predict ICU outcomes? \u003cem\u003eCurr Opin Crit Care\u003c/em\u003e \u003cstrong\u003e20\u003c/strong\u003e, 544\u0026ndash;549 (2014).\u003c/li\u003e\n \u003cli\u003eEhrenfeld, J. \u0026amp; Cannesson, M. 
\u003cem\u003eMonitoring Technologies in Acute Care Environments: A Comprehensive Guide to Patient Monitoring Technology\u003c/em\u003e. (2014). doi:10.1007/978-1-4614-8557-5.\u003c/li\u003e\n \u003cli\u003eMokart, D. \u003cem\u003eet al.\u003c/em\u003e Delayed intensive care unit admission is associated with increased mortality in patients with cancer with acute respiratory failure. \u003cem\u003eLeuk Lymphoma\u003c/em\u003e \u003cstrong\u003e54\u003c/strong\u003e, 1724\u0026ndash;1729 (2013).\u003c/li\u003e\n \u003cli\u003eLee, J. Is Artificial Intelligence Better Than Human Clinicians in Predicting Patient Outcomes? \u003cem\u003eJournal of Medical Internet Research\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, e19918 (2020).\u003c/li\u003e\n \u003cli\u003eOkusa, M. D. \u0026amp; Davenport, A. Reading between the (guide)lines\u0026mdash;the KDIGO practice guideline on acute kidney injury in the individual patient. \u003cem\u003eKidney Int\u003c/em\u003e \u003cstrong\u003e85\u003c/strong\u003e, 10.1038/ki.2013.378 (2014).\u003c/li\u003e\n \u003cli\u003eBergeron, N., Dubois, M. J., Dumont, M., Dial, S. \u0026amp; Skrobik, Y. Intensive Care Delirium Screening Checklist: evaluation of a new screening tool. \u003cem\u003eIntensive Care Med\u003c/em\u003e \u003cstrong\u003e27\u003c/strong\u003e, 859\u0026ndash;864 (2001).\u003c/li\u003e\n \u003cli\u003eLucini, F. R., Stelfox, H. T. \u0026amp; Lee, J. Deep Learning-Based Recurrent Delirium Prediction in Critically Ill Patients. \u003cem\u003eCrit Care Med\u003c/em\u003e \u003cstrong\u003e51\u003c/strong\u003e, 492\u0026ndash;502 (2023).\u003c/li\u003e\n \u003cli\u003eMutnuri, M. K., Stelfox, H. T., Forkert, N. D. \u0026amp; Lee, J. Using Domain Adaptation and Inductive Transfer Learning to Improve Patient Outcome Prediction in the Intensive Care Unit: Retrospective Observational Study. 
\u003cem\u003eJ Med Internet Res\u003c/em\u003e \u003cstrong\u003e26\u003c/strong\u003e, e52730 (2024).\u003c/li\u003e\n \u003cli\u003eBrundin-Mather, R. \u003cem\u003eet al.\u003c/em\u003e Secondary EMR data for quality improvement and research: A comparison of manual and electronic data collection from an integrated critical care electronic medical record system. \u003cem\u003eJournal of Critical Care\u003c/em\u003e \u003cstrong\u003e47\u003c/strong\u003e, 295\u0026ndash;301 (2018).\u003c/li\u003e\n \u003cli\u003eTakita, H. \u003cem\u003eet al.\u003c/em\u003e A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. \u003cem\u003enpj Digit. Med.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 175 (2025).\u003c/li\u003e\n \u003cli\u003eSalinas, M. P. \u003cem\u003eet al.\u003c/em\u003e A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. \u003cem\u003enpj Digit. Med.\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 125 (2024).\u003c/li\u003e\n \u003cli\u003eShen, J. \u003cem\u003eet al.\u003c/em\u003e Artificial Intelligence Versus Clinicians in Disease Diagnosis: Systematic Review. \u003cem\u003eJMIR Medical Informatics\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, e10010 (2019).\u003c/li\u003e\n \u003cli\u003eGoh, E. \u003cem\u003eet al.\u003c/em\u003e Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. \u003cem\u003eJAMA Netw Open\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, e2440969 (2024).\u003c/li\u003e\n \u003cli\u003eAyers, J. W. \u003cem\u003eet al.\u003c/em\u003e Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. \u003cem\u003eJAMA Intern Med\u003c/em\u003e \u003cstrong\u003e183\u003c/strong\u003e, 589\u0026ndash;596 (2023).\u003c/li\u003e\n \u003cli\u003eGr\u0026auml;f, M. 
\u003cem\u003eet al.\u003c/em\u003e Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy. \u003cem\u003eRheumatol Int\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 2167\u0026ndash;2176 (2022).\u003c/li\u003e\n \u003cli\u003eBaker, A. \u003cem\u003eet al.\u003c/em\u003e A Comparison of Artificial Intelligence and Human Doctors for the Purpose of Triage and Diagnosis. \u003cem\u003eFront. Artif. Intell.\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, (2020).\u003c/li\u003e\n \u003cli\u003eHuang, Y., Zhang, R., Deng, Y. \u0026amp; Meng, M. Accuracy of physician and nurse predictions for 28-day prognosis in ICU: a single center prospective study. \u003cem\u003eSci Rep\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 22023 (2023).\u003cstrong\u003e\u003cbr\u003e\u003c/strong\u003e\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8745082/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8745082/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIMPORTANCE: Accurate prediction of patient outcomes in the intensive care unit (ICU) is critical for clinical decision-making. While artificial intelligence (AI) has shown potential in retrospective prediction, direct comparisons with human clinicians, particularly in prospective real-world settings, remain unclear.\u003c/p\u003e \u003cp\u003eOBJECTIVE: To compare the predictive performances of human clinicians and AI in predicting ICU patient outcomes both retrospectively and prospectively.\u003c/p\u003e \u003cp\u003eDESIGN: A mixed retrospective and prospective study conducted to compare clinician and AI performances of ICU patient outcome prediction.\u003c/p\u003e \u003cp\u003eSETTING: Fifteen adult ICUs in Alberta, Canada.\u003c/p\u003e \u003cp\u003ePARTICIPANTS: Retrospective analysis included 990 ICU admissions randomly selected between February 2012 and December 2019, the patient outcomes of which were collectively predicted by 7 clinicians. 
Prospective analysis involved 238 ICU admissions from 215 adult patients between September 2020 and December 2022, with a total of 75 clinicians making at least one prediction each.\u003c/p\u003e \u003cp\u003eEXPOSURES: Retrospective clinician predictions were made based on patient data. Prospective clinician predictions were collected during active patient care. AI models were trained on retrospective data from 46,631 ICU admissions of 41,096 unique patients to predict the outcomes of the same patients the clinicians predicted in the retrospective and prospective settings.\u003c/p\u003e \u003cp\u003eMAIN OUTCOMES AND MEASURES: Primary patient outcomes were in-hospital mortality, 30-day post-discharge mortality, ICU length of stay (LOS), and hospital LOS. Secondary outcomes included the occurrences of delirium and acute kidney injury during the ICU stay. Classification and regression performance metrics of AI and clinicians were compared using Wilcoxon rank-sum tests. Inter-rater agreements amongst clinicians and between clinicians and AI were analyzed with Cohen\u0026rsquo;s kappa and the intraclass correlation coefficient.\u003c/p\u003e \u003cp\u003eRESULTS: In the retrospective setting, AI generally outperformed clinicians but aggregated predictions from seven clinicians outperformed AI. In the prospective setting, subspecialized physicians generally outperformed AI, whereas physicians in training and nurses generally underperformed AI. Both clinicians and AI performed poorly in LOS prediction. Inter-rater agreement was poor or fair for both amongst clinicians and between clinicians and AI.\u003c/p\u003e \u003cp\u003eCONCLUSIONS AND RELEVANCE: This study provides a comprehensive evaluation of clinician and AI performances in ICU outcome prediction under both retrospective and real-world prospective conditions, setting important prediction performance benchmarks.\u003c/p\u003e","manuscriptTitle":"Clinicians vs. 
Artificial Intelligence in Patient Outcome Prediction in the Intensive Care Unit","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-17 14:35:38","doi":"10.21203/rs.3.rs-8745082/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-04-06T23:42:41+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-02T10:01:27+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-23T05:06:32+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"306593991964804716579798957930117890597","date":"2026-03-12T03:08:09+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"204014272093530182096640188095507656067","date":"2026-03-09T22:46:07+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-12T04:04:17+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-05T01:24:29+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-04T04:14:31+00:00","index":"","fulltext":""},{"type":"submitted","content":"npj Digital Medicine","date":"2026-01-30T21:59:10+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"fef654aa-1a46-47ed-9d0b-005c0c15a2fd","owner":[],"postedDate":"February 17th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[{"id":62957421,"name":"Health sciences/Diseases"},{"id":62957422,"name":"Health sciences/Health care"},{"id":62957423,"name":"Health sciences/Medical research"},{"id":62957424,"name":"Health sciences/Risk factors"}],"tags":[],"updatedAt":"2026-04-06T23:53:44+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-17 14:35:38","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8745082","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8745082","identity":"rs-8745082","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}