Generative Artificial Intelligence versus Anesthesia-Intensive Care Resident Physicians: Performance in Clinical Case Analysis Using the R-IDEA Score

Research Article

ELMAHDI EZZIKOURI, SABAH BENHAMZA, MOHAMMED BENNANI OTHMANI, MOHAMED LAZRAQ
Affiliation (all authors): Hassan II University. Corresponding author: ELMAHDI EZZIKOURI.

Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9540519/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1).

Abstract

Background
Large language models (LLMs) have shown encouraging results in the medical field, with recent work demonstrating their capacity to encode substantial clinical knowledge [12] and pass medical licensing examinations [5, 6].
However, most existing studies evaluate diagnostic accuracy without assessing the quality of the underlying reasoning process, and none has used a validated instrument in a French-speaking African academic context. This study aimed to compare the quality of clinical reasoning between a generative AI and anesthesia-intensive care resident physicians using the R-IDEA score.

Methods
Prospective cross-sectional comparative study conducted at CHU Ibn Rochd, Casablanca, Morocco (September–October 2025). Twenty anonymized clinical cases were submitted to all fourth-year anesthesia-intensive care residents and to ChatGPT-5. Two faculty members developed a standardized R-IDEA rubric by consensus. All responses were scored by a single blinded evaluator. Diagnostic accuracy was recorded as a binary variable.

Results
Fourteen residents were included, generating 56 responses (4 cases each); ChatGPT-5 produced 20 responses. The AI achieved a mean R-IDEA score of 9.5 ± 0.9 versus 7.0 ± 2.4 for residents (difference: +2.5 points; 95% CI: 1.4–3.6; p < 0.001; Cohen's d = 1.25). The AI outperformed residents in all four R-IDEA domains, particularly Interpretive Summary (3.8 vs. 2.8; p < 0.001) and Alternatives (1.9 vs. 1.3; p = 0.002), with substantially greater homogeneity (CV: 9% vs. 34%). Diagnostic accuracy did not differ significantly (75% vs. 68%; p = 0.47).

Conclusions
In this single-centre study, AI produced more structured and consistent clinical reasoning than end-of-training residents, with comparable diagnostic accuracy. These findings highlight the potential complementary role of AI in medical education and clinical practice, particularly in resource-limited settings.
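The homogeneity figures quoted in the abstract's Results can be checked from the reported means and spreads, because a coefficient of variation (CV) is the standard deviation divided by the mean. The snippet below is a minimal sketch that uses only the summary values quoted in the abstract. It assumes the ± values are standard deviations, and the function name `coefficient_of_variation` is illustrative rather than part of the study's analysis.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """Return the coefficient of variation (SD / mean) as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return 100.0 * sd / mean


# Summary statistics as reported in the abstract: (mean, SD).
# Assumption: the "±" values are standard deviations.
residents = (7.0, 2.4)
ai = (9.5, 0.9)

cv_residents = coefficient_of_variation(*residents)
cv_ai = coefficient_of_variation(*ai)
mean_difference = ai[0] - residents[0]

print(f"Residents CV: {cv_residents:.0f}%")               # Residents CV: 34%
print(f"AI CV: {cv_ai:.0f}%")                             # AI CV: 9%
print(f"Mean difference: {mean_difference:+.1f} points")  # Mean difference: +2.5 points
```

The rounded values agree with the reported CVs of 34% and 9% and with the reported difference of 2.5 points.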
Keywords: artificial intelligence, clinical reasoning, R-IDEA, ChatGPT, anesthesiology, medical diagnosis, medical education

INTRODUCTION

Biomedical knowledge is growing at an unprecedented rate: PubMed now indexes more than 35 million publications, and the doubling time of medical knowledge has been estimated at 73 days [1, 2]. Integrating this information while maintaining rigorous diagnostic reasoning is a major challenge for clinicians. In the United States alone, diagnostic errors affect an estimated 12 million adults annually, and the majority arise not from knowledge deficits but from failures in the reasoning process itself [3, 4]. Against this backdrop, artificial intelligence (AI) has emerged as a potentially transformative tool across multiple domains of healthcare [25].

In particular, large language models (LLMs), built on transformer architectures [19], have attracted growing interest. Beyond image-based applications such as skin cancer classification [17] and diabetic retinopathy screening [18], LLMs have demonstrated the capacity to encode substantial clinical knowledge [12] and to generate medically relevant reasoning [13]. Several models now pass United States medical licensing examinations [5, 6], and their ability to analyse clinical cases has drawn considerable attention. The question is no longer whether AI can pass an examination, but whether it can produce genuinely useful clinical reasoning when confronted with complex problems.

Multiple studies have compared AI and physician performance [7–10]; however, most use diagnostic accuracy as their primary outcome. This approach overlooks a critical dimension: the quality of the reasoning process that leads to the diagnosis. The R-IDEA score, validated by Schaye et al. [11], was specifically designed to assess this process across four domains (Interpretive Summary, Differential Diagnosis, Illness Explanation, and consideration of Alternatives), yielding a total score of 0 to 10. To our knowledge, no study has applied this instrument in a French-speaking African academic context.

The objective of this study was to compare the quality of clinical reasoning between ChatGPT-5 and fourth-year anesthesia-intensive care residents at CHU Ibn Rochd, Casablanca, using the R-IDEA score as the primary outcome measure.

METHODS

Study Design and Setting
We conducted a prospective cross-sectional comparative study in the Department of Anesthesia and Intensive Care, CHU Ibn Rochd, Casablanca, Morocco. Data collection took place between September 17 and October 8, 2025.

Participants
Fourteen fourth-year (R4) anesthesia-intensive care residents who provided written informed consent were included. The AI system evaluated was ChatGPT-5 (OpenAI, October 2025 version), queried using standardized prompts written in French.

Clinical Cases
Twenty real, fully anonymized clinical cases were selected from departmental records by M. Lazraq (attending anesthesiologist), covering neurology (n = 7), pulmonology (n = 4), cardiology (n = 3), endocrinology (n = 3), and other specialties (n = 3). Cases were randomly allocated to residents, with each resident receiving 4 cases, generating a total of 56 responses. ChatGPT-5 analysed all 20 cases, producing 20 responses.

Evaluation Protocol
Residents analysed their assigned cases individually, with no time restriction and access to reference materials. ChatGPT-5 received identical clinical presentations via standardized prompts. Two faculty members (S.B. and M.B.O.) developed a standardized R-IDEA rubric by consensus prior to data collection. All responses, from residents and AI alike, were then scored by M. Lazraq, blinded to the source of each response.
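The allocation and blinding steps described above can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: the paper does not say how cases were randomized or how responses were masked, so the sampling scheme (each resident independently draws 4 distinct cases out of 20), the fixed seed, and the names `allocate_cases` and `blind_responses` are all assumptions.

```python
import random

N_CASES = 20            # clinical cases, as reported
N_RESIDENTS = 14        # fourth-year residents, as reported
CASES_PER_RESIDENT = 4  # cases per resident, as reported


def allocate_cases(rng: random.Random) -> dict[str, list[int]]:
    """Assign each resident 4 distinct cases drawn at random (assumed scheme)."""
    case_ids = list(range(1, N_CASES + 1))
    return {
        f"resident_{r:02d}": sorted(rng.sample(case_ids, CASES_PER_RESIDENT))
        for r in range(1, N_RESIDENTS + 1)
    }


def blind_responses(allocation: dict[str, list[int]], rng: random.Random):
    """Pool resident and AI responses, shuffle them, and hide the source.

    Returns (blinded, key): `blinded` is what the evaluator would see,
    `key` maps each opaque response ID back to its source.
    """
    pooled = [(source, case) for source, cases in allocation.items() for case in cases]
    pooled += [("ai", case) for case in range(1, N_CASES + 1)]  # the AI answers every case
    rng.shuffle(pooled)
    blinded = [
        {"response_id": f"R{i:03d}", "case": case}
        for i, (_, case) in enumerate(pooled, start=1)
    ]
    key = {f"R{i:03d}": source for i, (source, _) in enumerate(pooled, start=1)}
    return blinded, key


rng = random.Random(2025)  # arbitrary seed, only to make the sketch reproducible
allocation = allocate_cases(rng)
blinded, key = blind_responses(allocation, rng)

n_resident = sum(1 for source in key.values() if source != "ai")
n_ai = sum(1 for source in key.values() if source == "ai")
print(n_resident, n_ai, len(blinded))  # 56 20 76
```

The counts match the 56 resident responses and 20 AI responses reported in the paper. With independent draws of this kind, some cases are answered by several residents and others possibly by none; a balanced design would need a constrained allocation instead.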
Diagnostic accuracy was recorded as a binary outcome (correct/incorrect).

Statistical Analysis
Normality of distributions was verified using the Shapiro-Wilk test (residents: W = 0.94, p = 0.12; AI: W = 0.91, p = 0.08). R-IDEA scores were compared using a paired Student's t-test, with pairing by clinical case. Diagnostic accuracy was compared using Fisher's exact test. Effect size was estimated using Cohen's d. All tests were two-tailed, with a significance threshold of α = 0.05.

Power Calculation
A priori power analysis estimated that 64 paired observations were required to detect a medium effect (d = 0.5) at α = 0.05 with 80% power. The final sample of 56 resident responses was slightly below this target; however, the observed effect size greatly exceeded the anticipated value (d = 1.25), and post-hoc analysis confirmed statistical power exceeding 80%.

Ethical Considerations
The study was conducted in accordance with the principles of the Declaration of Helsinki. All residents provided written informed consent. Clinical cases were fully anonymized and individual performance had no academic consequences. The study received approval from the local ethics committee of CHU Ibn Rochd.

RESULTS

Fourteen residents participated in the study. A total of 76 responses were evaluated: 56 from residents (4 cases each) and 20 from ChatGPT-5 (one per case).

Overall R-IDEA Scores
The AI achieved a mean R-IDEA score of 9.5 ± 0.9 versus 7.0 ± 2.4 for residents, a statistically significant difference of 2.5 points (95% CI: 1.4–3.6; p < 0.001; paired t-test; Cohen's d = 1.25). AI responses also showed substantially greater homogeneity, with a coefficient of variation of 9% compared to 34% among residents (Table 1). The distribution of scores is illustrated in Figure 1.

Table 1. Overall Comparison of R-IDEA Scores

| Parameter | Residents | AI (ChatGPT-5) | p-value |
|---|---|---|---|
| Mean R-IDEA score | 7.0 ± 2.4 | 9.5 ± 0.9 | < 0.001 |
| Median | 7.5 | 10.0 | - |
| Range | 2.0–10.0 | 7.0–10.0 | - |
| Coefficient of variation (%) | 34 | 9 | - |
| Cohen's d | - | 1.25 | - |

Domain-Level Analysis
The AI advantage was consistent across all four R-IDEA domains (Table 2). The largest difference was observed for Interpretive Summary (+1.0 point; p < 0.001), followed by Alternatives (+0.6 point; p = 0.002). AI scores approached ceiling in most domains, while resident scores showed substantial inter-individual variability. Domain-level performance by diagnostic category is presented in Figure 2.

Table 2. R-IDEA Scores by Domain

| Domain (score range) | Residents | AI | Difference | p-value |
|---|---|---|---|---|
| I – Interpretive Summary (0–4) | 2.8 ± 1.0 | 3.8 ± 0.4 | +1.0 | < 0.001 |
| D – Differential Diagnosis (0–2) | 1.5 ± 0.6 | 1.9 ± 0.2 | +0.4 | 0.008 |
| E – Illness Explanation (0–2) | 1.4 ± 0.7 | 1.9 ± 0.2 | +0.5 | 0.004 |
| A – Alternatives (0–2) | 1.3 ± 0.8 | 1.9 ± 0.3 | +0.6 | 0.002 |

Diagnostic Accuracy
Despite its superiority in reasoning quality, the AI did not demonstrate a significant advantage in diagnostic accuracy: 75% correct diagnoses versus 68% for residents (p = 0.47, Fisher's exact test).

DISCUSSION

Our findings demonstrate that ChatGPT-5 produces more structured and consistent clinical reasoning than fourth-year anesthesia-intensive care residents, as measured by the R-IDEA score (9.5 vs. 7.0; p < 0.001; d = 1.25). This advantage in reasoning quality did not translate into superior diagnostic accuracy (75% vs. 68%; p = 0.47). These findings are broadly consistent with prior work by Katz et al. [7] and Cabral et al. [8]; to our knowledge, this is the first study to apply a validated reasoning instrument in a French-speaking African academic context.

Domain-level analysis reveals that the AI performed particularly well on Interpretive Summary and Alternatives.
These tasks (synthesizing large volumes of clinical data and systematically enumerating competing hypotheses) are well suited to the structural characteristics of LLMs, which process information in parallel without the working memory constraints that burden human reasoning [14, 15]. The pronounced inter-individual variability among residents (CV: 34%) likely reflects differences in experience, susceptibility to cognitive biases, and fatigue effects, consistent with established frameworks of clinical problem-solving [23] and the role of experiential knowledge in expert reasoning [24]. Notably, AI scores approached ceiling in several domains, raising a methodological concern: the R-IDEA rubric was designed for clinicians in training and may lack discriminative power when applied to advanced AI systems optimized for structured text generation.

The dissociation between reasoning quality and diagnostic accuracy warrants attention. Residents sometimes reach the correct diagnosis through pattern recognition and clinical intuition (Kahneman's System 1 [14]) even when their written reasoning remains incomplete. Conversely, the AI can produce formally complete reasoning yet fail to discriminate between competing diagnoses in atypical presentations. This phenomenon highlights a specific risk of LLMs: a seemingly rigorous reasoning structure may mask genuine diagnostic uncertainty, with a potential for under-signalling in complex cases [13], a concern that warrants evaluation in real clinical settings.

Error-profile analysis reveals distinct vulnerabilities. Residents committed errors consistent with classic cognitive biases, namely anchoring and premature closure [4, 15, 16]. AI errors were less frequent but qualitatively different: failure to account for local epidemiological context (Moroccan disease prevalences), limited integration of culturally relevant factors, and, in several complex cases, an exhaustive enumeration of differential diagnoses that impaired the clarity of the reasoning. These observations are consistent with documented risks of algorithmic systems in healthcare, including susceptibility to bias when local context deviates from training data distributions [22]. This last point aligns with Rudin's critique that formal transparency does not guarantee meaningful interpretability in machine reasoning systems [26].

From a pedagogical standpoint, these findings open concrete possibilities. AI-generated responses could serve as structured references for teaching clinical reasoning, providing residents with an explicit standard against which to benchmark their own approach. When supervised by faculty, such comparison-based feedback could help identify and remediate specific gaps, particularly in Interpretive Summary and systematic consideration of alternatives, without replacing direct clinical supervision [20, 21].

Limitations
Several limitations must be acknowledged. The study is single-centre with a small sample (14 residents, 20 cases), limiting generalizability. Scoring was performed by a single blinded evaluator; while this ensures internal consistency, the absence of an independent second rater precludes inter-rater reliability calculation, and individual bias cannot be excluded. Conditions were not fully equivalent: although neither group faced time restrictions or resource limitations, residents, unlike the AI, are subject to fatigue and the psychological effects of academic evaluation. Case order was not randomized for residents. Finally, the R-IDEA score, though validated, does not capture all dimensions of clinical competence, notably therapeutic planning, patient communication, and uncertainty management in real-world clinical encounters. Multi-centre studies with larger samples, multiple blinded evaluators, and head-to-head comparisons of several AI models are needed to confirm and generalize these findings.

CONCLUSION

In this single-centre study conducted in a French-speaking African academic context, ChatGPT-5 demonstrated more structured and homogeneous clinical reasoning than end-of-training anesthesia-intensive care residents, with comparable diagnostic accuracy. These findings do not suggest that AI should replace the physician. Rather, they underscore the complementarity of their respective strengths: AI excels in systematicity, exhaustiveness, and consistency; the clinician brings contextual understanding, diagnostic intuition, and ethical judgment. How to best integrate these complementary strengths, in medical education and in clinical practice, represents a promising and practically relevant research agenda, particularly in resource-limited settings.

Declarations

Ethics approval and consent to participate
The study was conducted in accordance with the principles of the Declaration of Helsinki. Ethical approval was obtained from the local ethics committee of CHU Ibn Rochd, Casablanca, Morocco. All participants provided written informed consent prior to enrolment. Individual performance data had no academic consequences for any participant.

Consent for publication
Not applicable. This study does not contain data from any individual person. All clinical cases used were fully anonymized prior to analysis.

Availability of data and material
The anonymized datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare no competing interests.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Authors' contributions
EE: conceptualization, methodology, data collection, statistical analysis, manuscript writing, and project administration. SB: supervision, validation, R-IDEA rubric development, and critical revision of the manuscript. MBO: supervision, methodology, R-IDEA rubric development, and manuscript revision. ML: clinical case selection and blinded scoring of all responses. All authors read and approved the final manuscript.

Acknowledgements
The authors thank all fourth-year anesthesia-intensive care residents at CHU Ibn Rochd who participated in this study. The authors also acknowledge the support of the Faculty of Medicine and Pharmacy, Hassan II University, Casablanca.

References

1. Densen P. Challenges and opportunities facing medical education. Trans Am Clin Climatol Assoc. 2011;122:48–58.
2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. doi:10.1038/s41591-018-0300-7.
3. Institute of Medicine. Improving Diagnosis in Health Care. Washington, DC: National Academies Press; 2015.
4. Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84(8):1022–1028. doi:10.1097/ACM.0b013e3181ace703.
5. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198.
6. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on Medical Challenge Problems. arXiv:2303.13375. 2023.
7. Katz S, Ngo B, Basu A, et al. GPT Versus Resident Physicians. NEJM AI. 2024;1(4):AIoa2300138. doi:10.1056/AIoa2300138.
8. Cabral S, Restrepo D, Kanjee Z, et al. Clinical Reasoning of a Generative AI Model Compared With Physicians. JAMA Intern Med. 2024;184(5):581–583. doi:10.1001/jamainternmed.2024.0295.
9. Strong E, DiGiammarino A, Weng Y, et al. Chatbot vs Medical Student Performance. JAMA Intern Med. 2023;183(9):1028–1030. doi:10.1001/jamainternmed.2023.2909.
10. Eriksen AV, Möller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI. 2023;1(1):AIp2300031.
11. Schaye VE, Guzman BJ, Engel H, et al. Validation of the R-IDEA Rubric. J Gen Intern Med. 2022;37(3):507–514. doi:10.1007/s11606-021-07038-z.
12. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–180. doi:10.1038/s41586-023-06291-2.
13. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388(13):1233–1239. doi:10.1056/NEJMsr2214184.
14. Kahneman D. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux; 2011.
15. Croskerry P. Achieving quality in clinical decision making. Acad Emerg Med. 2002;9(11):1184–1204. doi:10.1197/aemj.9.11.1184.
16. Norman GR, Eva KW. Diagnostic error and clinical reasoning. Med Educ. 2010;44(1):94–100. doi:10.1111/j.1365-2923.2009.03507.x.
17. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer. Nature. 2017;542(7639):115–118. doi:10.1038/nature21056.
18. Gulshan V, Peng L, Coram M, et al. Deep Learning Algorithm for Diabetic Retinopathy. JAMA. 2016;316(22):2402–2410. doi:10.1001/jama.2016.17218.
19. Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. Adv Neural Inf Process Syst. 2017;30:5998–6008.
20. Durning SJ, Artino AR, Boulet JR, et al. Contextual factors and clinical reasoning. Adv Health Sci Educ. 2012;17(1):65–79. doi:10.1007/s10459-011-9294-3.
21. Eva KW. What every teacher needs to know about clinical reasoning. Med Educ. 2005;39(1):98–106. doi:10.1111/j.1365-2929.2004.01972.x.
22. Obermeyer Z, et al. Dissecting racial bias in an algorithm. Science. 2019;366(6464):447–453. doi:10.1126/science.aax2342.
23. Elstein AS, Schwarz A. Clinical problem solving. BMJ. 2002;324(7339):729–732. doi:10.1136/bmj.324.7339.729.
24. Patel VL, Groen GJ. The general and specific nature of medical expertise. In: Toward a General Theory of Expertise. Cambridge University Press; 1991:93–125.
25. Davenport T, Kalakota R. Artificial intelligence in healthcare. Future Healthc J. 2019;6(2):94–98. doi:10.7861/futurehosp.6-2-94.
26. Rudin C. Stop explaining black box machine learning models. Nat Mach Intell. 2019;1(5):206–215. doi:10.1038/s42256-019-0048-x.

Additional Declarations
No competing interests reported.

© Research Square 2026 | ISSN 2693-5015 (online)
Figure 1. Distribution of Overall R-IDEA Scores: Resident Physicians vs. AI (ChatGPT-5). Violin plots show score distribution; box plots show median and interquartile range; dots represent individual responses (jittered). CV = coefficient of variation.

Figure 2. R-IDEA Scores by Diagnostic Category: Nosological Variability. Error bars = ± SD. Residents showed greater variability in neurology and endocrinology cases. Dashed lines indicate overall group means. AI performance remained stable across all diagnostic categories.
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 147px;\"\u003e\n \u003cp\u003e7.0\u0026ndash;10.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 227px;\"\u003e\n \u003cp\u003eCoefficient of variation (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 147px;\"\u003e\n \u003cp\u003e34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 147px;\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 227px;\"\u003e\n \u003cp\u003eCohen\u0026apos;s d\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 147px;\"\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 147px;\"\u003e\n \u003cp\u003e1.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eDomain-Level Analysis\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe AI advantage was consistent across all four R-IDEA domains (Table 2). The largest difference was observed for Interpretive Summary (+1.0 point; p \u0026lt; 0.001), followed by Alternatives (+0.6 point; p = 0.002). AI scores approached ceiling in most domains, while resident scores showed substantial inter-individual variability. Domain-level performance by diagnostic category is presented in Figure 2.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2. 
R-IDEA Scores by Domain\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"624\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 207px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDomain\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eResidents\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAI\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 100px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDifference\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 104px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 207px;\"\u003e\n \u003cp\u003eI \u0026ndash; Interpretive Summary (0\u0026ndash;4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e2.8 \u0026plusmn; 1.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e3.8 \u0026plusmn; 0.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003e+1.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u0026lt; 0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 207px;\"\u003e\n \u003cp\u003eD \u0026ndash; Differential Diagnosis (0\u0026ndash;2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e1.5 \u0026plusmn; 0.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e1.9 \u0026plusmn; 0.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
100px;\"\u003e\n \u003cp\u003e+0.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e0.008\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 207px;\"\u003e\n \u003cp\u003eE \u0026ndash; Illness Explanation (0\u0026ndash;2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e1.4 \u0026plusmn; 0.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e1.9 \u0026plusmn; 0.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003e+0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e0.004\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 207px;\"\u003e\n \u003cp\u003eA \u0026ndash; Alternatives (0\u0026ndash;2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e1.3 \u0026plusmn; 0.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 107px;\"\u003e\n \u003cp\u003e1.9 \u0026plusmn; 0.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003e+0.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e0.002\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eDiagnostic Accuracy\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDespite its superiority in reasoning quality, the AI did not demonstrate a significant advantage in diagnostic accuracy: 75% correct diagnoses versus 68% for residents (p = 0.47, Fisher\u0026apos;s exact test).\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eOur findings demonstrate that ChatGPT-5 produces more 
structured and consistent clinical reasoning than fourth-year anesthesia-intensive care residents, as measured by the R-IDEA score (9.5 vs. 7.0; p \u0026lt; 0.001; d = 1.25). This advantage in reasoning quality did not translate into superior diagnostic accuracy (75% vs. 68%; p = 0.47). These findings are broadly consistent with prior work by Katz et al. [7] and Cabral et al. [8]; to our knowledge, this is the first study to apply a validated reasoning instrument in a French-speaking African academic context.\u003c/p\u003e\n\u003cp\u003eDomain-level analysis reveals that the AI performed particularly well on Interpretive Summary and Alternatives. These tasks \u0026mdash; synthesizing large volumes of clinical data and systematically enumerating competing hypotheses \u0026mdash; are well-suited to the structural characteristics of LLMs, which process information in parallel without the working memory constraints that burden human reasoning [14,15]. The pronounced inter-individual variability among residents (CV: 34%) likely reflects differences in experience, susceptibility to cognitive biases, and fatigue effects, consistent with established frameworks of clinical problem-solving [23] and the role of experiential knowledge in expert reasoning [24]. Notably, AI scores approached ceiling in several domains, raising a methodological concern: the R-IDEA rubric was designed for clinicians in training and may lack discriminative power when applied to advanced AI systems optimized for structured text generation.\u003c/p\u003e\n\u003cp\u003eThe dissociation between reasoning quality and diagnostic accuracy warrants attention. Residents sometimes reach the correct diagnosis through pattern recognition and clinical intuition \u0026mdash; Kahneman\u0026apos;s System 1 [14] \u0026mdash; even when their written reasoning remains incomplete. Conversely, the AI can produce formally complete reasoning yet fail to discriminate between competing diagnoses in atypical presentations. 
This phenomenon highlights a specific risk of LLMs: a seemingly rigorous reasoning structure may mask genuine diagnostic uncertainty, with a potential for under-signalling in complex cases [13] \u0026mdash; a concern that warrants evaluation in real clinical settings.\u003c/p\u003e\n\u003cp\u003eError-profile analysis reveals distinct vulnerabilities. Residents committed errors consistent with classic cognitive biases \u0026mdash; anchoring and premature closure [4,15,16]. AI errors were less frequent but qualitatively different: failure to account for local epidemiological context (Moroccan disease prevalences), limited integration of culturally relevant factors, and, in several complex cases, an exhaustive enumeration of differential diagnoses that impaired the clarity of the reasoning. These observations are consistent with documented risks of algorithmic systems in healthcare, including susceptibility to bias when local context deviates from training data distributions [22]. This last point aligns with Rudin\u0026apos;s critique that formal transparency does not guarantee meaningful interpretability in machine reasoning systems [26].\u003c/p\u003e\n\u003cp\u003eFrom a pedagogical standpoint, these findings open concrete possibilities. AI-generated responses could serve as structured references for teaching clinical reasoning, providing residents with an explicit standard against which to benchmark their own approach. When supervised by faculty, such comparison-based feedback could help identify and remediate specific gaps \u0026mdash; particularly in Interpretive Summary and systematic consideration of alternatives \u0026mdash; without replacing direct clinical supervision [20,21].\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eLimitations\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSeveral limitations must be acknowledged. The study is single-centre with a small sample (14 residents, 20 cases), limiting generalizability. 
Scoring was performed by a single blinded evaluator; while this ensures internal consistency, the absence of an independent second rater precludes inter-rater reliability calculation, and individual bias cannot be excluded. Conditions were not fully equivalent: although neither group faced time restrictions or resource limitations, residents \u0026mdash; unlike the AI \u0026mdash; are subject to fatigue and the psychological effects of academic evaluation. Case order was not randomized for residents. Finally, the R-IDEA score, though validated, does not capture all dimensions of clinical competence \u0026mdash; notably therapeutic planning, patient communication, and uncertainty management in real-world clinical encounters. Multi-centre studies with larger samples, multiple blinded evaluators, and head-to-head comparisons of several AI models are needed to confirm and generalize these findings.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003eIn this single-centre study conducted in a French-speaking African academic context, ChatGPT-5 demonstrated more structured and homogeneous clinical reasoning than end-of-training anesthesia-intensive care residents, with comparable diagnostic accuracy. These findings do not suggest that AI should replace the physician. Rather, they underscore the complementarity of their respective strengths: AI excels in systematicity, exhaustiveness, and consistency; the clinician brings contextual understanding, diagnostic intuition, and ethical judgment. 
How to best integrate these complementary strengths \u0026mdash; in medical education and in clinical practice \u0026mdash; represents a promising and practically relevant research agenda, particularly in resource-limited settings.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e\u003cem\u003eEthics approval and consent to participate\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study was conducted in accordance with the principles of the Declaration of Helsinki. Ethical approval was obtained from the local ethics committee of CHU Ibn Rochd, Casablanca, Morocco. All participants provided written informed consent prior to enrolment. Individual performance data had no academic consequences for any participant.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eConsent for publication\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable. This study does not contain data from any individual person. All clinical cases used were fully anonymized prior to analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eAvailability of data and material\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe anonymized datasets generated and analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eCompeting interests\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eFunding\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eAuthors\u0026apos; contributions\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEE: conceptualization, methodology, data collection, 
statistical analysis, manuscript writing, and project administration. SB: supervision, validation, R-IDEA rubric development, and critical revision of the manuscript. MBO: supervision, methodology, R-IDEA rubric development, and manuscript revision. ML: clinical case selection and blinded scoring of all responses. All authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eAcknowledgements\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors thank all fourth-year anesthesia-intensive care residents at CHU Ibn Rochd who participated in this study. The authors also acknowledge the support of the Faculty of Medicine and Pharmacy, Hassan II University, Casablanca.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eDensen P. Challenges and opportunities facing medical education. Trans Am Clin Climatol Assoc. 2011;122:48\u0026ndash;58.\u003c/li\u003e\n\u003cli\u003eTopol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44\u0026ndash;56. doi:10.1038/s41591-018-0300-7.\u003c/li\u003e\n\u003cli\u003eInstitute of Medicine. Improving Diagnosis in Health Care. Washington, DC: National Academies Press; 2015.\u003c/li\u003e\n\u003cli\u003eCroskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84(8):1022\u0026ndash;1028. doi:10.1097/ACM.0b013e3181ace703.\u003c/li\u003e\n\u003cli\u003eKung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198.\u003c/li\u003e\n\u003cli\u003eNori H, King N, McKinney SM, et al. Capabilities of GPT-4 on Medical Challenge Problems. arXiv:2303.13375. 2023.\u003c/li\u003e\n\u003cli\u003eKatz S, Ngo B, Basu A, et al. GPT Versus Resident Physicians. NEJM AI. 2024;1(4):AIoa2300138. doi:10.1056/AIoa2300138.\u003c/li\u003e\n\u003cli\u003eCabral S, Restrepo D, Kanjee Z, et al. 
Clinical Reasoning of a Generative AI Model Compared With Physicians. JAMA Intern Med. 2024;184(5):581\u0026ndash;583. doi:10.1001/jamainternmed.2024.0295.\u003c/li\u003e\n\u003cli\u003eStrong E, DiGiammarino A, Weng Y, et al. Chatbot vs Medical Student Performance. JAMA Intern Med. 2023;183(9):1028\u0026ndash;1030. doi:10.1001/jamainternmed.2023.2909.\u003c/li\u003e\n\u003cli\u003eEriksen AV, M\u0026ouml;ller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI. 2023;1(1):AIp2300031.\u003c/li\u003e\n\u003cli\u003eSchaye VE, Guzman BJ, Engel H, et al. Validation of the R-IDEA Rubric. J Gen Intern Med. 2022;37(3):507\u0026ndash;514. doi:10.1007/s11606-021-07038-z.\u003c/li\u003e\n\u003cli\u003eSinghal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172\u0026ndash;180. doi:10.1038/s41586-023-06291-2.\u003c/li\u003e\n\u003cli\u003eLee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388(13):1233\u0026ndash;1239. doi:10.1056/NEJMsr2214184.\u003c/li\u003e\n\u003cli\u003eKahneman D. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux; 2011.\u003c/li\u003e\n\u003cli\u003eCroskerry P. Achieving quality in clinical decision making. Acad Emerg Med. 2002;9(11):1184\u0026ndash;1204. doi:10.1197/aemj.9.11.1184.\u003c/li\u003e\n\u003cli\u003eNorman GR, Eva KW. Diagnostic error and clinical reasoning. Med Educ. 2010;44(1):94\u0026ndash;100. doi:10.1111/j.1365-2923.2009.03507.x.\u003c/li\u003e\n\u003cli\u003eEsteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer. Nature. 2017;542(7639):115\u0026ndash;118. doi:10.1038/nature21056.\u003c/li\u003e\n\u003cli\u003eGulshan V, Peng L, Coram M, et al. Deep Learning Algorithm for Diabetic Retinopathy. JAMA. 2016;316(22):2402\u0026ndash;2410. doi:10.1001/jama.2016.17218.\u003c/li\u003e\n\u003cli\u003eVaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. 
Adv Neural Inf Process Syst. 2017;30:5998\u0026ndash;6008.\u003c/li\u003e\n\u003cli\u003eDurning SJ, Artino AR, Boulet JR, et al. Contextual factors and clinical reasoning. Adv Health Sci Educ. 2012;17(1):65\u0026ndash;79. doi:10.1007/s10459-011-9294-3.\u003c/li\u003e\n\u003cli\u003eEva KW. What every teacher needs to know about clinical reasoning. Med Educ. 2005;39(1):98\u0026ndash;106. doi:10.1111/j.1365-2929.2004.01972.x.\u003c/li\u003e\n\u003cli\u003eObermeyer Z, et al. Dissecting racial bias in an algorithm. Science. 2019;366(6464):447\u0026ndash;453. doi:10.1126/science.aax2342.\u003c/li\u003e\n\u003cli\u003eElstein AS, Schwarz A. Clinical problem solving. BMJ. 2002;324(7339):729\u0026ndash;732. doi:10.1136/bmj.324.7339.729.\u003c/li\u003e\n\u003cli\u003ePatel VL, Groen GJ. The general and specific nature of medical expertise. In: Toward a General Theory of Expertise. Cambridge University Press; 1991:93\u0026ndash;125.\u003c/li\u003e\n\u003cli\u003eDavenport T, Kalakota R. Artificial intelligence in healthcare. Future Healthc J. 2019;6(2):94\u0026ndash;98. doi:10.7861/futurehosp.6-2-94.\u003c/li\u003e\n\u003cli\u003eRudin C. Stop explaining black box machine learning models. Nat Mach Intell. 2019;1(5):206\u0026ndash;215. 
doi:10.1038/s42256-019-0048-x.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"artificial intelligence, clinical reasoning, R-IDEA, ChatGPT, anesthesiology, medical diagnosis, medical education","lastPublishedDoi":"10.21203/rs.3.rs-9540519/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9540519/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eLarge language models (LLMs) have shown encouraging results in the medical field, with recent work demonstrating their capacity to encode substantial clinical knowledge [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and pass medical licensing examinations [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. However, most existing studies evaluate diagnostic accuracy without assessing the quality of the underlying reasoning process, and none has used a validated instrument in a French-speaking African academic context. 
This study aimed to compare the quality of clinical reasoning between a generative AI and anesthesia-intensive care resident physicians using the R-IDEA score.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eProspective cross-sectional comparative study conducted at CHU Ibn Rochd, Casablanca, Morocco (September\u0026ndash;October 2025). Twenty anonymized clinical cases were submitted to all fourth-year anesthesia-intensive care residents and to ChatGPT-5. Two faculty members developed a standardized R-IDEA rubric by consensus. All responses were scored by a single blinded evaluator. Diagnostic accuracy was recorded as a binary variable.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eFourteen residents were included, generating 56 responses (4 cases each); ChatGPT-5 produced 20 responses. The AI achieved a mean R-IDEA score of 9.5\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9 versus 7.0\u0026thinsp;\u0026plusmn;\u0026thinsp;2.4 for residents (difference: +2.5 points; 95% CI: 1.4\u0026ndash;3.6; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001; Cohen's d\u0026thinsp;=\u0026thinsp;1.25). The AI outperformed residents in all four R-IDEA domains, particularly Interpretive Summary (3.8 vs. 2.8; p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and Alternatives (1.9 vs. 1.3; p\u0026thinsp;=\u0026thinsp;0.002), with substantially greater homogeneity (CV: 9% vs. 34%). Diagnostic accuracy did not differ significantly (75% vs. 68%; p\u0026thinsp;=\u0026thinsp;0.47).\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eIn this single-centre study, AI produced more structured and consistent clinical reasoning than end-of-training residents, with comparable diagnostic accuracy. 
These findings highlight the potential complementary role of AI in medical education and clinical practice, particularly in resource-limited settings.\u003c/p\u003e","manuscriptTitle":"Generative Artificial Intelligence versus Anesthesia-Intensive Care Resident Physicians: Performance in Clinical Case Analysis Using the R-IDEA Score","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-09 00:50:23","doi":"10.21203/rs.3.rs-9540519/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"18c28d7e-2ee3-4814-bd8b-32efcec17d81","owner":[],"postedDate":"May 9th, 2026","published":true,"recentEditorialEvents":[{"type":"decision","content":"Rejected","date":"2026-05-10T17:33:45+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-05-10T17:39:50+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-09 00:50:23","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9540519","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9540519","identity":"rs-9540519","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
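The summary statistics in Table 1 and the sample-size target in the Power Calculation paragraph can be roughly re-derived from the published figures. What follows is a minimal Python sketch, not the authors' analysis code. The function names are hypothetical. The pooled-SD form of Cohen's d and the normal-approximation sample-size formula for two independent groups are both assumptions, since the paper does not state which variant or software was used.

```python
import math
from statistics import NormalDist


def coefficient_of_variation(mean, sd):
    """Coefficient of variation expressed as a percentage."""
    return 100.0 * sd / mean


def cohens_d_pooled(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d with a pooled SD (assumed variant; the paper does not specify)."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-independent-sample comparison (assumed design, for illustration)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Means, SDs, and response counts as reported in the Results section.
resident_mean, resident_sd, resident_n = 7.0, 2.4, 56
ai_mean, ai_sd, ai_n = 9.5, 0.9, 20

cv_residents = coefficient_of_variation(resident_mean, resident_sd)
cv_ai = coefficient_of_variation(ai_mean, ai_sd)
d_pooled = cohens_d_pooled(ai_mean, ai_sd, ai_n,
                           resident_mean, resident_sd, resident_n)
n_medium_effect = n_per_group(0.5)

print(f"CV residents: {cv_residents:.1f}%")
print(f"CV AI: {cv_ai:.1f}%")
print(f"Cohen's d (pooled SD): {d_pooled:.2f}")
print(f"n per group for d = 0.5: {n_medium_effect}")
```

The two coefficients of variation round to the reported 34% and 9%. The pooled-SD effect size comes out near 1.18 rather than the reported 1.25, which suggests the authors used a different variant or unrounded data. The normal approximation gives 63 per group, one fewer than the reported 64, a gap consistent with the small-sample correction that t-based power software applies.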


License: CC-BY-4.0