Performance of Chatgpt in Simulated Anesthesia Scenarios: A Prospective Comparison with Expert Clinicians

Research Square
Research Article

Agah Abdullah Kahramanlar, Ramazan Ince, Habip Burak Ozgodek

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8384638/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 7

Abstract

Background: This study aimed to evaluate the diagnostic accuracy and clinical validity of ChatGPT’s responses in standardized anesthesia-related scenarios by directly comparing them with expert anesthesiologists' assessments.

Methods: A prospective comparative study was conducted using sixteen hypothetical clinical scenarios reflecting common and critical perioperative conditions (e.g., anaphylaxis, malignant hyperthermia, pulmonary embolism).
Two anesthesiologists independently evaluated the scenarios, and their responses were compared with those generated by ChatGPT (OpenAI, San Francisco, USA). A structured framework assessed diagnostic accuracy, treatment appropriateness, and compliance with international guidelines. Ratings were assigned using a 4-point Likert scale. Inter-rater agreement was analyzed using Cohen’s kappa and weighted kappa statistics. Descriptive statistics were used for categorical variables, and a p-value < 0.05 was considered statistically significant.

Results: ChatGPT correctly identified the diagnosis in 88% (14/16) of scenarios, recognized treatment necessity in 93% (15/16), and recommended the correct first-line treatment in 81% (13/16), yielding an overall concordance of 87%. Inter-rater reliability between the two experts was almost perfect (κ = 0.82). Substantial agreement was observed between ChatGPT and Expert 1 (κ = 0.74) and Expert 2 (κ = 0.71). ChatGPT performed best in life-threatening emergencies but showed limitations in therapeutic sequencing and drug dosage specification.

Conclusions: ChatGPT demonstrated substantial agreement with expert anesthesiologists in high-stakes scenarios, suggesting potential as an adjunctive tool for education and simulation. However, its current limitations in therapeutic nuance and prioritization indicate that it should not be used as an independent clinical decision-making resource in anesthesia practice.

Keywords: Artificial intelligence, ChatGPT, anesthesia, clinical decision support, simulation

Introduction

Artificial intelligence (AI) is becoming an important part of modern medicine, and large language models (LLMs) such as ChatGPT are now widely used by both clinicians and patients (1, 2). These systems can provide information, generate treatment suggestions, and support education. In anesthesia, where clinical decisions must be made quickly and accurately, the potential value of such tools is high.
However, the reliability and scientific accuracy of LLMs in perioperative care are not yet fully understood (3–5).

Anesthesia is a suitable field to test AI because it involves many acute and life-threatening conditions such as anaphylaxis, malignant hyperthermia, airway obstruction, and cardiovascular collapse. In these situations, correct recognition and rapid intervention are essential. Mistakes can result in serious harm. Previous studies have shown that AI can assist in protocol standardization, monitoring, and perioperative risk assessment, but the problem of incorrect or incomplete answers remains a major limitation (6, 7).

Recent evaluations in different medical fields demonstrated that ChatGPT can often produce guideline-based responses, but it may fail in complex or ambiguous cases (8–12). In anesthesiology, such failures may lead to incorrect drug recommendations or delayed recognition of emergencies. Therefore, it is necessary to evaluate ChatGPT in realistic clinical situations. This type of assessment can help to decide whether it may be used as a decision-support tool or should remain limited to education and training.

One practical way to investigate this is to compare ChatGPT answers with those of experienced anesthesiologists in structured clinical scenarios. Simulation cases are already widely used in anesthesia education and provide a safe and standardized method for testing performance (3). Scenarios such as peri-induction hypotension, intraoperative arrhythmias, respiratory problems, or postoperative complications can be applied to both experts and ChatGPT for comparison (13, 14).

We aimed to evaluate the accuracy and clinical validity of ChatGPT’s responses to anesthesia-related clinical scenarios by directly comparing them with expert anesthesiologists’ assessments.

Methods

This study did not require approval from an institutional review board because it did not involve human subjects, patient data, or interventions in clinical care. All scenarios were hypothetical case simulations created for educational and research purposes. The evaluation process was limited to expert opinion and artificial intelligence responses, without any patient participation.

Study Design

We conducted a prospective comparative study in which clinical scenarios commonly encountered in anesthesiology were presented to both ChatGPT (OpenAI, San Francisco, USA) and anesthesiologists. The aim was to evaluate the scientific accuracy and clinical reliability of ChatGPT responses by direct comparison with expert assessments. This design allowed for a structured and reproducible approach to measure agreement levels and to identify discrepancies.

Scenarios

A total of sixteen standardized scenarios were developed, covering different categories of perioperative practice. These included peri-induction hypotension, anaphylaxis, malignant hyperthermia, arrhythmias, airway complications such as laryngeal edema, respiratory events including pneumothorax and hypoventilation, and postoperative complications such as delirium or pulmonary embolism. Each scenario was constructed based on existing literature, practice guidelines, and simulation training materials. Clinical details were presented in a stepwise manner, including patient demographics, anesthetic plan, intraoperative events, vital signs, and progression of symptoms. The number of scenarios was set at sixteen by consensus of both expert groups, to represent the most common and clinically significant situations encountered in anesthesiology practice. This ensured that the study covered widespread and critical perioperative events reflective of real-life cases.
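
The stepwise scenario layout described above could be held in a small record type. The following is a minimal sketch, not the authors' instrument: the class name `Scenario`, its field names, the `as_prompt` helper, and the placeholder values are all hypothetical. Only the five categories of clinical detail (demographics, anesthetic plan, intraoperative events, vital signs, progression of symptoms) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One simulated case. All field names are illustrative assumptions."""

    title: str
    demographics: str
    anesthetic_plan: str
    intraoperative_events: list[str] = field(default_factory=list)
    vital_signs: dict[str, str] = field(default_factory=dict)
    symptom_progression: list[str] = field(default_factory=list)

    def as_prompt(self) -> str:
        """Render the case in the stepwise order listed in the Methods."""
        vitals = ", ".join(f"{name}: {value}" for name, value in self.vital_signs.items())
        sections = [
            ("Patient", self.demographics),
            ("Anesthetic plan", self.anesthetic_plan),
            ("Intraoperative events", "; ".join(self.intraoperative_events)),
            ("Vital signs", vitals),
            ("Progression", "; ".join(self.symptom_progression)),
        ]
        # Skip empty sections so a partially specified case still renders cleanly.
        return "\n".join(f"{label}: {text}" for label, text in sections if text)


# Placeholder values only; these are not taken from the study's scenarios.
example = Scenario(
    title="placeholder case",
    demographics="<age, sex, comorbidities>",
    anesthetic_plan="<planned technique>",
    intraoperative_events=["<event 1>", "<event 2>"],
    vital_signs={"HR": "<value>", "BP": "<value>"},
    symptom_progression=["<later finding>"],
)
print(example.as_prompt())
```

Keeping every case in one fixed shape is one way to make sure the experts and the model receive the same information in the same order.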

Selection of Evaluators

Two anesthesiologists with at least 10 years of independent clinical experience in tertiary care centers were selected as expert evaluators. Their clinical background included both general and subspecialty anesthesia practice. Evaluators were blinded to each other’s responses to avoid bias. ChatGPT was provided with the same scenarios under identical conditions, without additional contextual prompts beyond the case description.

Evaluation Criteria and Guidelines

The evaluators (AAK, RI, HBO) judged ChatGPT’s responses using a structured framework. Key criteria included: (a) accuracy of diagnosis, (b) appropriateness of treatment recommendation, (c) compliance with international guidelines (American Society of Anesthesiologists [ASA] practice parameters, European Society of Anaesthesiology and Intensive Care [ESAIC] guidelines, and World Health Organization recommendations), and (d) clarity and applicability to real clinical practice. Each item was rated on a 4-point Likert scale ranging from “inaccurate” to “fully accurate” (15).

Data Collection

All responses were collected in written form. ChatGPT answers were generated in English without manual editing. Expert responses were recorded independently in structured forms. Data were anonymized, coded, and stored in electronic format. Disagreements between evaluators were resolved by consensus discussions.

Evaluated Parameters

The main parameters assessed included: (a) diagnostic correctness, (b) identification of whether treatment was required, and (c) recognition of first-line treatment options. Additional parameters included response completeness, adherence to guidelines, and the presence of potentially unsafe recommendations. These metrics allowed both qualitative and quantitative assessment of ChatGPT performance.

Statistical Analysis

Descriptive statistics were applied to summarize evaluator ratings and ChatGPT responses.
Categorical variables, such as diagnostic correctness and treatment appropriateness, were presented as percentages. Agreement between ChatGPT and expert anesthesiologists was assessed using Cohen’s kappa coefficient. Weighted kappa was additionally applied for Likert-scale ratings, where partial agreement between adjacent categories (e.g., “partially correct” vs. “fully correct”) was taken into account. The degree of agreement was interpreted according to the classification by Landis and Koch, with κ values of 0.00–0.20 considered slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. Continuous variables, such as response length, were expressed as mean ± standard deviation. A p-value < 0.05 was considered statistically significant. All analyses were conducted using SPSS version 26.0 (IBM Corp., Armonk, NY, USA).

Results

A total of sixteen standardized anesthesia scenarios were evaluated by ChatGPT and two expert anesthesiologists. ChatGPT correctly identified the diagnosis in 14 out of 16 cases (88%) and appropriately determined whether treatment was required in 15 out of 16 cases (93%). The accuracy of recommending the correct first-line treatment was slightly lower, with full agreement observed in 13 cases (81%). The overall concordance rate across all domains was 87%. These data are summarized in Table 1.

Table 1 Overall agreement rates between ChatGPT and expert anesthesiologists

| Evaluation domain | Correct answers (ChatGPT) | Agreement with experts (%) |
|---|---|---|
| Diagnostic accuracy | 14/16 (88%) | 87% |
| Recognition of treatment need | 15/16 (93%) | 90% |
| First-line treatment | 13/16 (81%) | 82% |
| Overall concordance | – | 87% |

Data are presented as absolute numbers and percentages. Agreement refers to concordance between ChatGPT responses and both expert anesthesiologists’ evaluations.
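
The agreement statistics named in the Statistical Analysis section can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the study's analysis, which was run in SPSS. The function names `cohen_kappa` and `landis_koch` are hypothetical, the weighting scheme is assumed to be linear because the text does not say which weights were used, and the "poor" label for negative values comes from the original Landis and Koch scale rather than from this manuscript.

```python
from collections import Counter


def cohen_kappa(a, b, categories, weighted=False):
    """Cohen's kappa for two raters; linear weights if weighted=True (assumed scheme)."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    n = len(a)
    index = {c: i for i, c in enumerate(categories)}
    # Joint counts of (rater A category, rater B category) and each rater's marginals.
    joint = Counter((index[x], index[y]) for x, y in zip(a, b))
    row = Counter(index[x] for x in a)
    col = Counter(index[y] for y in b)

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)  # adjacent categories count as partial disagreement
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(i, j) * c / n for (i, j), c in joint.items())
    expected = sum(
        disagreement(i, j) * (row[i] / n) * (col[j] / n)
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    # Equivalent to (p_o - p_e) / (1 - p_e) in the unweighted case.
    return 1.0 - observed / expected


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch label quoted in the text."""
    if kappa < 0:
        return "poor"  # from the original scale; not listed in this manuscript
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

With ordinal ratings, passing `weighted=True` gives partial credit when two raters choose adjacent Likert categories, which is the behaviour the Methods describe for “partially correct” versus “fully correct”. Applied to the reported values, `landis_koch` labels 0.82 as almost perfect and 0.74 and 0.71 as substantial.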

Inter-rater reliability between the two human experts was excellent, with a Cohen’s kappa value of 0.82, indicating almost perfect agreement. When ChatGPT’s answers were compared with each expert, substantial agreement was observed (κ = 0.74 with Expert 1 and κ = 0.71 with Expert 2). These findings suggest that ChatGPT achieved a performance level comparable to experienced anesthesiologists in most scenarios. The agreement values are summarized in Table 2.

Table 2 Inter-rater reliability between ChatGPT and experts

| Comparison | Cohen’s kappa (κ) | Interpretation* |
|---|---|---|
| Expert 1 – Expert 2 | 0.82 | Almost perfect agreement |
| ChatGPT – Expert 1 | 0.74 | Substantial agreement |
| ChatGPT – Expert 2 | 0.71 | Substantial agreement |

Agreement was assessed using Cohen’s kappa coefficient. *Interpretation according to the Landis and Koch classification: 0.00–0.20 slight; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 substantial; 0.81–1.00 almost perfect.
Abbreviations: O₂, oxygen; κ, Cohen’s kappa coefficient.

Scenario-specific analysis revealed several notable points. In acute conditions such as anaphylaxis, malignant hyperthermia, and pulmonary embolism, ChatGPT provided fully accurate diagnostic and therapeutic recommendations consistent with international guidelines. In vasovagal syncope, the model correctly recognized the diagnosis and initial management but did not specify atropine dosing, which was considered a partial gap. For postoperative delirium, ChatGPT emphasized antipsychotic use earlier than experts, who prioritized opioid dose reduction, hydration, and environmental modification. In aspiration pneumonia, ChatGPT correctly identified the condition but prematurely suggested antibiotic initiation, while experts highlighted airway management and oxygenation as the immediate priority. These results are summarized in Table 3.

Table 3 Scenario-based comparison of ChatGPT and expert responses (selected scenarios)

| Scenario | ChatGPT response | Expert evaluation | Concordance |
|---|---|---|---|
| Anaphylaxis | Correct diagnosis + epinephrine, O₂, fluids | Same approach | Full agreement |
| Malignant hyperthermia | Correct diagnosis + dantrolene | Same approach | Full agreement |
| Vasovagal syncope | Correct diagnosis + fluids, O₂, positioning; atropine missing | Experts emphasized atropine dosing | Partial agreement |
| Postoperative delirium | Early antipsychotic recommendation | Prioritized opioid reduction, hydration, environment | Partial agreement |
| Pulmonary embolism | Correct diagnosis + O₂ + heparin | Same approach | Full agreement |
| Aspiration pneumonia | Correct diagnosis + early antibiotic suggestion | Airway management and oxygen prioritized | Partial agreement |
| Laryngeal edema | O₂, steroids, nebulized adrenaline, re-intubation if needed | Same approach | Full agreement |
| Postoperative nausea/vomiting | Correct diagnosis + ondansetron | Same approach | Full agreement |

Scenarios included peri-induction, intraoperative, and postoperative complications. Concordance was defined as “full” if ChatGPT and experts gave identical answers, “partial” if ChatGPT missed or added secondary steps, and “none” if the response contradicted expert recommendations.
Abbreviations: O₂, oxygen; κ, Cohen’s kappa coefficient.

Overall, ChatGPT demonstrated high diagnostic reliability and substantial concordance with expert evaluations, particularly in life-threatening scenarios requiring urgent intervention. However, minor discrepancies in therapeutic sequencing and omission of specific drug doses highlighted its current limitations for use as an independent decision-making tool in clinical anesthesiology.

Discussion

The present study evaluated the performance of ChatGPT in simulated anesthesia scenarios by directly comparing its answers with those of two experienced anesthesiologists.
The main findings demonstrated that ChatGPT provided correct diagnoses in 88% of cases, identified the need for treatment in 93% of cases, and recommended the correct first-line treatment in 81% of cases. The overall concordance rate was 87%, and inter-rater reliability showed almost perfect agreement between the two experts (κ = 0.82) and substantial agreement between ChatGPT and both experts (κ = 0.74 and κ = 0.71, respectively). These results indicate that ChatGPT can generate responses that are broadly consistent with expert reasoning in critical perioperative conditions but still exhibits important limitations in therapeutic precision.

Our results align with prior reports that large language models can often generate guideline-based outputs, particularly in acute scenarios with well-established management protocols. For example, Gilson et al. reported that ChatGPT achieved 60% accuracy on the United States Medical Licensing Examination, indicating that it could replicate a large portion of clinically relevant knowledge (17). Similarly, Kung et al. found that ChatGPT provided coherent and clinically appropriate responses in internal medicine board questions, though gaps were noted in pharmacology and therapeutic decision-making (18). In anesthesiology specifically, Wan et al. demonstrated that ChatGPT could offer reasonable advice in airway management algorithms but tended to omit critical details such as drug doses and alternative approaches (19). Our study expands on these observations by systematically testing ChatGPT in diverse perioperative scenarios, including hemodynamic instability, arrhythmias, airway emergencies, and postoperative complications, thereby providing a more comprehensive view of its clinical reliability.

The numerical results of this study further highlight both the promise and the shortcomings of ChatGPT.
ChatGPT’s 88% diagnostic accuracy suggests that its large training corpus enables recognition of common anesthetic patterns such as anaphylaxis, malignant hyperthermia, and pulmonary embolism. In these scenarios, where international guidelines provide standardized treatment algorithms, the model was able to reproduce the expected responses with high fidelity. This was evident in its recommendation of epinephrine and fluid resuscitation in anaphylaxis, dantrolene in malignant hyperthermia, and anticoagulation in pulmonary embolism, which matched expert assessments without major deviation. However, in 19% of scenarios, the first-line treatment was either incomplete or suggested prematurely. For example, ChatGPT recommended antipsychotics early in postoperative delirium, while experts emphasized non-pharmacological interventions first. Similarly, in aspiration pneumonia, the model correctly identified the diagnosis but suggested antibiotics earlier than airway management, deviating from guideline priorities.

These findings echo previous reports on the limitations of LLMs. Rosen et al. observed that ChatGPT often generated plausible but incomplete treatment strategies in emergency medicine vignettes (20). Pham et al. showed that while ChatGPT could reproduce core ACLS algorithms, it sometimes confused drug sequencing and dosages (21). Such discrepancies are consistent with our observation that the model is strong in pattern recognition but weaker in therapeutic nuance, particularly when management requires stepwise prioritization rather than simultaneous interventions.

The strength of our study lies in its structured, prospective design with predefined scenarios and standardized evaluation criteria. Unlike retrospective content analyses, we presented identical cases to both experts and the AI model, allowing direct comparison.
Furthermore, the use of Likert-based scoring and kappa statistics provided quantitative evidence of agreement, with κ values ranging from 0.71 to 0.82 confirming substantial to almost perfect concordance. The prospective nature, randomization of scenario order, and blinded expert assessments reduce bias and enhance the reliability of our conclusions. Additional strengths include the focus on real-life perioperative emergencies, the systematic comparison across multiple domains (diagnosis, treatment necessity, first-line therapy), and the application of robust statistical methods such as kappa reliability analysis. However, our study has several limitations. The study was conducted in a single center with only two expert evaluators, which may limit generalizability. The sample size of 16 scenarios, while diverse, may not fully capture the breadth of anesthetic practice. Moreover, the evaluations focused on short-term clinical reasoning rather than long-term patient outcomes, as no actual patients were included. Finally, ChatGPT was tested in English only, and its performance might vary across languages and cultural contexts. These limitations suggest caution in extrapolating our findings beyond the controlled simulation environment. The clinical implications of these results are noteworthy. While ChatGPT achieved high diagnostic accuracy and reasonable concordance with experts, its therapeutic recommendations occasionally lacked detail or sequence accuracy. This supports the view that ChatGPT should not be used as a routine decision-making tool in anesthetic management. Instead, its most appropriate role may be as an adjunct in selected cases, particularly for educational purposes, simulation training, and providing rapid summaries of guideline-based care. 
Similar to how adjunctive hemostatic agents such as FloSeal® or Surgicel® are not required for every partial nephrectomy but may be considered in selected complex cases, ChatGPT may have a role in complementing—but not replacing—expert judgment in anesthesia practice. In daily clinical use, reliance solely on ChatGPT could pose risks due to its occasional inaccuracies, but when combined with expert oversight, it may enhance efficiency, learning, and decision support. Future research should expand this work to multicenter designs with larger cohorts of anesthesiologists and a broader array of scenarios. Such studies could stratify performance by scenario complexity, compare multiple LLMs, and investigate whether iterative prompting improves reliability. Longitudinal evaluations could also assess whether repeated exposure to ChatGPT enhances resident education or simulation training outcomes. Furthermore, incorporating objective outcome measures such as time to recognition of critical events or success in simulated resuscitation would provide deeper insights. Finally, evaluating the model’s integration with electronic health records and its potential for real-time perioperative monitoring represents an important future direction. In conclusion, this study demonstrated that ChatGPT achieved 88% diagnostic accuracy, 93% recognition of treatment need, and 81% concordance in first-line therapy recommendations across sixteen anesthesia scenarios, with overall agreement of 87% and kappa values between 0.71 and 0.82 indicating substantial reliability. While its performance was encouraging in life-threatening conditions such as anaphylaxis and malignant hyperthermia, discrepancies in therapeutic prioritization highlight its limitations as an independent clinical tool. ChatGPT may serve as a valuable adjunct for education and training but should not replace expert judgment in anesthesia practice. 
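
The headline percentages follow directly from counts out of sixteen scenarios, which the short sketch below recomputes. Two points are assumptions rather than statements from the manuscript: the variable names are hypothetical, and pooling the three domains (42 correct out of 48 judgments) is only one possible reading of the 87% overall figure, since the text does not say how that figure was derived. Note that 15/16 is exactly 93.75%, which the manuscript reports as 93%.

```python
from fractions import Fraction

TOTAL_SCENARIOS = 16

# Counts of correct ChatGPT answers as reported in the Results.
correct = {"diagnosis": 14, "treatment_need": 15, "first_line_treatment": 13}


def percent(numerator, denominator):
    """Exact percentage via rational arithmetic, returned as a float."""
    return float(Fraction(numerator, denominator) * 100)


per_domain = {name: percent(count, TOTAL_SCENARIOS) for name, count in correct.items()}

# Assumed pooling rule: all domain judgments combined (not confirmed by the text).
pooled = percent(sum(correct.values()), TOTAL_SCENARIOS * len(correct))

for name, value in per_domain.items():
    print(f"{name}: {value:.2f}%")
print(f"pooled: {pooled:.2f}%")
```

With sixteen scenarios, each single case shifts a domain percentage by 6.25 points, which is worth keeping in mind when comparing the three domains.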

Declarations

Conflict of interest statement: The authors declare no conflicts of interest.

Trial registration number: Not applicable.

Ethics Approval: This study was approved by the University of Health Sciences, Erzurum Faculty of Medicine Scientific Research Ethics Committee (Approval Date: 09.07.2025 / Decision No: 2025/07-194), Chairperson Prof. Hasan Kahveci. Informed consent was obtained from all individual participants (expert clinicians) included in the study.

Consent for Publication: Not applicable.

Competing Interests: The authors declare that they have no competing interests.

Funding: The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author Contribution: AAK and RI conceived and designed the study. AAK, RI, and HBO were responsible for the data collection. AAK performed the statistical analysis and drafted the manuscript. RI and HBO provided critical revisions. All authors read and approved the final manuscript.

Data Availability: The datasets generated and analyzed during the current study (ChatGPT responses, clinician scores, and the structured evaluation framework) are available from the corresponding author on reasonable request. As this study was based on simulated scenarios and did not involve actual patient records, the data can be shared without compromise to patient privacy.

References

1. Du X, Zhou Z, Wang Y, et al. Testing and Evaluation of Generative Large Language Models in Electronic Health Record Applications: A Systematic Review. Preprint medRxiv. 2025;2024.08.11.24311828.
2. Liu X, Wu C, Lai R, et al. ChatGPT: when the artificial intelligence meets standardized patients in clinical training. J Transl Med. 2023;21(1):447.
3. Cheng T, Li Y, Gu J, et al. The performance of ChatGPT in day surgery and pre-anesthesia risk assessment: a case-control study of 150 simulated patient presentations. Perioper Med (Lond). 2024;13(1):111.
4. Chung P, Fong CT, Walters AM, Aghaeepour N, Yetisgen M, O'Reilly-Shah VN. Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication. JAMA Surg. 2024;159(8):928–37.
5. Shimada K, Inokuchi R, Ohigashi T, et al. Artificial intelligence-assisted interventions for perioperative anesthetic management: a systematic review and meta-analysis. BMC Anesthesiol. 2024;24(1):306.
6. Kambale M, Jadhav S. Applications of artificial intelligence in anesthesia: A systematic review. Saudi J Anaesth. 2024;18(2):249–56.
7. Wilk M, Pikiewicz W, Florczak K, Jakóbczak D. Use of Artificial Intelligence in Difficult Airway Assessment: The Current State of Knowledge. J Clin Med. 2025;14(5):1602.
8. Kuas C, Canakci ME, Acar N, Kanbakan A, Cetin M, Gunsoy E. The Potential and Pitfalls of ChatGPT in Toxicological Emergencies. J Emerg Med. 2025;76:17–25.
9. Geneş M, Çelik M. Assessment of ChatGPT's Compliance with ESC-Acute Coronary Syndrome Management Guidelines at 30-Day Intervals. Life (Basel). 2024;14(10):1235.
10. Wei Q, Yao Z, Cui Y, Wei B, Jin Z, Xu X. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inf. 2024;151:104620.
11. Javid M, Bhandari M, Parameshwari P, Reddiboina M, Prasad S. Evaluation of ChatGPT for Patient Counseling in Kidney Stone Clinic: A Prospective Study. J Endourol. 2024;38(4):377–83.
12. Kuo FH, Fierstein JL, Tudor BH, et al. Comparing ChatGPT and a Single Anesthesiologist's Responses to Common Patient Questions: An Exploratory Cross-Sectional Survey of a Panel of Anesthesiologists. J Med Syst. 2024;48(1):77.
13. Noto K, Uchida S, Kinoshita H, Takekawa D, Kushikata T, Hirota K. Predictive model for post-induction hypotension in patients undergoing transcatheter aortic valve implantation: a retrospective observational study. JA Clin Rep. 2024;10(1):33.
14. Ghaffari F, Langarizadeh M, Nabovati E, Sabery M. Effectiveness of ChatGPT for Clinical Scenario Generation: A Qualitative Study. Arch Acad Emerg Med. 2025;13(1):e49.
15. Gavrilov SG, Grishenkova AS, Mishakina NY, Krasavin GV. Use of a novel Likert scale instrument to assess patient satisfaction following endovascular and surgical treatment of pelvic venous disorders. Phlebology. 2022;37(4):241–51.
16. Phelps AS, Naeger DM, Courtier JL, et al. Pairwise comparison versus Likert scale for biomedical image assessment. AJR Am J Roentgenol. 2015;204(1):8–14.
17. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
18. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
19. Wan N, Jin Q, Chan J, et al. Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators. Preprint ArXiv. 2025;arXiv:2411.05897v2.
20. Rosen S, Saban M. Evaluating the reliability of ChatGPT as a tool for imaging test referral: a comparative study with a clinical decision support system. Eur Radiol. 2024;34(5):2826–37.
21. Pham C, Govender R, Tehami S, Chavez S, Adepoju OE, Liaw W. ChatGPT's Performance in Cardiac Arrest and Bradycardia Simulations Using the American Heart Association's Advanced Cardiovascular Life Support Guidelines: Exploratory Study. J Med Internet Res. 2024;26:e55037.

Additional Declarations: No competing interests reported.
Supplementary Files initial.pdf Cite Share Download PDF Status: Under Review Version 1 posted Reviews received at journal 11 Apr, 2026 Reviewers agreed at journal 19 Mar, 2026 Reviewers invited by journal 17 Mar, 2026 Editor invited by journal 05 Mar, 2026 Editor assigned by journal 20 Dec, 2025 Submission checks completed at journal 20 Dec, 2025 First submitted to journal 17 Dec, 2025 You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8384638","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":609052014,"identity":"844b4e85-b8df-4157-ac29-4a2a30284bdd","order_by":0,"name":"Agah Abdullah 
Kahramanlar","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAz0lEQVRIiWNgGAWjYDACdhDBxsDDLwHmSsgQ1sLMDNYiIzmDgbEBqIWHaC02BjfAWhgIazE4zH/sw4+yezzGt5uPP7pRY8HDwH746Ab8WpiZZ/acK+Yxu3MssTnnGNBhPGlpNwhpYeBtS+Axu5Fj2JzDBtQiAWQT0sL4F6jFeAZIyz8itTCDbDGQAGrJbSNCi+RhZmNmmXMJPBI30hJn5/ZJ8LAR8gvf8cbHjG/KEuz5ZyQf+JzzrU6On/3wMbxaMAEbacpHwSgYBaNgFGADAIBWP1EoKXdrAAAAAElFTkSuQmCC","orcid":"","institution":"University of Health Science","correspondingAuthor":true,"prefix":"","firstName":"Agah","middleName":"Abdullah","lastName":"Kahramanlar","suffix":""},{"id":609052019,"identity":"d3d2a97d-bdfc-4f31-b61b-d9931a632087","order_by":1,"name":"Ramazan Ince","email":"","orcid":"","institution":"University of Health Science","correspondingAuthor":false,"prefix":"","firstName":"Ramazan","middleName":"","lastName":"Ince","suffix":""},{"id":609052020,"identity":"a7589a4a-6fd6-471e-a3a1-09dbb51b7440","order_by":2,"name":"Habip Burak Ozgodek","email":"","orcid":"","institution":"University of Health Science","correspondingAuthor":false,"prefix":"","firstName":"Habip","middleName":"Burak","lastName":"Ozgodek","suffix":""}],"badges":[],"createdAt":"2025-12-17 10:38:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8384638/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8384638/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":105563084,"identity":"5ae9107c-989a-401e-9334-7bc5d48937f6","added_by":"auto","created_at":"2026-03-27 12:45:53","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":715474,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8384638/v1/4fd5f572-bc28-4ec2-8303-84d0cf4bc450.pdf"},{"id":105087902,"identity":"5fd7b420-0d3e-4144-80cb-494ba77f9204","added_by":"auto","created_at":"2026-03-20 
20:40:28","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":328731,"visible":true,"origin":"","legend":"","description":"","filename":"initial.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8384638/v1/7f02bd0871ddc1c8b3dfc6fb.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Performance of Chatgpt in Simulated Anesthesia Scenarios: A Prospective Comparison with Expert Clinicians","fulltext":[{"header":"İntroduction","content":"\u003cp\u003eArtificial intelligence (AI) is becoming an important part of modern medicine, and large language models (LLMs) such as ChatGPT are now widely used by both clinicians and patients (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). These systems can provide information, generate treatment suggestions, and support education. In anesthesia, where clinical decisions must be made quickly and accurately, the potential value of such tools is high. However, the reliability and scientific accuracy of LLMs in perioperative care are not yet fully understood. (\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAnesthesia is a suitable field to test AI because it involves many acute and life-threatening conditions such as anaphylaxis, malignant hyperthermia, airway obstruction, and cardiovascular collapse. In these situations, correct recognition and rapid intervention are essential. Mistakes can result in serious harm. 
Previous studies have shown that AI can assist in protocol standardization, monitoring, and perioperative risk assessment, but the problem of incorrect or incomplete answers remains a major limitation (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eRecent evaluations in different medical fields have demonstrated that ChatGPT can often produce guideline-based responses, but it may fail in complex or ambiguous cases (\u003cspan additionalcitationids=\"CR9 CR10 CR11\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). In anesthesiology, such failures may lead to incorrect drug recommendations or delayed recognition of emergencies. Therefore, it is necessary to evaluate ChatGPT in realistic clinical situations. This type of assessment can help determine whether ChatGPT can be used as a decision-support tool or should remain limited to education and training.\u003c/p\u003e \u003cp\u003eOne practical way to investigate this is to compare ChatGPT answers with those of experienced anesthesiologists in structured clinical scenarios. Simulation cases are already widely used in anesthesia education and provide a safe and standardized method for testing performance (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e). 
Scenarios such as peri-induction hypotension, intraoperative arrhythmias, respiratory problems, or postoperative complications can be applied to both experts and ChatGPT for comparison (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eWe aimed to evaluate the accuracy and clinical validity of ChatGPT\u0026rsquo;s responses to anesthesia-related clinical scenarios by directly comparing them with expert anesthesiologists\u0026rsquo; assessments.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eThis study did not require approval from an institutional review board because it did not involve human subjects, patient data, or interventions in clinical care. All scenarios were hypothetical case simulations created for educational and research purposes. The evaluation process was limited to expert opinion and artificial intelligence responses, without any patient participation.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy Design\u003c/h2\u003e \u003cp\u003eWe conducted a prospective comparative study in which clinical scenarios commonly encountered in anesthesiology were presented to both ChatGPT (OpenAI, San Francisco, USA) and anesthesiologists. The aim was to evaluate the scientific accuracy and clinical reliability of ChatGPT responses by direct comparison with expert assessments. This design allowed for a structured and reproducible approach to measure agreement levels and to identify discrepancies.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eScenarios\u003c/h3\u003e\n\u003cp\u003eA total of sixteen standardized scenarios were developed, covering different categories of perioperative practice. 
These included peri-induction hypotension, anaphylaxis, malignant hyperthermia, arrhythmias, airway complications such as laryngeal edema, respiratory events including pneumothorax and hypoventilation, and postoperative complications such as delirium or pulmonary embolism. Each scenario was constructed based on existing literature, practice guidelines, and simulation training materials. Clinical details were presented in a stepwise manner, including patient demographics, anesthetic plan, intraoperative events, vital signs, and progression of symptoms. The sixteen scenarios were selected by consensus of both expert groups to represent the most common and clinically significant situations encountered in anesthesiology practice. This ensured that the study covered common and critical perioperative events reflective of real-life cases.\u003c/p\u003e\n\u003ch3\u003eSelection of Evaluators\u003c/h3\u003e\n\u003cp\u003eTwo anesthesiologists with at least 10 years of independent clinical experience in tertiary care centers were selected as expert evaluators. Their clinical background included both general and subspecialty anesthesia practice. Evaluators were blinded to each other\u0026rsquo;s responses to avoid bias. ChatGPT was provided with the same scenarios under identical conditions, without additional contextual prompts beyond the case description.\u003c/p\u003e\n\u003ch3\u003eEvaluation Criteria and Guidelines\u003c/h3\u003e\n\u003cp\u003eThe evaluators (AAK, RI, HBO) judged ChatGPT\u0026rsquo;s responses using a structured framework. 
Key criteria included: (a) accuracy of diagnosis, (b) appropriateness of treatment recommendation, (c) compliance with international guidelines (American Society of Anesthesiologists [ASA] practice parameters, European Society of Anaesthesiology and Intensive Care [ESAIC] guidelines, and World Health Organization recommendations), and (d) clarity and applicability to real clinical practice. Each item was rated on a Likert scale ranging from \u0026ldquo;inaccurate\u0026rdquo; to \u0026ldquo;fully accurate\u0026rdquo; (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e).\u003c/p\u003e\n\u003ch3\u003eData Collection\u003c/h3\u003e\n\u003cp\u003eAll responses were collected in written form. ChatGPT answers were generated in English without manual editing. Expert responses were recorded independently in structured forms. Data were anonymized, coded, and stored in electronic format. Disagreements between evaluators were resolved by consensus discussion.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eEvaluated Parameters\u003c/h2\u003e \u003cp\u003eThe main parameters assessed included: (a) diagnostic correctness, (b) identification of whether treatment was required, and (c) recognition of first-line treatment options. Additional parameters included response completeness, adherence to guidelines, and the presence of potentially unsafe recommendations. These metrics allowed both qualitative and quantitative assessment of ChatGPT performance.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analysis\u003c/h2\u003e \u003cp\u003eDescriptive statistics were applied to summarize evaluator ratings and ChatGPT responses. Categorical variables, such as diagnostic correctness and treatment appropriateness, were presented as percentages. Agreement between ChatGPT and expert anesthesiologists was assessed using Cohen\u0026rsquo;s kappa coefficient. 
Weighted kappa was additionally applied for Likert-scale ratings, where partial agreement between adjacent categories (e.g., \u0026ldquo;partially correct\u0026rdquo; vs. \u0026ldquo;fully correct\u0026rdquo;) was taken into account. The degree of agreement was interpreted according to the classification by Landis and Koch, with κ values of 0.00\u0026ndash;0.20 considered slight, 0.21\u0026ndash;0.40 fair, 0.41\u0026ndash;0.60 moderate, 0.61\u0026ndash;0.80 substantial, and 0.81\u0026ndash;1.00 almost perfect agreement. Continuous variables, such as response length, were expressed as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation. A p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05 was considered statistically significant. All analyses were conducted using SPSS version 26.0 (IBM Corp., Armonk, NY, USA).\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eA total of sixteen standardized anesthesia scenarios were evaluated by ChatGPT and two expert anesthesiologists. ChatGPT correctly identified the diagnosis in 14 out of 16 cases (88%) and appropriately determined whether treatment was required in 15 out of 16 cases (93%). The accuracy of recommending the correct first-line treatment was slightly lower, with full agreement observed in 13 cases (81%). The overall concordance rate across all domains was 87%. 
These data are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eOverall agreement rates between ChatGPT and expert anesthesiologists\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEvaluation Domain\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect Answers (ChatGPT)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAgreement with Experts (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDiagnostic accuracy\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e14/16 (88%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e87%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eRecognition of treatment need\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e15/16 (93%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eFirst-line treatment\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e13/16 (81%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e82%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eOverall concordance\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026ndash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e87%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eData are presented as absolute numbers and percentages. Agreement refers to concordance between ChatGPT responses and both expert anesthesiologists\u0026rsquo; evaluations.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eO₂\u003c/strong\u003e \u003cp\u003eOxygen\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cb\u003eκ\u003c/b\u003e\u003c/strong\u003e \u003cp\u003eCohen\u0026rsquo;s kappa coefficient\u003c/p\u003e \u003c/p\u003e \u003cp\u003eInter-rater reliability between the two human experts was excellent, with a Cohen\u0026rsquo;s kappa value of 0.82, indicating almost perfect agreement. When ChatGPT\u0026rsquo;s answers were compared with each expert, substantial agreement was observed (κ\u0026thinsp;=\u0026thinsp;0.74 with Expert 1 and κ\u0026thinsp;=\u0026thinsp;0.71 with Expert 2). These findings suggest that ChatGPT achieved a performance level comparable to experienced anesthesiologists in most scenarios. 
These findings are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eInter-rater reliability between ChatGPT and experts\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eComparison\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCohen\u0026rsquo;s Kappa (κ)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eInterpretation*\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eExpert 1 \u0026ndash; Expert 2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAlmost perfect agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eChatGPT \u0026ndash; Expert 1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.74\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSubstantial agreement\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eChatGPT \u0026ndash; Expert 2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSubstantial agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eAgreement was assessed using\u003c/em\u003e \u003cb\u003eCohen\u0026rsquo;s kappa coefficient\u003c/b\u003e. \u003cem\u003e*Interpretation according to Landis and Koch classification: κ\u0026thinsp;\u0026lt;\u0026thinsp;0.20 slight; 0.21\u0026ndash;0.40 fair; 0.41\u0026ndash;0.60 moderate; 0.61\u0026ndash;0.80 substantial; 0.81\u0026ndash;1.00 almost perfect.\u003c/em\u003e\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eO₂\u003c/strong\u003e \u003cp\u003eOxygen\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cb\u003eκ\u003c/b\u003e\u003c/strong\u003e \u003cp\u003eCohen\u0026rsquo;s kappa coefficient\u003c/p\u003e \u003c/p\u003e \u003cp\u003eScenario-specific analysis revealed several notable points. In acute conditions such as anaphylaxis, malignant hyperthermia, and pulmonary embolism, ChatGPT provided fully accurate diagnostic and therapeutic recommendations consistent with international guidelines. In vasovagal syncope, the model correctly recognized the diagnosis and initial management but did not specify atropine dosing, which was considered a partial gap. For postoperative delirium, ChatGPT emphasized antipsychotic use earlier than experts, who prioritized opioid dose reduction, hydration, and environmental modification. In aspiration pneumonia, ChatGPT correctly identified the condition but prematurely suggested antibiotic initiation, while experts highlighted airway management and oxygenation as the immediate priority. 
These results are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eScenario-based comparison of ChatGPT and expert responses for selected scenarios\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eScenario\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT Response\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eExpert Evaluation\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eConcordance\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eAnaphylaxis\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect diagnosis\u0026thinsp;+\u0026thinsp;epinephrine, O₂, fluids\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSame approach\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFull agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003e\u003cb\u003eMalignant hyperthermia\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect diagnosis\u0026thinsp;+\u0026thinsp;dantrolene\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSame approach\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFull agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eVasovagal syncope\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect diagnosis\u0026thinsp;+\u0026thinsp;fluids, O₂, positioning; atropine missing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eExperts emphasized atropine dosing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePartial agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePostoperative delirium\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEarly antipsychotic recommendation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrioritized opioid reduction, hydration, environment\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePartial agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePulmonary embolism\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect diagnosis\u0026thinsp;+\u0026thinsp;O₂ + heparin\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSame approach\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003eFull agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eAspiration pneumonia\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect diagnosis\u0026thinsp;+\u0026thinsp;early antibiotic suggestion\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAirway management and oxygen prioritized\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePartial agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLaryngeal edema\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eO₂, steroids, nebulized adrenaline, re-intubation if needed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSame approach\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFull agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePostoperative nausea/vomiting\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCorrect diagnosis\u0026thinsp;+\u0026thinsp;ondansetron\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSame approach\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFull agreement\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eScenarios included peri-induction, intraoperative, and postoperative complications. 
Concordance was defined as \u0026ldquo;full\u0026rdquo; if ChatGPT and experts gave identical answers, \u0026ldquo;partial\u0026rdquo; if ChatGPT missed or added secondary steps, and \u0026ldquo;none\u0026rdquo; if the response contradicted expert recommendations.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eO₂\u003c/strong\u003e \u003cp\u003eOxygen\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cb\u003eκ\u003c/b\u003e\u003c/strong\u003e \u003cp\u003eCohen\u0026rsquo;s kappa coefficient\u003c/p\u003e \u003c/p\u003e \u003cp\u003eOverall, ChatGPT demonstrated high diagnostic reliability and substantial concordance with expert evaluations, particularly in life-threatening scenarios requiring urgent intervention. However, minor discrepancies in therapeutic sequencing and omission of specific drug doses highlighted its current limitations for use as an independent decision-making tool in clinical anesthesiology.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe present study evaluated the performance of ChatGPT in simulated anesthesia scenarios by directly comparing its answers with those of two experienced anesthesiologists. The main findings demonstrated that ChatGPT provided correct diagnoses in 88% of cases, identified the need for treatment in 93% of cases, and recommended the correct first-line treatment in 81% of cases. The overall concordance rate was 87%, and inter-rater reliability showed almost perfect agreement between the two experts (κ\u0026thinsp;=\u0026thinsp;0.82) and substantial agreement between ChatGPT and both experts (κ\u0026thinsp;=\u0026thinsp;0.74 and κ\u0026thinsp;=\u0026thinsp;0.71, respectively). 
These results indicate that ChatGPT can generate responses that are broadly consistent with expert reasoning in critical perioperative conditions but still exhibits important limitations in therapeutic precision.\u003c/p\u003e \u003cp\u003e In reviewing the literature, our results align with prior reports that large language models can often generate guideline-based outputs, particularly in acute scenarios with well-established management protocols. For example, in their study, Gilson et al. reported that ChatGPT achieved 60% accuracy on the United States Medical Licensing Examination, indicating that it could replicate a large portion of clinically relevant knowledge (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e). Similarly, in their analysis, Kung et al. found that ChatGPT provided coherent and clinically appropriate responses in internal medicine board questions, though gaps were noted in pharmacology and therapeutic decision-making (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). In anesthesiology specifically, Wan et al. demonstrated that ChatGPT could offer reasonable advice in airway management algorithms but tended to omit critical details such as drug doses and alternative approaches (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). Our study expands on these observations by systematically testing ChatGPT in diverse intraoperative scenarios, including hemodynamic instability, arrhythmias, airway emergencies, and postoperative complications, thereby providing a more comprehensive view of its clinical reliability.\u003c/p\u003e \u003cp\u003eThe numerical results of this study further highlight both the promise and the shortcomings of ChatGPT. The fact that ChatGPT achieved an 88% diagnostic accuracy suggests that its large training corpus enables recognition of common anesthetic patterns such as anaphylaxis, malignant hyperthermia, and pulmonary embolism. 
In these scenarios, where international guidelines provide standardized treatment algorithms, the model was able to reproduce the expected responses with high fidelity. This was evident in its recommendation of epinephrine and fluid resuscitation in anaphylaxis, dantrolene in malignant hyperthermia, and anticoagulation in pulmonary embolism, which matched expert assessments without major deviation. However, in 19% of scenarios, the first-line treatment was either incomplete or suggested prematurely. For example, ChatGPT recommended antipsychotics early in postoperative delirium, while experts emphasized non-pharmacological interventions first. Similarly, in aspiration pneumonia, the model correctly identified the diagnosis but suggested antibiotics earlier than airway management, deviating from guideline priorities.\u003c/p\u003e \u003cp\u003eThese findings echo previous reports on the limitations of LLMs. In their analysis, Rosen et al. observed that ChatGPT often generated plausible but incomplete treatment strategies in emergency medicine vignettes (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). In their study, Pham et al. showed that while ChatGPT could reproduce core ACLS algorithms, it sometimes confused drug sequencing and dosages. Such discrepancies are consistent with our observation that the model is strong in pattern recognition but weaker in therapeutic nuance, particularly when management requires stepwise prioritization rather than simultaneous interventions (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe strength of our study lies in its structured, prospective design with predefined scenarios and standardized evaluation criteria. Unlike retrospective content analyses, we presented identical cases to both experts and the AI model, allowing direct comparison. 
Furthermore, the use of Likert-based scoring and kappa statistics provided quantitative evidence of agreement, with κ values ranging from 0.71 to 0.82 confirming substantial to almost perfect concordance. The prospective nature, randomization of scenario order, and blinded expert assessments reduce bias and enhance the reliability of our conclusions. Additional strengths include the focus on real-life perioperative emergencies, the systematic comparison across multiple domains (diagnosis, treatment necessity, first-line therapy), and the application of robust statistical methods such as kappa reliability analysis.\u003c/p\u003e \u003cp\u003eHowever, our study has several limitations. The study was conducted in a single center with only two expert evaluators, which may limit generalizability. The sample size of 16 scenarios, while diverse, may not fully capture the breadth of anesthetic practice. Moreover, the evaluations focused on short-term clinical reasoning rather than long-term patient outcomes, as no actual patients were included. Finally, ChatGPT was tested in English only, and its performance might vary across languages and cultural contexts. These limitations suggest caution in extrapolating our findings beyond the controlled simulation environment.\u003c/p\u003e \u003cp\u003eThe clinical implications of these results are noteworthy. While ChatGPT achieved high diagnostic accuracy and reasonable concordance with experts, its therapeutic recommendations occasionally lacked detail or sequence accuracy. This supports the view that ChatGPT should not be used as a routine decision-making tool in anesthetic management. Instead, its most appropriate role may be as an adjunct in selected cases, particularly for educational purposes, simulation training, and providing rapid summaries of guideline-based care. 
Similar to how adjunctive hemostatic agents such as FloSeal\u0026reg; or Surgicel\u0026reg; are not required for every partial nephrectomy but may be considered in selected complex cases, ChatGPT may have a role in complementing\u0026mdash;but not replacing\u0026mdash;expert judgment in anesthesia practice. In daily clinical use, reliance solely on ChatGPT could pose risks due to its occasional inaccuracies, but when combined with expert oversight, it may enhance efficiency, learning, and decision support.\u003c/p\u003e \u003cp\u003eFuture research should expand this work to multicenter designs with larger cohorts of anesthesiologists and a broader array of scenarios. Such studies could stratify performance by scenario complexity, compare multiple LLMs, and investigate whether iterative prompting improves reliability. Longitudinal evaluations could also assess whether repeated exposure to ChatGPT enhances resident education or simulation training outcomes. Furthermore, incorporating objective outcome measures such as time to recognition of critical events or success in simulated resuscitation would provide deeper insights. Finally, evaluating the model\u0026rsquo;s integration with electronic health records and its potential for real-time perioperative monitoring represents an important future direction.\u003c/p\u003e \u003cp\u003e In conclusion, this study demonstrated that ChatGPT achieved 88% diagnostic accuracy, 93% recognition of treatment need, and 81% concordance in first-line therapy recommendations across sixteen anesthesia scenarios, with overall agreement of 87% and kappa values between 0.71 and 0.82 indicating substantial reliability. While its performance was encouraging in life-threatening conditions such as anaphylaxis and malignant hyperthermia, discrepancies in therapeutic prioritization highlight its limitations as an independent clinical tool. 
ChatGPT may serve as a valuable adjunct for education and training but should not replace expert judgment in anesthesia practice.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003cstrong\u003eConflict of interest statement:\u003c/strong\u003e \u003cp\u003eThe authors declare no conflicts of interest.\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eTrial registration number\u003c/h2\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eEthics Approval:\u003c/strong\u003e \u003cp\u003e This study was approved by the University of Health Sciences, Erzurum Faculty of Medicine Scientific Research Ethics Committee (Approval Date: 09.07.2025 / Decision No: 2025/07-194), Chairperson Prof. Hasan Kahveci. Informed consent was obtained from all individual participants (expert clinicians) included in the study.\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eConsent for Publication:\u003c/h2\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eCompeting Interests:\u003c/strong\u003e \u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e \u003cp\u003eThe authors declare that no funds, grants, or other support were received during the preparation of this manuscript.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAAK and RI conceived and designed the study. AAK, RI, and HBO were responsible for the data collection. AAK performed the statistical analysis and drafted the manuscript. RI and HBO provided critical revisions. All authors read and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe datasets generated and analyzed during the current study (ChatGPT responses, clinician scores, and the structured evaluation framework) are available from the corresponding author on reasonable request. 
As this study was based on simulated scenarios and did not involve actual patient records, the data can be shared without compromise to patient privacy.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eDu X, Zhou Z, Wang Y et al. Testing and Evaluation of Generative Large Language Models in Electronic Health Record Applications: A Systematic Review. Preprint medRxiv. 2025;2024.08.11.24311828.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu X, Wu C, Lai R, et al. ChatGPT: when the artificial intelligence meets standardized patients in clinical training. J Transl Med. 2023;21(1):447.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCheng T, Li Y, Gu J, et al. The performance of ChatGPT in day surgery and pre-anesthesia risk assessment: a case-control study of 150 simulated patient presentations. Perioper Med (Lond). 2024;13(1):111.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChung P, Fong CT, Walters AM, Aghaeepour N, Yetisgen M, O'Reilly-Shah VN. Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication. JAMA Surg. 2024;159(8):928\u0026ndash;37.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShimada K, Inokuchi R, Ohigashi T, et al. Artificial intelligence-assisted interventions for perioperative anesthetic management: a systematic review and meta-analysis. BMC Anesthesiol. 2024;24(1):306.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKambale M, Jadhav S. Applications of artificial intelligence in anesthesia: A systematic review. Saudi J Anaesth. 2024;18(2):249\u0026ndash;56.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWilk M, Pikiewicz W, Florczak K, Jak\u0026oacute;bczak D. Use of Artificial Intelligence in Difficult Airway Assessment: The Current State of Knowledge. J Clin Med. 
2025;14(5):1602.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKuas C, Canakci ME, Acar N, Kanbakan A, Cetin M, Gunsoy E. The Potential and Pitfalls of ChatGPT in Toxicological Emergencies. J Emerg Med. 2025;76:17\u0026ndash;25.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGeneş M, \u0026Ccedil;elik M. Assessment of ChatGPT's Compliance with ESC-Acute Coronary Syndrome Management Guidelines at 30-Day Intervals. Life (Basel). 2024;14(10):1235.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWei Q, Yao Z, Cui Y, Wei B, Jin Z, Xu X. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inf. 2024;151:104620.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJavid M, Bhandari M, Parameshwari P, Reddiboina M, Prasad S. Evaluation of ChatGPT for Patient Counseling in Kidney Stone Clinic: A Prospective Study. J Endourol. 2024;38(4):377\u0026ndash;83.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKuo FH, Fierstein JL, Tudor BH, et al. Comparing ChatGPT and a Single Anesthesiologist's Responses to Common Patient Questions: An Exploratory Cross-Sectional Survey of a Panel of Anesthesiologists. J Med Syst. 2024;48(1):77.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNoto K, Uchida S, Kinoshita H, Takekawa D, Kushikata T, Hirota K. Predictive model for post-induction hypotension in patients undergoing transcatheter aortic valve implantation: a retrospective observational study. JA Clin Rep. 2024;10(1):33.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhaffari F, Langarizadeh M, Nabovati E, Sabery M. Effectiveness of ChatGPT for Clinical Scenario Generation: A Qualitative Study. Arch Acad Emerg Med. 2025;13(1):e49.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGavrilov SG, Grishenkova AS, Mishakina NY, Krasavin GV. 
Use of a novel Likert scale instrument to assess patient satisfaction following endovascular and surgical treatment of pelvic venous disorders. Phlebology. 2022;37(4):241\u0026ndash;51.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePhelps AS, Naeger DM, Courtier JL, et al. Pairwise comparison versus Likert scale for biomedical image assessment. AJR Am J Roentgenol. 2015;204(1):8\u0026ndash;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWan N, Jin Q, Chan J et al. Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators. Preprint ArXiv. 2025;arXiv:2411.05897v2.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRosen S, Saban M. Evaluating the reliability of ChatGPT as a tool for imaging test referral: a comparative study with a clinical decision support system. Eur Radiol. 2024;34(5):2826\u0026ndash;37.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePham C, Govender R, Tehami S, Chavez S, Adepoju OE, Liaw W. ChatGPT's Performance in Cardiac Arrest and Bradycardia Simulations Using the American Heart Association's Advanced Cardiovascular Life Support Guidelines: Exploratory Study. J Med Internet Res. 
2024;26:e55037.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-anesthesiology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bane","sideBox":"Learn more about [BMC Anesthesiology](http://bmcanesthesiol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bane","title":"BMC Anesthesiology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial intelligence, ChatGPT, anesthesia, clinical decision support, simulation","lastPublishedDoi":"10.21203/rs.3.rs-8384638/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8384638/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground:\u003c/h2\u003e \u003cp\u003eThis study aimed to evaluate the diagnostic accuracy and clinical validity of ChatGPT\u0026rsquo;s responses in standardized anesthesia-related scenarios by directly comparing them with expert anesthesiologists' assessments.\u003c/p\u003e\u003ch2\u003eMethods:\u003c/h2\u003e \u003cp\u003eA prospective comparative study was conducted using sixteen hypothetical clinical scenarios reflecting common and critical perioperative conditions (e.g., anaphylaxis, malignant hyperthermia, pulmonary embolism). 
Two anesthesiologists independently evaluated the scenarios, and their responses were compared with those generated by ChatGPT (OpenAI, San Francisco, USA). A structured framework assessed diagnosis accuracy, treatment appropriateness, and compliance with international guidelines. Ratings were assigned using a 4-point Likert scale. Inter-rater agreement was analyzed using Cohen\u0026rsquo;s kappa and weighted kappa statistics. Descriptive statistics were used for categorical variables, and a p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05 was considered statistically significant.\u003c/p\u003e\u003ch2\u003eResults:\u003c/h2\u003e \u003cp\u003eChatGPT correctly identified the diagnosis in 88% (14/16) of scenarios, recognized treatment necessity in 93% (15/16), and recommended the correct first-line treatment in 81% (13/16), yielding an overall concordance of 87%. Inter-rater reliability between the two experts was almost perfect (κ\u0026thinsp;=\u0026thinsp;0.82). Substantial agreement was observed between ChatGPT and Expert 1 (κ\u0026thinsp;=\u0026thinsp;0.74) and Expert 2 (κ\u0026thinsp;=\u0026thinsp;0.71). ChatGPT performed best in life-threatening emergencies but showed limitations in therapeutic sequencing and drug dosage specification.\u003c/p\u003e\u003ch2\u003eConclusions:\u003c/h2\u003e \u003cp\u003eChatGPT demonstrated substantial agreement with expert anesthesiologists in high-stakes scenarios, suggesting potential as an adjunctive tool for education and simulation. 
However, its current limitations in therapeutic nuance and prioritization indicate that it should not be used as an independent clinical decision-making resource in anesthesia practice.\u003c/p\u003e","manuscriptTitle":"Performance of Chatgpt in Simulated Anesthesia Scenarios: A Prospective Comparison with Expert Clinicians","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-20 20:40:17","doi":"10.21203/rs.3.rs-8384638/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"editorInvitedReview","content":"","date":"2026-04-11T21:43:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"15691340220758537507732759263469353640","date":"2026-03-19T17:22:19+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-17T17:19:53+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-03-05T20:35:15+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-12-20T07:28:10+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-12-20T07:26:58+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Anesthesiology","date":"2025-12-17T10:21:06+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-anesthesiology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bane","sideBox":"Learn more about [BMC Anesthesiology](http://bmcanesthesiol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bane","title":"BMC Anesthesiology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c390b5bd-2e9e-4d3a-af8a-aa6274601e14","owner":[],"postedDate":"March 20th, 
2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-03-20T20:40:18+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-20 20:40:17","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8384638","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8384638","identity":"rs-8384638","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
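The agreement statistics named in the Methods (Cohen's kappa and weighted kappa on a 4-point Likert scale) and the concordance percentages in the Results can be illustrated with a short sketch. This is not the authors' analysis code. The function names `percent` and `cohen_kappa`, the choice of linear weights, and the example ratings are all assumptions made for illustration, because the paper does not state its weighting scheme and its per-scenario ratings are available only on request.

```python
from typing import Sequence


def percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage (no rounding applied)."""
    return 100.0 * hits / total


def cohen_kappa(a: Sequence[int], b: Sequence[int],
                categories: Sequence[int], weighted: bool = False) -> float:
    """Cohen's kappa for two raters over ordered categories.

    With weighted=True, disagreements are penalised by linear weights
    |i - j| / (k - 1); this weighting is an assumption, not taken from the paper.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Contingency table of raw counts: rows = rater a, columns = rater b.
    counts = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        counts[index[x]][index[y]] += 1
    row = [sum(counts[i]) for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    # Observed and chance-expected disagreement, both as proportions.
    observed = sum(weight(i, j) * counts[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(weight(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k)) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Fractions taken from the Results; the values are exact arithmetic.
    print(percent(14, 16))  # 87.5
    print(percent(15, 16))  # 93.75
    print(percent(13, 16))  # 81.25

    # Hypothetical 4-point Likert ratings for 16 scenarios (NOT study data).
    expert = [4, 4, 3, 4, 2, 4, 3, 4, 4, 1, 4, 3, 4, 4, 2, 4]
    model = [4, 4, 3, 3, 2, 4, 3, 4, 4, 2, 4, 3, 4, 3, 2, 4]
    print(cohen_kappa(expert, model, [1, 2, 3, 4]))
    print(cohen_kappa(expert, model, [1, 2, 3, 4], weighted=True))
```

The unrounded fractions are 87.5%, 93.75%, and 81.25%; the paper reports them as 88%, 93%, and 81%. Weighted kappa gives partial credit to near-misses on an ordinal scale, which is presumably why it was reported alongside the unweighted statistic for Likert ratings.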

