Beyond Text: The Impact of Clinical Context on GPT-4’s 12-lead ECG Interpretation Accuracy

Ante Lisicic, Ana Jordan, Ana Serman, Ivana Jurin, Andrej Novak, Nikola Pavlovic, Sime Manola, Ivan Zeljkovic

Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4047752/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Introduction
Artificial intelligence (AI) and large language models (LLMs), such as OpenAI's Chat Generative Pre-trained Transformer – version 4 (GPT-4), are being increasingly explored for medical applications, including clinical decision support. The introduction of the capability to analyze graphical inputs marks a significant advancement in the functionality of GPT-4. Despite the promising potential of AI in enhancing diagnostic accuracy, the effectiveness of GPT-4 in interpreting complex 12-lead electrocardiograms (ECGs) remains to be assessed.
Methods
This study utilized GPT-4 to interpret 150 12-lead ECGs from the Cardiology Research Dubrava (CaRD) registry, spanning a wide range of cardiac pathologies. The ECGs were classified into four categories for analysis: Arrhythmias (Category 1), Conduction System abnormalities (Category 2), Acute Coronary Syndrome (Category 3), and Other (Category 4). Two experiments were conducted: one in which GPT-4 interpreted ECGs without clinical context and another with added clinical scenarios. A panel of experienced cardiologists evaluated the accuracy of GPT-4's interpretations. The Shapiro-Wilk test was used to assess distributions, the Mann-Whitney U test to compare continuous variables, and the Chi-square or Fisher's exact test to compare categorical variables.

Results
In this cross-sectional, observational study, GPT-4 demonstrated a correct interpretation rate of 19% without clinical context and a significantly improved rate of 45% with context (p < 0.001). The addition of clinical scenarios significantly enhanced interpretative accuracy, particularly in Category 3 (Acute Coronary Syndrome) (10 vs. 70%, p < 0.001). Category 4 (Other) showed no impact of context (51 vs. 59%, p = 0.640), whereas a trend toward significance was observed in Category 1 (Arrhythmias) (9.7 vs. 32%, p = 0.059) and Category 2 (Conduction System abnormalities) (4.8 vs. 19%, p = 0.088).

Conclusion
While GPT-4 shows some potential in aiding ECG interpretation, its effectiveness varies significantly depending on the presence of clinical context. The study suggests that, in its current form, GPT-4 alone may not suffice for accurate ECG interpretation across a broad spectrum of cardiac conditions.
Subject areas: Health sciences/Cardiology; Health sciences/Medical research
Keywords: GPT-4; 12-Lead Electrocardiogram Interpretation; Large Language Models; AI in Healthcare; Diagnostic Accuracy

Introduction
Large language models (LLMs) are artificial intelligence (AI) systems developed using machine learning techniques such as deep learning and natural language processing, and they are increasingly being applied in healthcare and clinical settings (1–4). These models, trained on extensive datasets, excel at understanding and generating human-like text, demonstrating significant potential for interpreting complex medical data and enhancing clinical decision-making processes (5–7). A notable LLM, OpenAI’s Chat Generative Pre-trained Transformer – version 4 (GPT-4) (https://chat.openai.com), was introduced on March 14, 2023. GPT has shown promising performance on the United States Medical Licensing Examination (USMLE), achieving scores at or above the passing threshold for all three sections (8). In another study, GPT-4's performance on USMLE soft skills, focusing on interpersonal skills, professionalism, legal and ethical issues, cultural competence, and organizational behavior, exceeded that of AMBOSS's past users (9). This led to further evaluations of GPT's clinical reasoning across various levels of complexity through standardized questions (10, 11). GPT’s performance on the American Heart Association’s Basic Life Support (BLS) and Advanced Cardiovascular Life Support (ACLS) exams was also tested (12). Although it did not initially reach the passing threshold, GPT provided, on average, more relevant, accurate, and guideline-aligned responses than other AI systems, and it included reasoned explanations. Additionally, GPT's accuracy in responding to the Ophthalmic Knowledge Assessment Program (OKAP) exam was notable, particularly in general ophthalmology, though performance varied across ophthalmic subspecialties (13).
A recent study evaluated GPT-3.5 and GPT-4 on the Polish Medical Final Examination (MFE) in English and Polish, with GPT-4 achieving a mean accuracy of 79.7% and passing all MFE versions (14). However, GPT-4 generally scored below the medical students' average, and answer accuracy correlated with question difficulty. These findings underscore GPT's potential and variability in medical knowledge application, partially motivating this study.

Late in 2023, GPT-4's capabilities for generating and interpreting graphical data were introduced. We aim to explore the model's effectiveness in interpreting image data, both with and without supplementary textual information, to understand the extent of the LLM's diagnostic accuracy in a medical context. Given the recent introduction of this capability, research on this topic is scarce. A few studies have begun exploring GPT-4's utility in answering medical questions related to various imaging modalities, albeit relying on textual interpretations as inputs (15–17). In a study by Barash et al., the authors investigated GPT-4's potential to improve radiology referrals in the emergency department (ED) by selecting appropriate imaging exams and generating referrals (16). GPT-4's recommendations closely aligned with the ACR Appropriateness Criteria and actual ED practices, earning high marks for referral clarity, clinical relevance, and differential diagnosis. However, the inputs to the LLM were textual, based on clinical notes from the ED. For a broader overview of AI applications in clinical cardiology, we refer to Rajpurkar et al. and Mesko et al. (15, 18).

The correct interpretation of 12-lead electrocardiograms (ECGs) can be challenging, and reports suggest that cardiologists routinely identify between 50% and 95% of the abnormalities in ECGs (19, 20).
AI-powered ECG interpretation has shown promising results in improving the detection of arrhythmias, ST-segment changes, QT prolongation, and other ECG abnormalities (15). However, it has yet to be determined whether GPT-4's ECG interpretation capabilities are comparable to those of experienced cardiologists. Thus, the aim of this study was to evaluate GPT-4's ability to interpret 12-lead ECGs based solely on image data in the first experiment and on ECG images augmented by a realistic clinical scenario in the second experiment.

Methods
We specifically tasked GPT-4 with interpreting 12-lead ECGs for research purposes, utilizing text-based chat interfaces provided by OpenAI in January 2024. The study included 12-lead ECGs from patients enrolled in the prospective Cardiology Research Dubrava (CaRD) registry (NCT06090591), which contains a heterogeneous group of cardiac pathologies. The ECG recordings were uploaded to the model as digital images (obtained with the iPhone 13 Pro, Apple, USA) in the PNG format (Fig. 1). According to the pathology presented, ECGs were classified into four categories: Category One - Arrhythmias (31 ECGs), primarily atrial fibrillation; Category Two - Conduction System abnormalities and/or pacemaker rhythms (42 ECGs); Category Three - Acute Coronary Syndrome (40 ECGs); and Category Four - Other (37 ECGs), mostly including ECGs with normal sinus rhythm with or without ventricular preexcitation, premature supra/ventricular beats, etc.

We performed two experiments. In the first (EXP1), we tasked GPT-4 with interpreting 12-lead ECGs without any clinical context. In the second (EXP2), in addition to the patient's 12-lead ECG, GPT-4 was provided with a realistic clinical scenario and assessment that considered the quality, severity, and duration of symptoms, in order to assess the accuracy of interpretation (20).
A panel of four experienced cardiologists reviewed GPT-4's ECG interpretations. The primary endpoint was the correctness of the most likely diagnosis provided by GPT-4. The secondary endpoint was a score of 0 to 7 points, with one point awarded for each correctly reported item among rhythm, axis, P-wave, PR interval, QRS complex, ST segment, and T wave. We utilized the text-based chat interfaces provided by OpenAI for our inquiries and data collection. For our research, we kept the model's temperature at its default value of 1; it is adjustable within a range of 0 to 2 in the OpenAI.com Playground interface. This study used de-identified data, with all participants providing written informed consent for inclusion in the CaRD registry. The hospital’s Ethics Committee approved the study, which was conducted in accordance with the Declaration of Helsinki. Patients or the public were not involved in the study’s design, conduct, reporting, or dissemination.

Statistical analysis
The distribution of variables was assessed using the Shapiro-Wilk test. Continuous variables are reported as mean ± standard deviation when the distribution conformed to normality, and as median with interquartile range (IQR) when the distribution was non-normal. For comparing continuous variables, the Mann-Whitney U test was employed as a non-parametric test for independent samples, due to the presence of non-normal distributions in our dataset. Categorical variables are presented as counts and percentages. Comparisons of categorical variables were conducted using the Chi-square test, or Fisher's exact test for smaller samples. A pre-defined threshold of p < 0.05 was established to denote statistical significance. Statistical analyses were performed using Python's scientific libraries.
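The testing sequence for continuous variables described above can be sketched in Python, the language the authors report using. This is only an illustrative sketch under stated assumptions: SciPy is assumed as the scientific library, the helper name `compare_scores` and its `alpha` default are hypothetical, and the two short score lists are made-up placeholders rather than study data.

```python
from scipy import stats


def compare_scores(group_a, group_b, alpha=0.05):
    """Shapiro-Wilk normality check, then a two-sided Mann-Whitney U test.

    Hypothetical helper for illustration; not the authors' actual code.
    """
    # Shapiro-Wilk: a p-value below alpha suggests a departure from normality.
    _, p_norm_a = stats.shapiro(group_a)
    _, p_norm_b = stats.shapiro(group_b)
    both_normal = bool(p_norm_a >= alpha and p_norm_b >= alpha)

    # Mann-Whitney U: non-parametric comparison of two independent samples.
    u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    return {"both_normal": both_normal, "u": float(u_stat), "p": float(p_value)}


# Made-up 0-7 point scores, used only to show the call pattern.
exp1_scores = [4, 5, 3, 6, 4, 5, 7, 2, 4, 5]
exp2_scores = [4, 5, 4, 6, 3, 5, 6, 2, 5, 4]
result = compare_scores(exp1_scores, exp2_scores)
print(result)
```

The `both_normal` flag would only guide how a variable is summarized (mean ± standard deviation versus median with IQR); the comparison itself stays non-parametric, as in the text.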
Results
This cross-sectional, observational study included 150 ECGs with various pathologies, divided into the four main categories described above. Each ECG was uploaded twice to GPT-4 through text-based chat interfaces for interpretation, once without (EXP1) and once with an accompanying clinical scenario (EXP2), using separate threads to ensure that the experiments remained independent. The model provided information on rate, rhythm, axis, P-wave, PR interval, QRS complex, ST segment and T wave, the “most likely diagnosis”, differential diagnosis, and further diagnostic steps.

When tasked with interpreting ECGs without any relevant clinical symptoms described (EXP1), GPT-4 correctly identified the “most likely diagnosis” in 19% (28/150) of cases and scored 4.35 out of 7 points for information on rhythm, axis, P-wave, PR interval, QRS complex, ST segment and T wave. GPT-4 performed best when interpreting normal ECGs with sinus rhythm (Category 4 - Other), achieving an accuracy rate of 51% and 5.64 points. This was followed by accuracy rates of 10%, 9.7% and 4.8% for ECGs presenting with acute ischemic changes (Category 3), arrhythmias (Category 1), and conduction abnormalities or pacing rhythms (Category 2), respectively (Fig. 2).

When tasked with ECG interpretation alongside clinical scenarios (EXP2), GPT-4 correctly identified the “most likely diagnosis” in 45% (68/150) of cases and achieved 4.36 out of 7 points for additional information. GPT-4 performed best in interpreting ECGs with acute ischemic changes (Category 3), attaining a 70% accuracy rate. This was followed by accuracy rates of 59% for ECGs primarily exhibiting sinus rhythm (Category 4 - Other), 32% for arrhythmias (Category 1), and 19% for conduction abnormalities or pacing rhythms (Category 2).
However, in terms of additional information, GPT-4 achieved the best results in Category 4 - Other (5.46/7 points), followed by 4.75, 3.70, and 3.51 for ECGs with arrhythmias (Category 1), acute ischemic changes (Category 3), and conduction abnormalities or pacing rhythms (Category 2), respectively.

The Chi-squared test indicated a statistically significant difference in the accuracy of the “most likely diagnosis” between EXP1 and EXP2 (19 vs. 45%, p < 0.001). In Category 1 (Arrhythmias), correct interpretations increased from 3 in EXP1 to 10 in EXP2 (when the context was provided), with a trend toward significance (9.7 vs. 32%, p = 0.059). Category 2 (Conduction abnormalities or pacing rhythms) mirrored the trend observed in Category 1, with correct interpretations rising from 2 in EXP1 to 8 in EXP2, albeit without reaching statistical significance (4.8 vs. 19%, p = 0.088). Category 3 (Acute Coronary Syndrome) showed the largest improvement, with correct interpretations rising from 4 in EXP1 to 28 in EXP2 (10 vs. 70%, p < 0.001). In Category 4 (Other), correct interpretations rose from 19 in EXP1 to 22 in EXP2, but the difference was not statistically significant, indicating no significant impact of additional context on interpretation accuracy within this category (51 vs. 59%, p = 0.640).

In the analysis of the secondary endpoint, which compared the average expert-panel scores for GPT-4's reporting of rhythm, axis, P-wave, PR interval, QRS complex, ST segment, and T wave, no statistically significant difference was observed between EXP1 and EXP2 in the overall averages (p = 0.684) (Fig. 3). Similarly, no statistically significant differences were detected within any of the four categories: Category 1 (p = 0.935), Category 2 (p = 0.978), Category 3 (p = 0.155), and Category 4 (p = 0.706).
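The categorical comparisons reported above can be reproduced approximately from the stated counts. The sketch below assumes SciPy and treats each comparison as a 2x2 table (correct versus incorrect, EXP1 versus EXP2); the text does not say which test was applied to which category, so both p-values are computed for every row. The dictionary layout and the helper name `compare_accuracy` are hypothetical.

```python
from scipy.stats import chi2_contingency, fisher_exact

# (correct in EXP1, correct in EXP2, number of ECGs), as reported in the text.
counts = {
    "Overall": (28, 68, 150),
    "Category 1 (Arrhythmias)": (3, 10, 31),
    "Category 2 (Conduction)": (2, 8, 42),
    "Category 3 (ACS)": (4, 28, 40),
    "Category 4 (Other)": (19, 22, 37),
}


def compare_accuracy(correct_exp1, correct_exp2, n):
    """Return (chi-square p, Fisher exact p) for one 2x2 table.

    Hypothetical helper for illustration; not the authors' actual code.
    """
    table = [
        [correct_exp1, n - correct_exp1],
        [correct_exp2, n - correct_exp2],
    ]
    # chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
    _, p_chi2, _, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table, alternative="two-sided")
    return float(p_chi2), float(p_fisher)


p_values = {name: compare_accuracy(*row) for name, row in counts.items()}
for name, (p_chi2, p_fisher) in p_values.items():
    exp1, exp2, n = counts[name]
    print(f"{name}: {exp1 / n:.1%} vs. {exp2 / n:.1%}, "
          f"chi-square p = {p_chi2:.3g}, Fisher p = {p_fisher:.3g}")
```

Under these assumptions, the Fisher p-values for Categories 1 and 2 and the continuity-corrected chi-square p-value for Category 4 appear to agree with the reported 0.059, 0.088, and 0.640. Both tests treat the two experiments as independent samples, even though the same 150 ECGs were used in each.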
Regarding the additional information on further diagnostic step(s) after ECG interpretation, GPT-4 mostly gave uniform recommendations, which included the note that its interpretations must be validated with clinical correlation by a healthcare professional.

Discussion
To the best of our knowledge, this is the first study evaluating the value of GPT-4 in 12-lead ECG interpretation, particularly considering that the computational framework to handle image data was introduced late in 2023. The main findings of this cross-sectional, observational study including 150 ECGs with various cardiac abnormalities are the following: 1) GPT-4 answered correctly for the "most likely diagnosis" in only 19% of cases when no relevant clinical symptoms were described; 2) GPT-4 performed significantly better when accompanying clinical symptoms were provided, achieving a 45% accuracy rate; 3) without clinical scenarios, GPT-4 performed best when interpreting ECGs with normal sinus rhythm (51%), whereas with relevant symptoms it performed best when interpreting ECGs with acute ischemic changes (70% accuracy rate); 4) GPT-4 mostly gave uniform recommendations regarding further diagnostic step(s), regardless of the underlying pathology on the ECG. These results highlight the substantial impact of incorporating contextual information into the ECG interpretation process, particularly noted in the overall accuracy improvement from EXP1 to EXP2 and the marked improvement observed within Category 3 – Acute Coronary Syndrome. Moreover, this indicates that certain types of ECG recording, possibly those with more subtle or complex patterns, benefit more from the addition of contextual information.
While GPT-4 demonstrated some ability to improve diagnostic accuracy with the inclusion of clinical scenarios in our study, its performance remained variable across different types of cardiac pathologies, which is only partially in line with recent similar studies. Namely, GPT has recently demonstrated remarkable success in non-medical and medical tests such as the MBBS, USMLE Step 1 and Step 2 examinations, and the Dutch Family Medicine Examination (15, 21–25). These tests primarily involved written questions, in which imaging or graphical data supporting the vignette were interpreted using a standard format and terminology. Since the introduction of image data handling capabilities in GPT-4, research in this area has remained scarce. For instance, Massey et al. revealed that GPT versions 3.5 Turbo and 4.0 exhibited greater accuracy in text-based queries, similar to our findings, especially in orthopedic assessment examinations (17–19). Our research further indicates that GPT-4's accuracy more than doubles when provided with descriptions of clinical symptoms, leading us to question whether the "most likely diagnosis" is influenced more by the interpretation of ECG images or derived from the textual descriptions of relevant clinical symptoms. Despite the observed improvement in diagnostic accuracy after introducing textual contextual data (EXP2), the performance of GPT-4 in the secondary endpoint evaluations of rhythm, axis, P-wave, PR interval, QRS complex, ST segment, and T wave, as assessed by the expert panel, remained essentially unchanged regardless of the additional contextual information provided. This suggests that the enhancement in accuracy may be attributed to the provided textual information and underscores the importance of contextual information in enhancing AI's interpretative accuracy (26), potentially suggesting that an LLM’s utility in clinical diagnostics could be maximized when combined with detailed patient histories and clinical presentations.
Additionally, Currie et al. (27) explored the capabilities of the earlier GPT-3.5 model in the context of medical imaging higher education, while also pointing out its limitations in deep and domain-specific knowledge. GPT showed proficiency in foundational subjects, demonstrating its potential in disseminating general medical knowledge. The limitations of GPT become evident in specialized tasks requiring in-depth analysis, such as ECG interpretation. Although that study was conducted using textual data, its observations are consistent with our findings. Echoing Currie et al.'s observations, our study also noted GPT-4's variable accuracy in diagnosing complex cardiac conditions as of January 2024, highlighting the critical role of contextual knowledge and expertise in medical diagnostics (27).

Finally, in comparison to highly specialized and research-oriented deep learning tools, our results show lower accuracy. When developed to classify twelve rhythm classes using large numbers of single-lead ECGs, a deep neural network achieved an average area under the receiver operating characteristic curve (AUC) of 0.97 and an average F1 score, which is the harmonic mean of the positive predictive value and sensitivity, of 0.837, exceeding that of average cardiologists (0.780) (28). However, that study used single-lead ECGs, and the model was purpose-built and trained on a large number of ECGs. The lower accuracy in our study could also be due to inadequate modelling of GPT-4 by its engineers and a lack of specific literature on 12-lead ECGs and ECG images in the data used to train it. It could also reflect the complexity of the 12-lead ECG, even when accompanied by relevant clinical symptoms. Our findings suggest there might be a significant gap in the AI’s performance when it comes to interpreting image-based medical data as opposed to text-based data.

Limitations
The results of the present study should be interpreted in the light of several limitations.
First, the ECGs were uploaded to the model as digital images (obtained with the iPhone 13 Pro, Apple, USA) and not through a dedicated application, which could have affected the results. The use of digital images, while reflecting a real-life clinical scenario, may introduce variability in image quality and interpretation accuracy. Second, although 150 ECGs were included in the study, this number might still be relatively small for machine learning models to generalize across the wide spectrum of cardiac pathologies. Additionally, the diversity of the sample in terms of patient demographics (age, sex, ethnicity) and the range of cardiac conditions could significantly affect the model's performance. A more heterogeneous sample might provide insights into the model's applicability to a broader patient population. Third, the practicality of integrating GPT-4 into the clinical workflow for ECG interpretation was not assessed. Real-world applicability depends on factors such as ease of use, time efficiency, and compatibility with existing healthcare IT systems. Fourth, the reliance on GPT-4, a proprietary model, introduces considerations regarding accessibility and equity in AI-assisted diagnostics. The cost associated with using such advanced AI tools may limit their availability to healthcare providers in resource-limited settings. Fifth, the study provides a snapshot of GPT-4's performance during a single period, January 2024. AI models, particularly those that continue to learn from new data, may experience shifts in performance over time. Lastly, the study does not address the ethical and legal implications of using AI for diagnostic purposes, such as patient consent, data privacy, and liability in case of diagnostic errors.

Conclusion
This is the first study evaluating the value of GPT-4 in 12-lead ECG interpretation.
GPT-4 exhibited low accuracy in interpreting different categories of pathology in 12-lead ECGs, yielding an accuracy rate below 20% when tasked without clinical scenarios and a significantly better rate (45%) when relevant clinical symptoms were provided. This suggests that the enhancement in accuracy may be attributed to the provided textual information and underscores the importance of contextual information in enhancing AI's interpretative accuracy. While the use of AI in medical diagnostics is an appealing concept in theory, the results suggest that GPT-4, in its current form, would not provide significant aid for ECG interpretation in clinical settings, especially if not provided with clinical symptoms. Continued evaluation and strategies to improve GPT's accuracy in ECG interpretation remain crucial.

Declarations

Author Contribution
A.L.: Conceptualization, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing; A.J.: Data curation, Methodology, Visualization, Writing – original draft, Writing – review & editing; A.Š.: Data curation, Formal analysis, Methodology, Visualization, Writing – original draft; I.J.: Data curation, Formal analysis, Methodology, Visualization, Writing – original draft; A.N.: Conceptualization, Formal analysis, Methodology, Software, Writing – review & editing; N.P.: Data curation, Methodology, Supervision, Writing – original draft, Writing – review & editing; Š.M.: Data curation, Methodology, Supervision, Writing – original draft, Writing – review & editing; I.Z.: Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing.

Acknowledgments
We gratefully acknowledge the Luxembourg School of Business for providing the technological support for this research. Additionally, the authors would like to acknowledge the contribution of the COST Action CA21169 (DYNALIFE), supported by COST (European Cooperation in Science and Technology).
Data availability statement
Data supporting this study are available upon reasonable request from the corresponding author.

References
1. Singhal K, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
2. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6.
3. Sezgin E, Sirrianni J, Linwood SL. Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a Service Model. JMIR Med Inform. 2022;10:e32875.
4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.
5. Gala D, Makaryus AN. The Utility of Language Models in Cardiology: A Narrative Review of the Benefits and Concerns of ChatGPT-4. Int J Environ Res Public Health. 2023;20(15):6438.
6. Nov O, Singh N, Mann D. Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study. JMIR Med Educ. 2023;9:e46939.
7. Lim ZW, et al. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770.
8. Kung TH, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
9. Brin D, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492.
10. Novak A, et al. The Pulse of Artificial Intelligence in Cardiology: A Comprehensive Evaluation of State-of-the-art Large Language Models for Potential Use in Clinical Cardiology. medRxiv. 2023:2023-08. (preprint)
11. Moons P, Van Bulck L. ChatGPT: can artificial intelligence language models be of value for cardiovascular nurses and allied health professionals. Eur J Cardiovasc Nurs. 2023;22(7):e55–e59.
12. Fijačko N, Gosak L, Štiglic G, Picard CT, John Douma M. Can ChatGPT pass the life support exams without entering the American heart association course? Resuscitation. 2023;185:109732.
13. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol Sci. 2023;3(4):100324.
14. Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep. 2023;13(1):20512.
15. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31–8.
16. Barash Y, Klang E, Konen E, Sorin V. ChatGPT-4 Assistance in Optimizing Emergency Department Radiology Referrals and Imaging Selection. J Am Coll Radiol. 2023;20(10):998–1003.
17. Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023;31(23):1173–9.
18. Meskó B, Görög M. A short guide for medical professionals in the era of artificial intelligence. NPJ Digit Med. 2020;3:126.
19. Cairns A, et al. A computer human interaction model to improve the diagnostic accuracy and clinical decision making during 12-lead electrocardiogram interpretation. J Biomed Inform. 2016;64:93–107.
20. Birnbaum Y, et al. The role of the ECG in diagnosis, risk estimation, and catheterization laboratory activation in patients with acute coronary syndromes: a consensus document. Ann Noninvasive Electrocardiol. 2014;19(5):412–25.
21. Katz DM, Bommarito MJ, Gao S, Arredondo P. GPT-4 Passes the bar exam. Social Science Research Network. 2023.
22. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB 3rd. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination. JB JS Open Access. 2023;8(3):e23.00056.
23. Gilson A, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
24. Subramani M, Jaleel I, Krishna Mohan S. Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Adv Physiol Educ. 2023;47(2):270–1.
25. Morreel S, Mathysen D, Verhoeven V. Aye, AI! ChatGPT passes multiple-choice family medicine exam. Med Teach. 2023;45(6):665–6.
26. Chitale PA, Gala J, Dabre R. An Empirical Analysis of In-context Learning Abilities of LLMs for MT. arXiv:2401.12097. 2024. (preprint)
27. Currie G, Singh C, Nelson T, Nabasenja C, Al-Hayek Y, Spuur K. ChatGPT in medical imaging higher education. Radiography. 2023;29(4):792–9.
28. Hannun AY, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65–9.

Additional Declarations
No competing interests reported.
Author affiliations
Dubrava University Hospital: Ante Lisicic, Ana Jordan, Ivana Jurin, Nikola Pavlovic, Sime Manola, Ivan Zeljkovic (corresponding author)
University of Zagreb: Ana Serman, Andrej Novak

Research Square record rs-4047752, created 2024-03-08.

Figure 1. An example of a 12-lead ECG recording selected from the Cardiology Research Dubrava (CaRD) registry, depicting one of the 150 cases of varied cardiac pathologies used in this study. This image pertains to Category 3, presenting acute ST segment elevation in inferior leads and ST denivelation in V1, V2, D1 and aVL leads. The image was obtained using an iPhone 13 Pro (Apple, USA) and is presented in PNG format.

Figure 2. Counts of Correct Answers by Category for Experiment 1 (without clinical context) and Experiment 2 (with clinical context). EXP1 – Experiment 1; EXP2 – Experiment 2; Outcome 0 – incorrect answer by GPT-4; Outcome 1 – correct answer by GPT-4.

Figure 3. The figure presents a side-by-side violin plot comparison of the distribution of scores given by the expert panel without and with clinical context. Experiment 1 (without context) is represented by the orange violins, while Experiment 2 (with context) is depicted in blue.
Each violin plot illustrates the density of the data at different values, with wider sections indicating a higher frequency of data points at those values.\u003c/p\u003e","description":"","filename":"Figure3.png","url":"https://assets-eu.researchsquare.com/files/rs-4047752/v1/5064101f01968ec636ddaa7f.png"},{"id":54673503,"identity":"b736328a-3efa-451b-9e5c-efdccc69d1bd","added_by":"auto","created_at":"2024-04-15 05:52:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":641378,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4047752/v1/71c2e2c9-84c1-4819-9c0a-e714508d181a.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Beyond Text: The Impact of Clinical Context on GPT-4’s 12-lead ECG Interpretation Accuracy","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLarge language models (LLMs) are artificial intelligence (AI) systems developed using machine learning techniques such as deep learning and natural language processing and are increasingly being applied in healthcare and clinical settings (\u003cspan additionalcitationids=\"CR2 CR3\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e). 
These models, trained on extensive datasets, excel at understanding and generating human-like text, demonstrating significant potential for interpreting complex medical data and enhancing clinical decision-making processes (\u003cspan additionalcitationids=\"CR6\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eA notable LLM, OpenAI\u0026rsquo;s Chat Generative Pre-trained Transformer \u0026ndash; version 4 (GPT-4) (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://chat.openai.com\u003c/span\u003e\u003cspan address=\"https://chat.openai.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), was introduced on March 14, 2023. GPT has shown promising performance on the United States Medical Licensing Examination (USMLE), achieving scores at or above the passing threshold for all three sections (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). In another study, GPT-4's performance on USMLE soft skills, focusing on interpersonal skills, professionalism, legal and ethical issues, cultural competence, and organizational behavior, exceeded that of AMBOSS's past users (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e). This led to further evaluations of GPT's clinical reasoning across various levels of complexity through standardized questions (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). GPT\u0026rsquo;s comprehension of the American Heart Association\u0026rsquo;s Basic Life Support (BLS) and Advanced Cardiovascular Life Support (ACLS) exam was also tested (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). 
Despite not reaching the passing threshold initially, GPT provided on average very relevant, accurate, and guideline-aligned responses than other AI systems, including reasoned explanations. Additionally, GPT's accuracy in responding to the Ophthalmic Knowledge Assessment Program (OKAP) exam was notable, particularly in general ophthalmology, though performance varied across ophthalmic subspecialties (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). A recent study evaluated GPT-3.5 and GPT-4 on the Polish Medical Final Examination (MFE) in English and Polish, with GPT-4 achieving mean accuracies of 79.7% and passing all MFE versions (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). However, GPT-4 generally scored below the medical students' average, revealing a correlation between answer accuracy and question difficulty. These findings underscore GPT's potential and variability in medical knowledge application, partially motivating this study.\u003c/p\u003e \u003cp\u003eLate in 2023, GPT-4's capabilities for generating and interpreting graphical data were introduced. We aim to explore the model's effectiveness in interpreting image data, both with and without supplementary textual information, to understand the extent of the LLM's diagnostic accuracy in a medical context. Given the recent introduction of this capability, research on this topic is, at best, scarce. A few studies have begun exploring GPT-4's utility in answering medical questions related to various imaging modalities, albeit relying on textual interpretations as inputs (\u003cspan additionalcitationids=\"CR16\" citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e). 
In study by Barash et al., authors investigate GPT-4's potential to improve radiology referrals in the emergency department (ED) by selecting appropriate imaging exams and generating referrals (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). GPT-4's recommendations closely aligned with the ACR Appropriateness Criteria and actual ED practices, earning high marks for referral clarity, clinical relevance, and differential diagnosis. However, it's important to note that the inputs to the LLM were textual, based on clinical notes from the ED. For a broader overview of AI applications in clinical cardiology, we refer to Rajpurkar et al. and Mesko et al. (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe correct interpretation of 12-lead electrocardiograms (ECGs) can sometimes pose challenges, and reports suggest that cardiologists routinely identify between 50% and 95% of the abnormalities in ECGs (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). AI-powered ECG interpretation has shown promising results in improving the detection of arrhythmias, ST-segment changes, QT prolongation, and other ECG abnormalities (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). However, it has yet to be determined whether GPT-4's ECG interpretation capabilities are comparable to those of experienced cardiologists. 
Thus, the aim of this study was to evaluate GPT-4's ability to interpret 12-lead ECGs solely based on image data in the first experiment and the combination of ECG images augmented by a realistic clinical scenario in the subsequent experiment.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e We specifically tasked GPT-4 with interpreting 12-lead ECGs for research purposes, utilizing text-based chat interfaces provided by OpenAI in January 2024. The study included 12-lead ECGs from patients included in the prospective Cardiology Research Dubrava (CaRD) registry (NCT06090591), containing a heterogeneous group of cardiac pathologies. The ECG recordings were uploaded to the model as digital images (obtained with the iPhone 13 Pro, Apple, USA) in the PNG format (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAccording to the pathology presented, ECGs were classified into four categories: Category One - Arrhythmias (31 ECGs), primarily atrial fibrillation; Category Two - Conduction System abnormalities and/or pacemaker rhythms, including 42 ECGs; Category Three - Acute Coronary Syndrome (40 ECGs); and Category Four - Other (37 ECGs), mostly including ECGs with normal sinus rhythm with or without ventricular preexcitation, premature supra/ventricular beats, etc. We performed two experiments. In first (EXP1), we tasked GPT-4 with interpreting 12-lead ECGs without any clinical context. 
In second (EXP2), in addition to patients\u0026rsquo; 12-lead ECG, GPT-4 was provided with realistic clinical scenarios and assessments that considered the quality, severity, and duration of symptoms to assess the accuracy of interpretation (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eA panel of four experienced cardiologists reviewed GPT-4's ECG interpretations, with the primary endpoint being to determine the correctness of the most likely diagnosis provided by the GPT-4. The secondary endpoint involved binary grading of its performance on a scale of 0 to 7 points based on the provided information on rhythm, axis, P-wave, PR interval, QRS complex, ST segment, and T wave (one point for each correct information). We utilized the text-based chat interfaces provided by OpenAI for our inquiries and data collection. For our research, we kept the model's default temperature set to 1, adjustable within a range of 0 to 2 in the OpenAI.com Playground interface.\u003c/p\u003e \u003cp\u003eThis study used de-identified data, with all participants providing written informed consent for inclusion in the CaRD registry. The hospital\u0026rsquo;s Ethics Committee approved the study, conducted in accordance with the Declaration of Helsinki. Patients or the public were not involved in the study\u0026rsquo;s design, conduct, reporting, or dissemination.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eThe distribution of variables was assessed using the Shapiro-Wilk test. Continuous variables are reported as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation when the distribution conformed to normality, and as median with interquartile range (IQR) when the distribution was non-normal. 
For comparing continuous variables, the Mann-Whitney U test was employed as a non-parametric test for independent samples, due to the presence of non-normal distributions in our dataset. Categorical variables are presented as counts and percentages. Comparisons of discrete variables were conducted using the Chi-square test and Fisher's exact test for samples of smaller size. A pre-defined p-value threshold of \u0026lt;\u0026thinsp;0.05 was established to denote statistical significance. Statistical analyses were performed using Python's scientific libraries.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eThis cross-sectional, observational study included 150 ECGs with various pathologies, divided into the previously explained four main categories, and uploaded twice to GPT-4 utilizing text-based chat interfaces for interpretation, without (EXP1) and with accompanying clinical scenarios (EPX2), using separate threads to ensure that the experiments remained independent.\u003c/p\u003e \u003cp\u003eThe model provided information on rate, rhythm, axis, P-wave, PR interval, QRS complex, ST segment and T wave, \u0026ldquo;\u003cem\u003emost likely diagnosis\u003c/em\u003e\u0026rdquo;, differential diagnosis, and further diagnostic steps.\u003c/p\u003e \u003cp\u003eWhen tasked with interpreting ECGs without any relevant clinical symptoms described (EXP1), GPT-4 correctly identified the \u0026ldquo;\u003cem\u003emost likely diagnosis\u003c/em\u003e\u0026rdquo; in 19% (28/150) of cases and scored 4.35 out of 7 points for information on rhythm, axis, P-wave, PR interval, QRS complex, ST segment and T wave. GPT-4 performed best when interpreting normal ECGs with sinus rhythm (Category 4 - Other), achieving an accuracy rate of 51% and 5.64 points. 
This was followed by accuracy rates of 10%, 9.7% and 4.8% for ECGs presenting with acute ischemic changes (Category 3), arrhythmias (Category 1), and conduction abnormalities or pacing rhythms (Category 2), respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWhen tasked with ECG interpretation alongside clinical scenarios (EXP2), GPT-4 correctly identified the \u0026ldquo;\u003cem\u003emost likely diagnosis\u003c/em\u003e\u0026rdquo; in 45% (68/150) of cases and achieved 4.36 out of 7 points for additional information. GPT-4 performed best in interpreting ECGs with acute ischemic changes (Category 3), attaining a 70% accuracy rate. This was followed by accuracy rates of 59% for ECGs primarily exhibiting sinus rhythm (Category 4 - Other), 32% for arrhythmias (Category 1), and 19% for conduction abnormalities or pacing rhythms (Category 2), respectively. However, in terms of additional information, GPT-4 achieved the best results in the Category 4 - Other (5.46/7 points), followed by 4.75, 3.70, and 3.51 for ECGs with arrhythmias (Category 1), acute ischemic changes (Category 3), and conduction abnormalities or pacing rhythms (Category 2), respectively.\u003c/p\u003e \u003cp\u003eThe Chi-squared test indicated a statistically significant difference in the accuracy of the \u0026ldquo;\u003cem\u003emost likely diagnosis\u003c/em\u003e\u0026rdquo; between EXP1 and EXP2 (19 vs. 45%, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Category 1 (Arrhythmias) showed an increase in correct interpretations from 3 in EXP1 to 10 in EXP2 (when the context was provided) with a trend toward significance (9.7 vs. 32%, p\u0026thinsp;=\u0026thinsp;0.059). 
Category 2 (Conduction abnormalities or pacing rhythms) mirrored the trend observed in Category 1, observing a slight improvement, with correct interpretations rising from 2 in EXP1 to 8 in EXP2, albeit without reaching statistical significance (4.8 vs. 19%, p\u0026thinsp;=\u0026thinsp;0.088). Category 3 (Acute Coronary Syndrome) experienced the most significant enhancement, with correct interpretations jumping from 4 in EXP1 to 28 in EXP2 (10 vs. 70%, p\u0026thinsp;\u0026lt;\u0026thinsp;0.0.01). Category 4 (Other) also saw an increase in accuracy, with correct interpretations moving from 19 in EXP1 to 22 in EXP2, however indicated no significant impact of additional context on interpretation accuracy within this category (51 vs. 59%, p\u0026thinsp;=\u0026thinsp;0.640).\u003c/p\u003e \u003cp\u003eIn the analysis of the secondary endpoint, comparing the average scores of evaluations provided by the expert panel on aspects such as rhythm, axis, P-wave, PR interval, QRS complex, ST segment, and T wave evaluated by GPT-4, no statistically significant differences were observed between EXP1 and EXP2 in the overall averages (p\u0026thinsp;=\u0026thinsp;0.684) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). 
Similarly, when examining the results within specific four categories, no statistically significant differences were detected: Category 1 (p\u0026thinsp;=\u0026thinsp;0.935), Category 2 (p\u0026thinsp;=\u0026thinsp;0.978), Category 3 (p\u0026thinsp;=\u0026thinsp;0.155), and Category 4 (p\u0026thinsp;=\u0026thinsp;0.706).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eRegarding the additional information on further diagnostic step(s) after ECG interpretation, GPT-4 mostly gave uniform recommendations which included the note that its interpretations must be validated with clinical correlation by a healthcare professional.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eTo the best of our knowledge, this is the first study evaluating the value of GPT-4 in 12-lead ECG interpretation, particularly considering that the computational framework to handle image data was introduced late in 2023. The main findings of this cross-sectional, observational study including 150 ECGs with various cardiac abnormalities, are the following: 1) GPT-4 answered correctly for \"\u003cem\u003emost likely diagnosis\u003c/em\u003e\" in only 19% of cases when tasked without describing any relevant clinical symptoms, 2) GPT-4 performed significantly better when tasked with describing accompanying clinical symptoms, achieving 45% accuracy rate, 3) when tasked without clinical scenarios, GPT-4 performed the best when interpreting ECGs with normal sinus rhythm (51%), and best interpreting ECGs with acute ischemic changes (70% accuracy rate) when tasked with relevant symptoms; 4) GPT-4 mostly gave uniform recommendations regarding the information on further diagnostic step(s), regardless of the underlying pathology on the ECG.\u003c/p\u003e \u003cp\u003eThese results highlight the substantial impact of incorporating contextual information into the ECG interpretation process, particularly noted in the overall accuracy improvement from EXP1 to EXP2 and the remarkable 
enhancement observed within Category 3 \u0026ndash; Acute Coronary Syndrome. Moreover, this indicates that certain types of ECG recording, possibly those with more subtle complex patterns, benefit more significantly from the addition of contextual information.\u003c/p\u003e \u003cp\u003eWhile GPT-4 demonstrated some ability to improve diagnostic accuracy with the inclusion of clinical scenarios in our study, its performance remained variable across different types of cardiac pathologies, which is only partially in line with recent similar studies. Namely, GPT has recently demonstrated remarkable success in non-medical and medical tests such as the MBBS, USMLE Step 1 and Step 2 examinations, and Dutch Family Medicine Examination (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan additionalcitationids=\"CR22 CR23 CR24\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e). These tests primarily involved written questions, where imaging graphical data supporting the vignette were interpreted using a standard format and terminology. With the introduction of image data handling capabilities in GPT-4 research in this area remains scarce. For instance, Massey et al. revealed that GPT versions 3.5 Turbo and 4.0 exhibited greater accuracy in text-based queries, similar to our findings, especially in orthopedic assessment examinations (\u003cspan additionalcitationids=\"CR18\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). 
Our research further indicates that GPT-4's accuracy more than doubles when provided with descriptions of clinical symptoms, leading us to question whether the \u0026bdquo;\u003cem\u003emost likely diagnosis\u003c/em\u003e\u0026ldquo; is influenced more by the interpretation of ECG images or derived from the textual descriptions of relevant clinical symptoms. Despite the observed improvement in diagnostic accuracy after introducing textual contextual data (EXP2), the performance of GPT-4 in (secondary endpoint) evaluations of rhythm, axis, P-wave, PR interval, QRS complex, ST segment, and T wave, as assessed by the expert panel, followed the same trend, regardless of the additional contextual information provided. This suggests that the enhancement in accuracy may be attributed to the provided textual information and underscores the importance of contextual information in enhancing AI's interpretative accuracy (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e), potentially suggesting that LLM\u0026rsquo;s utility in clinical diagnostics could be maximized when combined with detailed patient histories and clinical presentations.\u003c/p\u003e \u003cp\u003eAdditionally, let us mention the research conducted by Currie et al. (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e), which explores the capabilities of the earlier GPT-3.5 model in the context of medical imaging higher education, while also pointing out its limitations in deep and domain-specific knowledge. GPT showed proficiency in foundational subjects, demonstrating its potential in disseminating general medical knowledge. The limitations of GPT become evident in specialized tasks requiring in-depth analysis, such as ECG interpretation. Although this study was conducted using textual data, its observations are consistent with our findings. 
Echoing Currie et al.'s observations, our study also noted GPT-4's variable accuracy in diagnosing complex cardiac conditions, January 2024. Highlighting the critical role of contextual knowledge and expertise in medical diagnostics (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e). Finally, in comparison to highly specialized and research-oriented deep learning tools, our results show lower accuracy. When developed to classify twelve rhythm classes using large numbers of single-lead ECGs, a deep neural network achieved an average area under the receiver operating characteristic curve (AUC) of 0.97 and an average F1 score, which represents the mean of the positive predictive value and sensitivity, of 0.837, exceeding that of average cardiologists (0.780) (\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e). However, this study included single-lead ECG, and a model was target-developed on large number of ECGs. This could also be due to inadequate modelling of GPT-4 by the engineers and lack of the specific literature on 12-lead ECG and ECG images used to train it. It could also be the complexity of 12-lead ECG even when accompanied with relevant clinical symptoms. Our findings suggest there might be a significant gap in the AI\u0026rsquo;s performance when it comes to interpreting image-based medical data as opposed to text-based data.\u003c/p\u003e\n\u003ch3\u003eLimitations\u003c/h3\u003e\n\u003cp\u003eThe results of the present study should be interpreted in the light of several limitations. First, the ECGs were uploaded to the model as digital images (obtained with the iPhone13Pro, Apple, USA) and not through a dedicated application, which could have affected the results. 
The use of digital images, while reflecting a real-life clinical scenario, may introduce variability in image quality and interpretation accuracy.\u003c/p\u003e \u003cp\u003eSecondly, although 150 ECGs were included in the study, this number might still be relatively small for machine learning models to generalize across the wide spectrum of cardiac pathologies. Additionally, the diversity of the sample in terms of patient demographics (age, sex, ethnicity) and the range of cardiac conditions could significantly affect the model's performance. A more heterogeneous sample might provide insights into the model's applicability to a broader patient population. Third, the practicality of integrating GPT-4 into the clinical workflow for ECG interpretation was not assessed. Real-world applicability depends on factors such as ease of use, time efficiency, and compatibility with existing healthcare IT systems. Fourth, the reliance on GPT-4, a proprietary model, introduces considerations regarding accessibility and equity in AI-assisted diagnostics. The cost associated with using such advanced AI tools may limit their availability to healthcare providers in resource-limited settings. Fifth, the study provides a snapshot of GPT-4's performance at a single point in time spanning over January 2024. AI models, particularly those that continue to learn from new data, may experience shifts in performance over time. Lastly, the study does not address the ethical and legal implications of using AI for diagnostic purposes, such as patient consent, data privacy, and liability in case of diagnostic errors.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis is the first study evaluating the value of GPT-4 in 12-lead ECG interpretation. 
GPT-4 exhibited low accuracy in interpreting different categories of pathology in 12-lead ECG, yielding less than 20% accuracy rate when tasked without clinical scenarios and significantly better (45% accuracy rate) when tasked with relevant clinical symptoms provided. This suggests that the enhancement in accuracy may be attributed to the provided textual information and underscores the importance of contextual information in enhancing AI's interpretative accuracy. While the use of an AI in medical diagnostics is an appealing concept in theory, the results suggests that GPT-4, in its current form, would not provide significant aid for ECG interpretation in clinical settings, especially if not provided with clinical symptoms. Continuous strategies and evaluations to improve GPT's accuracy in ECG interpretation remain crucial.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eA.L.Conceptualization, Formal analysis, Methodology, Software, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing; A.J. Data curation, Methodology, Visualization, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing; A.Š. 
Data curation, Formal analysis, Methodology, Visualization, Writing – original draft; I.J.: Data curation, Formal analysis, Methodology, Visualization, Writing – original draft; A.N.: Conceptualization, Formal analysis, Methodology, Software, Writing – review & editing; N.P.: Data curation, Methodology, Supervision, Writing – original draft, Writing – review & editing; Š.M.: Data curation, Methodology, Supervision, Writing – original draft, Writing – review & editing; I.Z.: Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing.

Acknowledgments

We gratefully acknowledge the Luxembourg School of Business for providing the technological support for this research. Additionally, the authors would like to acknowledge the contribution of the COST Action CA21169 (DYNALIFE), supported by COST (European Cooperation in Science and Technology).

Data availability statement

Data supporting this study are available upon reasonable request from the corresponding author.

References

1. Singhal K, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
2. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6.
3. Sezgin E, Sirrianni J, Linwood SL. Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a Service Model. JMIR Med Inform. 2022;10:e32875.
4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.
5. Gala D, Makaryus AN. The Utility of Language Models in Cardiology: A Narrative Review of the Benefits and Concerns of ChatGPT-4. Int J Environ Res Public Health. 2023;20(15):6438.
6. Nov O, Singh N, Mann D. Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study. JMIR Med Educ. 2023;9:e46939.
7. Lim ZW, et al. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770.
8. Kung TH, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
9. Brin D, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492.
10. Novak A, et al. The Pulse of Artificial Intelligence in Cardiology: A Comprehensive Evaluation of State-of-the-art Large Language Models for Potential Use in Clinical Cardiology. medRxiv. 2023:2023-08. (preprint)
11. Moons P, Van Bulck L. ChatGPT: can artificial intelligence language models be of value for cardiovascular nurses and allied health professionals. Eur J Cardiovasc Nurs. 2023;22(7):e55–e59.
12. Fijačko N, Gosak L, Štiglic G, Picard CT, John Douma M. Can ChatGPT pass the life support exams without entering the American heart association course? Resuscitation. 2023;185:109732.
13. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol Sci. 2023;3(4):100324.
14. Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep. 2023;13(1):20512.
15. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31–8.
16. Barash Y, Klang E, Konen E, Sorin V. ChatGPT-4 Assistance in Optimizing Emergency Department Radiology Referrals and Imaging Selection. J Am Coll Radiol. 2023;20(10):998–1003.
17. Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023;31(23):1173–9.
18. Meskó B, Görög M. A short guide for medical professionals in the era of artificial intelligence. NPJ Digit Med. 2020;3:126.
19. Cairns A, et al. A computer human interaction model to improve the diagnostic accuracy and clinical decision making during 12-lead electrocardiogram interpretation. J Biomed Inform. 2016;64:93–107.
20. Birnbaum Y, et al. The role of the ECG in diagnosis, risk estimation, and catheterization laboratory activation in patients with acute coronary syndromes: a consensus document. Ann Noninvasive Electrocardiol. 2014;19(5):412–25.
21. Katz DM, Bommarito MJ, Gao S, Arredondo P. GPT-4 Passes the bar exam. Social Science Research Network. 2023.
22. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB 3rd. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination. JB JS Open Access. 2023;8(3):e23.00056.
23. Gilson A, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
24. Subramani M, Jaleel I, Krishna Mohan S. Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Adv Physiol Educ. 2023;47(2):270–1.
25. Morreel S, Mathysen D, Verhoeven V. Aye, AI! ChatGPT passes multiple-choice family medicine exam. Med Teach. 2023;45(6):665–6.
26. Chitale PA, Gala J, Dabre R. An Empirical Analysis of In-context Learning Abilities of LLMs for MT. arXiv:2401.12097. 2024. (preprint)
27. Currie G, Singh C, Nelson T, Nabasenja C, Al-Hayek Y, Spuur K. ChatGPT in medical imaging higher education. Radiography. 2023;29(4):792–9.
28. Hannun AY, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65–9.

Keywords: GPT-4, 12-Lead Electrocardiogram, Interpretation, Large Language Models, AI in Healthcare, Diagnostic Accuracy
Subject areas: Health sciences/Cardiology; Health sciences/Medical research
Posted: March 27th, 2024 (version 1, Research Square; DOI 10.21203/rs.3.rs-4047752/v1; license CC BY 4.0)

