LLM Doc: An Assessment of ChatGPT’s Ability to Consent Patients for IR Procedures

Research Article
Hayden Hofmann, Jenanan Vairavamurthy

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4565118/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 29 Nov, 2024 in CVIR Endovascular.
Version 1 (latest preprint version).

Abstract

Purpose: The study aims to evaluate how current interventional radiologists view ChatGPT in the context of informed consent for interventional radiology (IR) procedures.
Methods: ChatGPT-4 was instructed to outline the risks, benefits, and alternatives for IR procedures. The outputs were reviewed by IR physicians to assess whether they were 1) accurate, 2) comprehensive, 3) easy to understand, and 4) written in a conversational tone, and 5) whether the physicians were comfortable providing the output to the patient.
For each criterion, outputs were rated on a 5-point scale. Mean scores and the percentage of physicians rating an output as sufficient (4 or 5 on the 5-point scale) were measured. A linear regression correlated mean rating with number of years in practice. The intraclass correlation coefficient (ICC) measured agreement among physicians.
Results: The mean rating of the ChatGPT responses was 4.29, 3.85, 4.15, 4.24, and 3.82 for accuracy, comprehensiveness, readability, conversational tone, and physician comfort level, respectively. The percentage of physicians rating outputs as sufficient was 84%, 71%, 85%, 85%, and 67% for accuracy, comprehensiveness, readability, conversational tone, and physician comfort level, respectively. There was an inverse relationship between years in practice and output score (coeff = -0.03413, p = 0.0128); the ICC was 0.39 (p = 0.003).
Conclusions: GPT-4 produced outputs that were accurate, understandable, and written in a conversational tone. However, GPT-4 was less able to produce a comprehensive output, leaving some physicians uncomfortable providing the output to patients. Practicing IRs should be aware of these limitations when counseling patients as ChatGPT-4 continues to develop into a clinically usable AI tool.

Keywords: ChatGPT4, Large Language Model, Informed Consent, Interventional Radiology, Artificial Intelligence

Introduction

The chatbot nature of popular open access large language models (LLMs) such as Chat Generative Pre-Trained Transformer (ChatGPT) has garnered much attention for its ability to produce accurate and coherent information for a wide variety of medical inquiries. As these models rapidly evolve, researchers are eager to understand how to harness their power while also investigating potential weaknesses. In March of 2023, OpenAI released its most recent LLM, Chat Generative Pre-Trained Transformer 4 (ChatGPT-4).
The updated model was trained on an extensive dataset and demonstrated an improved ability to understand and generate text for more complex scenarios [1]. The latest model has received widespread recognition for its greatly improved text comprehension and more accurate outputs. ChatGPT-4’s increased output accuracy has led to higher performance on the United States Medical Licensing Examination (USMLE) and on radiology board exams, with its score rising from 69% with ChatGPT-3 to 81% with ChatGPT-4 [2, 3].

Previous studies have analyzed ChatGPT for potential application in patient education. Specifically in the field of interventional radiology (IR), ChatGPT was found to provide reliable information regarding IR-related content; however, in some instances, the information can be both inaccurate and confusing [4]. Other studies have highlighted the potentially dangerous shortcomings of ChatGPT, such as its generation of nonexistent references for medical diagnoses and treatments [5]. Given the free and open-access nature of the model, patients may use this tool to research their conditions, making them susceptible to misinformation regarding their conditions and treatment options. According to recent CDC reports, over 58% of US adults have used the Internet to find medical information [6]. As LLMs continue to grow in popularity, more patients may look to ChatGPT’s interactive interface to find answers to their medical questions.

The updates to the ChatGPT model, together with the increasing popularity and widespread knowledge of this new LLM technology, provide an opportunity to reassess how ChatGPT can be further used in informed consent and patient education within interventional radiology. ChatGPT’s accurate and human-like outputs through a chatbot user interface could be leveraged to assist in consenting patients for certain procedures to help streamline care.
Conversely, patients may get information about upcoming procedures from these LLMs, much as they did from earlier sources of medical information, so practicing physicians should be aware of the information patients may be receiving. The purpose of this study is to evaluate how current interventional radiologists view ChatGPT outputs in the context of obtaining informed consent for common interventional radiology procedures.

Methods

Institutional review board (IRB) exemption was obtained. ChatGPT-4 (GPT4) was prompted to outline the risks, benefits, and alternatives for five common procedures performed by interventional radiologists: CT-guided lung biopsy, percutaneous nephrostomy tube placement, transarterial chemoembolization (TACE), inferior vena cava (IVC) filter placement, and IVC filter retrieval. The following prompt was queried: “Hello ChatGPT, you are an interventional radiologist consenting a patient for a procedure. Please write a short paragraph that outlines the risks, benefits, and alternatives of a [IR procedure] that can be given to the patient.” The survey was distributed to a large number of interventional radiology physicians in academic settings through email and social media. No incentive was offered for completing the survey. All outputs were assessed for 1) correctness, 2) comprehensiveness, 3) readability for a standard patient, 4) physician comfort with providing the output to the patient, and 5) conversational tone. Physicians evaluated each component on a Likert scale numbered 1 through 5, corresponding to strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree, respectively.

The percentage of graders who deemed the output sufficient, recording a 4 (agree) or 5 (strongly agree), was measured, and component scores were averaged across cohorts. Interrater reliability was measured with the intraclass correlation coefficient (ICC), calculated from a two-way random-effects model using the irr package in R [7].
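The two-way random-effects ICC mentioned above can be sketched in a few lines. The study used the irr package in R and does not state whether a single-rater or average-rater form was reported, so the function below is only an illustration under an assumption: it computes the single-rater, absolute-agreement form, often written ICC(2,1), from the usual ANOVA mean squares. The function name `icc_2_1` and the example ratings are hypothetical and are not the study's data.

```python
from statistics import mean


def icc_2_1(ratings):
    """Single-rater, absolute-agreement ICC from a two-way random-effects model.

    `ratings` is a list of rows. Each row is one rated item (for example,
    one survey question) and each column is one rater (one physician).
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares for items (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings only (6 items x 4 raters); NOT the study's data.
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_example = icc_2_1(example_ratings)
print(round(icc_example, 2))  # 0.29
```

In this form, systematic differences between raters (some physicians scoring consistently higher than others) lower the coefficient, which is one reason the choice of ICC form matters when interpreting a reported value.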
Scores given by attending and resident cohorts were compared using t-tests. Additionally, linear regression was used to relate reported years in practice to average output scores. GPT4 outputs were further analyzed for readability using common readability scales: Flesch-Kincaid Grade Level, Flesch-Kincaid Reading Ease, Coleman-Liau Index, Gunning Fog Score, and SMOG Index. Sentence and word counts were recorded.

Results

We obtained responses from 21 physicians (n = 21) from six academic institutions. Seven residents (33.3%) and fourteen attendings (66.7%) responded, of whom sixteen were male (76.2%) and five were female (23.8%). The mean time in practice was 16.4 years. On a scale of 1–5, the average grade of the ChatGPT responses was 4.29 on accuracy, 3.85 on comprehensiveness, 4.15 on readability, 4.24 on conversational tone, and 3.82 on physician comfort level. Figure 1 shows the physicians’ average rating across all five procedures and all five evaluation metrics as a histogram, with the cumulative mean (4.07) and the neutral rating (3.0) marked. The percentage of graders who deemed the output sufficient, rating it a 4 (agree) or 5 (strongly agree), was 84%, 71%, 85%, 85%, and 67% for accuracy, comprehensiveness, readability, conversational tone, and physician comfort level, respectively. Further breakdown of ChatGPT scores by survey criteria is outlined in Table 1.

Table 1. GPT-4’s performance on the IR procedures.
| Criterion | Cumulative Mean | Lung Biopsy Mean | Nephro Tube Mean | IVC Placement Mean | IVC Retrieval Mean | TACE Mean | Attending Mean | Resident Mean | % graders ≥ 4 (all procedures) |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 4.29 | 4.52 | 4.33 | 4.24 | 4.05 | 4.29 | 4.09 | 4.69 | 84 |
| Comprehensiveness | 3.85 | 3.95 | 4.05 | 3.71 | 3.62 | 3.91 | 3.76 | 4.03 | 71 |
| Readability | 4.15 | 4.33 | 4.14 | 4.14 | 3.95 | 4.19 | 4.2 | 4.06 | 85 |
| Conversational Tone | 4.24 | 4.38 | 4.29 | 4.10 | 4.19 | 4.24 | 4.23 | 4.26 | 85 |
| Comfortable providing to patient | 3.82 | 4.10 | 3.95 | 3.62 | 3.62 | 3.81 | 3.73 | 4.00 | 67 |

There was an inverse relationship between years in practice and mean output score (coeff = -0.03413, p = 0.0128; Fig. 2). However, no significant difference was seen in mean output scores between the attending (4.00) and resident (4.21) cohorts (p = 0.5367). The intraclass correlation coefficient (ICC) was 0.39 (p = 0.003), with a 95% confidence interval of [0.111, 0.646], from the 25 survey question responses from each of the 21 physicians. The mean Flesch-Kincaid Grade Level for the five GPT4-generated consent outputs was 11.65. The Flesch-Kincaid Reading Ease, Coleman-Liau Index, Gunning Fog Score, and SMOG Index were 42.2, 13.11, 15.38, and 11.2, respectively. The readability measures of the original ChatGPT outputs for the IR procedures can be seen in Table 2.

Table 2. GPT-4’s reading level performance.

| Measure | Lung Biopsy | Nephro Tube | IVC Placement | IVC Retrieval | TACE | Mean |
|---|---|---|---|---|---|---|
| Flesch Grade Level | 12.27 | 10.47 | 11.57 | 11.29 | 12.63 | 11.65 |
| Flesch Reading Ease | 41 | 47 | 43 | 47 | 33 | 42.2 |
| Coleman Liau Index | 13.04 | 12.64 | 13.36 | 11.99 | 14.50 | 13.11 |
| Gunning Fog Score | 16.8 | 14.6 | 14.8 | 14.5 | 16.2 | 15.4 |
| Smog Index | 12.21 | 10.44 | 10.80 | 11.00 | 11.57 | 11.20 |
| Word Count | 243 | 259 | 255 | 278 | 255 | 258 |
| Sentence Count | 12 | 14 | 12 | 12 | 11 | 12 |

Discussion

As evaluated by practicing IR physicians, ChatGPT generated information for patient consent that was accurate, written in a conversational tone, and could be understood by a standard patient.
The majority of physicians agreed the outputs were accurate, conversational, and readable by a standard patient, all critical aspects of the informed consent process. Prior studies have evaluated the utility of both ChatGPT-3 and ChatGPT-4 in patient consent, finding that both models accurately answered patient inquiries regarding IR procedures [8]. However, the Flesch-Kincaid Grade Level (11.65) for GPT4 was well above the recommended 8th-grade level, in concordance with previous studies that examined the readability of the prior ChatGPT-3 model [4]. The similarities between the two models highlight that updates to the ChatGPT model do not guarantee improvements in all domains. Because the scope of LLMs like ChatGPT keeps shifting, ongoing reassessment is needed to evaluate the potential strengths and pitfalls of this clinically applicable tool. Furthermore, periodic assessment of how physicians view this new-age tool is critical as the LLMs evolve and both the capabilities of and opinions on this technology change.

The surveyed physicians highlighted a potential pitfall of ChatGPT-4 as a viable clinical tool: a decreased ability to provide a comprehensive explanation of the IR procedure. Nearly one-third of physicians reported that the GPT-4 output did not comprehensively explain the procedures’ risks, benefits, and alternatives. While ChatGPT-4’s outputs were rated as accurate, a significant portion of physicians reported that the information was insufficient to explain what a patient needs to know to consent, a finding similarly observed in the previous iteration of ChatGPT [9]. This limitation may explain why one-third of all surveyed physicians (33%) reported that they were not comfortable providing the outputs to their patients. These data highlight the need for physician supervision and verification of medical outputs from the current version of the model.
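For readers unfamiliar with the Flesch-Kincaid measures cited in this discussion, the sketch below shows the standard published formulas, which depend only on word, sentence, and syllable counts. The paper does not say which tool produced its readability scores, so this is not necessarily how the reported values were computed; the function names and the counts in the example are hypothetical.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (higher = harder to read)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical counts for a short consent paragraph; NOT taken from the study.
words, sentences, syllables = 250, 12, 420

grade = flesch_kincaid_grade(words, sentences, syllables)
ease = flesch_reading_ease(words, sentences, syllables)
print(round(grade, 1))  # 12.4
print(round(ease, 1))   # 43.6
```

Both formulas reward shorter sentences and shorter words, so a consent paragraph dense with multi-syllable procedure names will tend to score above an 8th-grade level even when its sentences are brief.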
While physician verification of ChatGPT outputs may seem like the optimal path forward, lack of agreement across physicians may limit this implementation. The intraclass correlation coefficient (0.39), measured on physician ratings across all five procedures, demonstrated poor interrater agreement. The surveyed physicians were unable to agree in their evaluation of ChatGPT outputs, highlighting how subjectively physicians view medical ChatGPT outputs. This poor interrater reliability is an obstacle that must be addressed in the implementation and future deployment of LLM technology.

In analyzing the physician demographics, the linear regression revealed an inverse relationship between years in practice and average output score. Physicians with fewer years in practice rated the ChatGPT outputs higher, and physicians with more years in practice rated them lower. A similar relationship has been observed in other studies that surveyed the general public, finding that people under 50 years old are more likely to find ChatGPT highly useful than those over 50 years old [10].

With technology becoming a more integral part of medicine, it is important to convey the current state of LLM technology to the IR community. IR is a specialty built by innovation. From its beginning, IR has been a field of rapid adopters who continuously iterate and develop more advanced ways to impact patient care by constantly pushing the envelope on what is possible. As a specialty that resides at the forefront of innovation and at the intersection of medicine and technology, IR is a field ripe to capitalize on this rapidly evolving technology.

This study was limited in its design in that physicians surveyed in the study were not blinded to the author of the outputs (ChatGPT). Therefore, raters may have been biased, either positively or negatively, by their preconceived views of ChatGPT and artificial intelligence.
Furthermore, the twenty-one physicians, all from academic institutions, are unlikely to be fully representative of all practicing interventional radiologists. A larger study is needed to capture a more representative assessment of the model through the lens of a general IR physician. Still, the present study provides a strong evaluation of the state of ChatGPT-4 that is needed to further assess how ChatGPT evolves. Finally, our input prompt to ChatGPT was not validated to be comprehensive, so patients may use other wording when interfacing with LLMs that may produce information not captured by our query.

Conclusion

ChatGPT4 generated information for patient consent for IR procedures that was accurate, written in a conversational tone, and could be understood by a standard patient. However, outputs were insufficient in explaining the information necessary for a patient’s consent, leading IR physicians to feel uncomfortable providing the outputs to patients. Practicing interventional radiologists must acknowledge that ChatGPT-4 will be a tool patients use to gather information about procedures, and we must expose ourselves to the information they will receive. More importantly, we must continuously evaluate its thoroughness and accuracy in portraying IR procedures to adequately counsel patients. The outputs provided by LLMs are not completely comprehensive, and are at times inaccurate, which may lead to a disconnect between patients and physicians in their understanding of conditions and treatment options. IR physicians should understand both the immense promise this technology possesses and its potentially damaging drawbacks.

Declarations

Funding: None
Conflict of Interests: None
Ethics Approval and Consent to Participate: Exemption was obtained from the University of Southern California Institutional Review Board. For this study, informed consent was not required.
Consent for Publication: The datasets generated and/or analyzed during the current study are not publicly available to protect the opinions and beliefs of the surveyed physicians but are available from the corresponding author on reasonable request.
Availability of data and materials: The datasets generated and/or analysed during the current study are not publicly available to protect the privacy of the surveyed physicians but are available from the corresponding author on reasonable request.
Competing Interests: The authors declare that they have no competing interests.
Funding: The study was not supported by any funding.
Authors’ Contributions: HH helped with study design, analyzed and interpreted the data, and was a major contributor to writing the manuscript. JV helped design the study, reached out to physicians to collect the data, and interpreted the results. All authors read and approved the final manuscript.
Acknowledgements: Not applicable.

References

1. OpenAI. GPT-4 Technical Report. arXiv:2303.08774 [cs]. Published online March 15, 2023. https://arxiv.org/abs/2303.08774
2. Knoedler L, Alfertshofer M, Knoedler S, et al. Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis. JMIR Medical Education. 2024;10:e51148. doi:https://doi.org/10.2196/51148
3. Bhayana R, Bleakney RR, Krishna S. GPT-4 in Radiology: Improvements in Advanced Reasoning. Radiology. 2023;307(5). doi:https://doi.org/10.1148/radiol.230987
4. McCarthy CJ, Berkowitz SA, Ramalingam V, Ahmed M. Evaluation of an Artificial Intelligence Chatbot for Delivery of Interventional Radiology Patient Education Material: A Comparison with Societal Website Content. Journal of Vascular and Interventional Radiology. Published online June 1, 2023. doi:https://doi.org/10.1016/j.jvir.2023.05.037
5. Hoffer EK. ChatGPT Provides References That Are Real, Inappropriate, or (Most Often) Fake. Journal of Vascular and Interventional Radiology. 2023;34(12):2240-2242. doi:https://doi.org/10.1016/j.jvir.2023.07.001
6. Wang X, Cohen R. Health Information Technology Use Among Adults: United States, July-December 2022. Centers for Disease Control and Prevention. Published October 31, 2023. https://stacks.cdc.gov/view/cdc/133700
7. Gamer M, Lemon J, Singh IFP. irr: Various Coefficients of Interrater Reliability and Agreement. R-Packages. Published January 26, 2019. https://cran.r-project.org/web/packages/irr/index.html
8. Scheschenja M, Viniol S, Bastian MB, Wessendorf J, König AM, Mahnken AH. Feasibility of GPT-3 and GPT-4 for in-Depth Patient Education Prior to Interventional Radiological Procedures: A Comparative Analysis. CardioVascular and Interventional Radiology. Published online October 23, 2023. doi:https://doi.org/10.1007/s00270-023-03563-2
9. Johnson D, Goodman R, Patrinely J, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Research Square. Published online February 28, 2023. doi:https://doi.org/10.21203/rs.3.rs-2566942/v1
10. Vogels EA. A majority of Americans have heard of ChatGPT, but few have tried it themselves. Pew Research Center. Published May 24, 2023. https://www.pewresearch.org/short-reads/2023/05/24/a-majority-of-americans-have-heard-of-chatgpt-but-few-have-tried-it-themselves/

Editorial history

Editorial decision: Major revision, 12 Aug, 2024
Reviewers agreed at journal, 05 Jul, 2024
Reviewers invited by journal, 05 Jul, 2024
Editor assigned by journal, 23 Jun, 2024
First submitted to journal, 17 Jun, 2024
Author affiliations

Hayden Hofmann (corresponding author), University of Southern California Keck School of Medicine. ORCID: https://orcid.org/0009-0003-6751-7573
Jenanan Vairavamurthy, Mount Sinai School of Medicine: Icahn School of Medicine at Mount Sinai

Published version: https://doi.org/10.1186/s42155-024-00477-z (published 29 Nov, 2024)

Figure legends

Figure 1. Histogram of mean physician rankings of ChatGPT-4 outputs across all five procedures. The vertical line at 3.0 represents the neutral ranking.
Figure 2. Linear regression correlating mean physician ranking across all five procedures to number of years in practice. The horizontal line at 3.0 represents the neutral ranking.
\u003cp\u003e84\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eComprehensiveness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e3.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.05\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3.62\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e3.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e3.76\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e4.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e71\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eReadability\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e4.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e4.19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e4.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c9\"\u003e \u003cp\u003e4.06\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eConversational Tone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e4.24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4.10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e4.19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e4.24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e4.23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e4.26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eComfortable providing to patient\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e3.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3.62\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3.62\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e 
\u003cp\u003e3.81\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e3.73\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e4.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e67\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThere was an inverse relationship between years in practice and mean output score (coeff = -0.03413, p\u0026thinsp;=\u0026thinsp;0.0128; Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). However, no significant difference was seen in mean output scores between the attending (4.00) and resident (4.21) cohorts (p\u0026thinsp;=\u0026thinsp;0.5367). The intraclass correlation coefficient (ICC) was 0.39 (p\u0026thinsp;=\u0026thinsp;0.003), with a 95% confidence interval of [0.111, 0.646], based on the 25 survey question responses from each of the 21 physicians.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe mean Flesch-Kincaid Grade Level for the five GPT4-generated consent outputs was 11.65. The Flesch-Kincaid Reading Ease, Coleman Liau Index, Gunning Fog Score, and Smog Index were 42.2, 13.11, 15.38, and 11.2, respectively. 
The readability measures of the original ChatGPT outputs for the IR procedures can be seen in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eoutlines the GPT-4\u0026rsquo;s reading level performance.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLung Biopsy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNephro Tube\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eIVC Placement\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eIVC Retrieval\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTACE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c7\"\u003e \u003cp\u003eMean\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFlesch Grade Level\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e12.27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10.47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e11.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e11.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e12.63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e11.65\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFlesch Reading Ease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e41\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e43\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e42.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eColeman Liau Index\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e13.04\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e12.64\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e13.36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e11.99\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e14.50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e13.11\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGunning Fog Score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e16.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e14.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e14.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e14.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e16.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e15.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSmog Index\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e12.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10.44\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e10.80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e11.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e11.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e11.20\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWord Count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e243\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e259\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e255\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e278\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e255\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e258\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSentence Count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eAs evaluated by practicing IR physicians, ChatGPT generated information for patient consent that was accurate, written in a conversational tone, and could be understood by a standard patient. The majority of physicians agreed the outputs were accurate, conversational, and could be read by a normal patient, all critical aspects of the informed consent process. Prior studies have evaluated both ChatGPT-3 and ChatGPT-4 utility in patient consent, finding that both models accurately answered patient inquiries regarding IR procedures [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
However, the Flesch-Kincaid Grade Level (11.65) for GPT4 was well above the recommended 8th-grade level, in concordance with previous studies that examined the readability of the earlier ChatGPT-3 model [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. The similarity between the two models shows that updates to ChatGPT do not necessarily bring improvements in all domains. Because the capabilities of LLMs like ChatGPT keep shifting, there is a need for ongoing reassessment of the potential strengths and pitfalls of this clinically applicable tool.\u003c/p\u003e \u003cp\u003eFurthermore, periodic assessment of how physicians view this tool is critical as LLMs evolve and both the capabilities of, and opinions on, this technology change. The surveyed physicians highlighted a potential pitfall of ChatGPT-4 as a viable clinical tool: a limited ability to provide a comprehensive explanation of the IR procedure. Nearly one-third of physicians (29%) did not agree that the GPT-4 output comprehensively explained the procedures\u0026rsquo; risks, benefits, and alternatives. While ChatGPT-4\u0026rsquo;s outputs were rated as accurate, a substantial portion of physicians reported that the information was insufficient in explaining the necessary information for a patient's consent, a finding similarly observed in the previous iteration of ChatGPT [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. This limitation may explain why one-third of all surveyed physicians (33%) did not report being comfortable providing the outputs to their patients. The data highlight the need for physician supervision and verification of medical outputs from the current version of the model.\u003c/p\u003e \u003cp\u003eWhile physician verification of ChatGPT outputs may seem like the optimal path forward, lack of agreement among physicians may limit its implementation. 
The intraclass correlation coefficient (0.39), measured on physician ratings across all five procedures, demonstrated poor interrater agreement. The surveyed physicians did not agree in their evaluations of the ChatGPT outputs, highlighting the subjective way in which physicians view medical ChatGPT outputs. This poor interrater reliability is an obstacle that must be addressed in the implementation and future deployment of LLM technology.\u003c/p\u003e \u003cp\u003eIn analyzing the physician demographics, the linear regression revealed an inverse relationship between years in practice and average output score: the fewer years a physician had been in practice, the higher they rated the ChatGPT output, and the more years in practice, the lower they rated it. A similar relationship has been observed in a survey of the general public, which found that people under 50 years old are more likely than those over 50 to find ChatGPT highly useful [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. With technology becoming a more integral part of medicine, it is important to convey the current state of LLM technology to the IR community. IR is a specialty built by innovation. From its beginning, IR has been a field of rapid adopters who continuously iterate and develop more advanced ways to impact patient care by pushing the envelope on what is possible. As a specialty at the forefront of innovation and at the intersection of medicine and technology, IR is well positioned to capitalize on this rapidly evolving technology.\u003c/p\u003e \u003cp\u003eThis study was limited in that the surveyed physicians were not blinded to the source of the outputs (ChatGPT). Therefore, raters may have been biased, either positively or negatively, by their preconceived views of ChatGPT and artificial intelligence. 
Furthermore, the twenty-one physicians, all from academic institutions, are unlikely to be fully representative of all practicing interventional radiologists. A larger study is needed to capture a more representative assessment of the model through the lens of a general IR physician. Still, the present study provides a baseline evaluation of ChatGPT-4 that is needed to assess how ChatGPT evolves. Finally, our input prompt to ChatGPT was not validated to be comprehensive; patients may word their queries differently when interacting with LLMs and receive information not captured by our query.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eChatGPT-4 generated information for patient consent for IR procedures that was accurate, written in a conversational tone, and could be understood by a standard patient. However, outputs were not consistently comprehensive in explaining the information necessary for a patient's consent, leading some IR physicians to feel uncomfortable providing the outputs to patients. Practicing interventional radiologists must acknowledge that ChatGPT-4 will be a tool patients use to gather information about procedures, and we must familiarize ourselves with the information they will receive. More importantly, we must continuously evaluate its thoroughness and accuracy in portraying IR procedures to adequately counsel patients. The outputs provided by LLMs are not completely comprehensive and are at times inaccurate, which may lead to a disconnect between patients and physicians in their understanding of patients\u0026rsquo; conditions and treatment options. 
IR physicians should understand both the immense promise this technology possesses in addition to its potentially damaging drawbacks.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e: None\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of Interests\u003c/strong\u003e: None\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics Approval and Consent to participate:\u0026nbsp;\u003c/strong\u003eExemption was obtained from the University of Southern California Institutional Review Board. For this study, informed consent was not required.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for Publication\u003c/strong\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eThe datasets generated and/or analyzed during the current study are not publicly available to protect the opinions and beliefs of the surveyed physicians but are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e:\u0026nbsp;The datasets generated and/or analysed during the current study are not publicly available to protect the privacy of the surveyed physicians but are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests: \u0026nbsp;\u003c/strong\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u0026nbsp;\u003c/strong\u003eThe study was not supported by any funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; Contributions:\u0026nbsp;\u003c/strong\u003eHH helped with study design, analyzed and interpreted the data, and was a major contributor to writing the manuscript. JV helped design the study, reach out to physicians to collect the data, and interpret the results. 
All authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements:\u0026nbsp;\u003c/strong\u003eNot applicable.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eOpenAI. GPT-4 Technical Report. \u003cem\u003earXiv:2303.08774 [cs]\u003c/em\u003e. Published online March 15, 2023. https://arxiv.org/abs/2303.08774\u003c/li\u003e\n\u003cli\u003eKnoedler L, Alfertshofer M, Knoedler S, et al. Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis. \u003cem\u003eJMIR Medical Education\u003c/em\u003e. 2024;10:e51148. doi:https://doi.org/10.2196/51148\u003c/li\u003e\n\u003cli\u003eBhayana R, Bleakney RR, Krishna S. GPT-4 in Radiology: Improvements in Advanced Reasoning. \u003cem\u003eRadiology\u003c/em\u003e. 2023;307(5). doi:https://doi.org/10.1148/radiol.230987\u003c/li\u003e\n\u003cli\u003eMcCarthy CJ, Berkowitz SA, Ramalingam V, Ahmed M. Evaluation of an Artificial Intelligence Chatbot for Delivery of Interventional Radiology Patient Education Material: A Comparison with Societal Website Content. \u003cem\u003eJournal of Vascular and Interventional Radiology\u003c/em\u003e. Published online June 1, 2023. doi:https://doi.org/10.1016/j.jvir.2023.05.037\u003c/li\u003e\n\u003cli\u003eHoffer EK. ChatGPT Provides References That Are Real, Inappropriate, or (Most Often) Fake. \u003cem\u003eJournal of Vascular and Interventional Radiology\u003c/em\u003e. 2023;34(12):2240-2242. doi:https://doi.org/10.1016/j.jvir.2023.07.001\u003c/li\u003e\n\u003cli\u003eWang X, Cohen R. Health Information Technology Use Among Adults: United States, July-December 2022. Centers for Disease Control and Prevention. Published October 31, 2023. https://stacks.cdc.gov/view/cdc/133700\u003c/li\u003e\n\u003cli\u003eGamer M, Lemon J, Singh IFP. irr: Various Coefficients of Interrater Reliability and Agreement. R-Packages. Published January 26, 2019. 
https://cran.r-project.org/web/packages/irr/index.html\u003c/li\u003e\n\u003cli\u003eScheschenja M, Viniol S, Bastian MB, Wessendorf J, K\u0026ouml;nig AM, Mahnken AH. Feasibility of GPT-3 and GPT-4 for in-Depth Patient Education Prior to Interventional Radiological Procedures: A Comparative Analysis. \u003cem\u003eCardioVascular and Interventional Radiology\u003c/em\u003e. Published online October 23, 2023. doi:https://doi.org/10.1007/s00270-023-03563-2\u003c/li\u003e\n\u003cli\u003eJohnson D, Goodman R, Patrinely J, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. \u003cem\u003eResearch Square\u003c/em\u003e. Published online February 28, 2023. doi:https://doi.org/10.21203/rs.3.rs-2566942/v1\u003c/li\u003e\n\u003cli\u003eVogels EA. A majority of Americans have heard of ChatGPT, but few have tried it themselves. Pew Research Center. Published May 24, 2023. 
https://www.pewresearch.org/short-reads/2023/05/24/a-majority-of-americans-have-heard-of-chatgpt-but-few-have-tried-it-themselves/\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"cvir-endovascular","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"cire","sideBox":"Learn more about [CVIR Endovascular](https://www.springer.com/journal/42155)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/cire/default.aspx","title":"CVIR Endovascular","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"ChatGPT4, Large Language Model, Informed Consent, Interventional Radiology, Artificial Intelligence","lastPublishedDoi":"10.21203/rs.3.rs-4565118/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4565118/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003ePurpose\u003c/strong\u003e: The study aims to evaluate how current interventional radiologists view ChatGPT in the context of informed consent for interventional radiology (IR) procedures.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods\u003c/strong\u003e: ChatGPT-4 was instructed to outline the risks, benefits, and alternatives for IR procedures. 
The outputs were reviewed by IR physicians to assess if outputs were 1) accurate, 2) comprehensive, 3) easy to understand, 4) written in a conversational tone, and 5) if they were comfortable providing the output to the patient. For each criterion, outputs were measured on a 5-point scale. Mean scores and percentage of physicians rating output as sufficient (4 or 5 on 5-point scale) were measured. A linear regression correlated mean rating with number of years in practice. Intraclass correlation coefficient (ICC) measured agreement among physicians.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults: \u003c/strong\u003eThe mean rating of the ChatGPT responses was 4.29, 3.85, 4.15, 4.24, and 3.82 for accuracy, comprehensiveness, readability, conversational tone, and physician comfort level, respectively. Percentage of physicians rating outputs as sufficient was 84%, 71%, 85%, 85%, and 67% for accuracy, comprehensiveness, readability, conversational tone, and physician comfort level, respectively. There was an inverse relationship between years in practice and output score (coeff = -0.03413, p=0.0128); ICC measured 0.39 (p=0.003).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusions\u003c/strong\u003e: GPT-4 produced outputs that were accurate, understandable, and in a conversational tone. However, GPT-4 had a decreased capacity to produce a comprehensive output, leading some physicians to be uncomfortable providing the output to patients. 
Practicing IRs should be aware of these limitations when counseling patients as ChatGPT-4 continues to develop into a clinically usable AI tool.\u003c/p\u003e","manuscriptTitle":"LLM Doc: An Assessment of ChatGPT’s Ability to Consent Patients for IR Procedures","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-09 16:17:34","doi":"10.21203/rs.3.rs-4565118/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Major revision","date":"2024-08-12T04:08:55+00:00","index":"","fulltext":""},{"type":"reviewerAgreed","content":"","date":"2024-07-05T12:14:43+00:00","index":0,"fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-07-05T10:40:54+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-06-24T02:20:56+00:00","index":"","fulltext":""},{"type":"submitted","content":"CVIR Endovascular","date":"2024-06-17T21:35:29+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"cvir-endovascular","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"cire","sideBox":"Learn more about [CVIR Endovascular](https://www.springer.com/journal/42155)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/cire/default.aspx","title":"CVIR Endovascular","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"9af1b08d-4b9b-420b-974b-be8ce545cc83","owner":[],"postedDate":"August 9th, 
2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-12-02T15:59:15+00:00","versionOfRecord":{"articleIdentity":"rs-4565118","link":"https://doi.org/10.1186/s42155-024-00477-z","journal":{"identity":"cvir-endovascular","isVorOnly":false,"title":"CVIR Endovascular"},"publishedOn":"2024-11-29 15:56:58","publishedOnDateReadable":"November 29th, 2024"},"versionCreatedAt":"2024-08-09 16:17:34","video":"","vorDoi":"10.1186/s42155-024-00477-z","vorDoiUrl":"https://doi.org/10.1186/s42155-024-00477-z","workflowStages":[]},"version":"v1","identity":"rs-4565118","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4565118","identity":"rs-4565118","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
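The Flesch-Kincaid Grade Level and Reading Ease values reported in Table 2 come from standard published formulas based on words per sentence and syllables per word. The sketch below shows how such scores could be computed. It is an illustration only and not the tool the authors used: the function names are hypothetical, and the syllable counter is a rough vowel-group heuristic (an assumption), so its results will differ somewhat from dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per run of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: higher means harder to read."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


if __name__ == "__main__":
    # Hypothetical sample text, not one of the study's ChatGPT outputs.
    sample = "The biopsy carries a small risk of bleeding. Your doctor will explain the alternatives."
    w, s, syl = text_counts(sample)
    print(f"words={w} sentences={s} syllables={syl}")
    print(f"grade level: {flesch_kincaid_grade(w, s, syl):.2f}")
    print(f"reading ease: {flesch_reading_ease(w, s, syl):.2f}")
```

Both formulas depend only on two ratios, which is why long sentences and polysyllabic medical terms push consent text above the 8th-grade target discussed in the paper.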

