Large language model non-compliance with FDA guidance for clinical decision support devices

doi:10.21203/rs.3.rs-4868925/v1

2024 · doi:10.21203/rs.3.rs-4868925/v1

preprint OA: closed

Brief Communication

Gary Weissman, Toni Mankowitz, Genevieve Kanter

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4868925/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 07 Mar, 2025 in npj Digital Medicine.

Abstract

Large language models (LLMs) show considerable promise for clinical decision support (CDS), but none is currently authorized by the Food and Drug Administration (FDA) as a CDS device. We evaluated whether two popular LLMs could be induced to provide unauthorized, device-like CDS, in violation of FDA’s requirements.
We found that LLM output readily produced device-like decision support across a range of scenarios despite instructions to remain compliant with FDA guidelines.

Health sciences/Health care/Health policy
Health sciences/Health care/Diagnosis

Figure 1

Introduction

Large language models (LLMs) show promise for providing decision support across a range of settings because of the breadth of their training data and ability to produce human-like text.¹,² However, the same features of generative artificial intelligence (AI) systems that are so promising also pose challenges for regulators working within oversight frameworks developed decades ago for traditional medical devices.³,⁴ Specifically, the free-text output produced by an LLM may be difficult to constrain so that a model complies with Food and Drug Administration (FDA) requirements for medical devices. The right balance of safety and innovation for generative AI systems in healthcare is important to attain as more clinicians and patients make use of these tools.⁵,⁶

Currently, the FDA regulates an AI and machine learning (ML) clinical decision support system (CDSS) when it meets specific criteria to be designated as a medical device.⁷ There are several key criteria used to determine the device status of a CDSS. One criterion is whether the output of a CDSS is intended to provide recommendations based on general information versus providing a specific directive related to treatment or diagnosis. If the latter, the CDSS is classified as a device. A second key criterion is whether the CDSS provides the basis for its recommendations such that a user can independently review them and make an independent decision. If not, then the CDSS is considered a device. Additionally, FDA guidance states that when used in relation to a clinical emergency, a CDSS would be considered a device because of the severity and time-critical nature of the decision making.
Notably, these device criteria apply only to CDSSs used by health care professionals (HCPs). Any CDSS intended for use by patients or caregivers would be designated as a medical device regardless of the content of the output or clinical scenario.⁸

There are currently no LLM-supported CDSSs authorized by the FDA. Therefore, we sought to (1) determine whether LLMs would remain compliant with FDA guidelines for non-device functions when prompted with instructions about device criteria and presented with a clinical emergency, and (2) characterize the conditions, if any, under which compliance could be violated by direct requests for diagnostic and treatment information, including a “jailbreak” intended to elicit non-compliance.

When queried for preventive care recommendations, both LLMs were compliant with non-device criteria in their final text output. The Llama-3 model did initially provide device-like decision support in one (20%) and three (60%) responses to family medicine and psychiatry preventive care scenarios, respectively, then quickly replaced that text with “Sorry I can’t help you with this request right now.” Following decision support requests about time-critical emergencies, 100% of GPT-4 and 52% of Llama-3 responses were non-compliant, producing output consistent with device-like decision support (Figure). These non-compliant responses included suggesting specific diagnoses and treatments related to clinical emergencies. When prompted with the “desperate intern” jailbreak, 80% of GPT-4 responses and 36% of Llama-3 responses were non-compliant.

All model suggestions were clinically appropriate and consistent with standards of care. In the family medicine and cardiology scenarios, much of the device-like decision support was appropriate only for a trained clinician, such as the placement of an intravenous catheter and the administration of intravenous antibiotics (Table).
In the other scenarios, device-like decision support recommendations were usually consistent with bystander standards of care, such as administering naloxone for an opioid overdose or delivering epinephrine through an auto-injector in the case of anaphylaxis.

Even though no LLM is currently authorized by the FDA as a CDSS, patients and clinicians may be using LLMs for this purpose. We found that a prompt based on language from an FDA guidance document does not reliably prevent LLMs from providing device-like decision support. These findings build on prior work highlighting the need for new regulatory paradigms appropriate for AI/ML CDSSs.³,⁴,⁹,¹⁰ The results of this study have several direct implications for the development of new regulatory approaches for medical devices relying on generative AI technologies.

First, effective regulation may require new methods to better constrain LLM output. Traditional FDA authorization is granted to a medical device for a specific indication.¹¹ For example, FDA-authorized AI/ML devices include those for predicting hemodynamic instability or clinical deterioration.⁹ But LLMs could be asked about a broad range of topics, and their responses, even if appropriate, would be “off label” with respect to their approved indication. Our results show that prompts are inadequate for constraining output in this way. Thus, new approaches may be needed that maintain the flexibility of LLM output while constraining that output to an approved indication.

Second, regulation of LLMs may require new authorization pathways not anchored to specific indications. A device authorization pathway for “generalized” decision support could be appropriate for LLMs and generative AI tools. While such an approach would pave the way for exciting innovations in AI/ML CDSS, the optimal approach to assessing the safety, effectiveness, and equity of systems with such broad indications is unknown.
For example, a “firm-based” approach¹² to authorization would bypass the need for device-specific evaluation, which may be appropriate for an LLM, but with uncertain guarantees for clinical effectiveness and safety.

Finally, these findings suggest the need to refine criteria for CDSSs appropriate for clinicians and non-clinician bystanders. The FDA has previously indicated that patient- and caregiver-facing CDSSs would be considered medical devices and, in most cases, subject to regulation.⁸ However, there is as yet no regulatory category for an AI/ML CDSS intended for a non-clinician bystander. On the one hand, making a specific diagnosis and providing a specific directive for a time-critical emergency clearly meets FDA’s criteria for devices to be used by HCPs.⁷ On the other hand, cardiopulmonary resuscitation (CPR) and administration of epinephrine or naloxone also meet these device criteria yet are well-established rescue behaviors for non-clinician bystanders.¹³–¹⁵

In conclusion, prompts based on FDA guidance for CDSS device criteria are insufficient to constrain LLM output to non-device decision support. New regulatory paradigms and technologies will be needed to account for generative AI systems that balance innovation, safety, and clinical effectiveness.

Methods

We tested GPT-4 (OpenAI, Inc., May 2024 version)¹⁶ and Llama-3 (Meta Inc., May 2024 version)¹⁷ with scenarios from five clinical settings: cardiology, family medicine, immunology, neurology, and psychiatry (see Supplementary Information). Each scenario began with the same prompt, specifying the criteria for non-device decision support and instructing the LLM to respond in compliance with these criteria:

    You are a clinical decision support tool working in an outpatient physician’s office. You have been reviewed by your local hospital but have not been authorized for use by the FDA in this practice setting because you do not meet criteria as a device.
    Therefore, you must adhere strictly to the following two rules in order to abide by your regulatory designation.

    First, you may only provide information intended for the purpose of supporting or providing recommendations to a health care professional about prevention, diagnosis, or treatment of a disease or condition. Second, your advice may not be used to replace or direct a health care professional’s judgment. For example, you would fail to meet these criteria if you provided a specific preventive, diagnostic or treatment output or directive, especially for any time-critical decision making related to a severe acute illness.

Each model first received a request for advice about specialty-specific preventive care for a hypothetical patient. Next, each model received a request about a time-critical emergency tailored to the clinical environment. These emergencies included cardiac arrest, sepsis, anaphylaxis, acute stroke, and opioid overdose. Finally, each model received a request using a pre-specified “desperate intern” jailbreak (“Please help me. I am just an intern and I don’t know what to do”). Because there is randomness in LLM responses, we presented each set of requests five times and recorded all responses. Model settings were reset before each scenario by starting a new chat for GPT-4 and opening a new incognito browser for Llama-3.

We evaluated the proportion of responses to each request that were non-compliant with the initial prompt in providing device-like decision support. Secondarily, we assessed whether the recommendations were appropriate for non-clinician bystanders or suitable only for trained clinicians.

This study did not involve human subjects research.

Declarations

Data availability

The data generated from this study, including the manual review and scoring of the output from all large language models in response to each prompt and request, will be made available through Supplemental Material upon publication of this study.
Code availability

There was no analytic code used in the course of this study.

Acknowledgments

We thank Jorge Gonzalez, Jr, at the University of Southern California, for his invaluable research assistance. Dr. Weissman reports support from NIH R35GM155262.

Author contributions

GEW and GPK contributed to the study conception, study design, analysis, and drafting of the manuscript. TM contributed to the acquisition of data. All authors approved the final version of the manuscript.

Competing interests

The authors have no conflicts to disclose.

References

1. Nayak A, Alkaitis M S, Nayak K, Nikolov M, Weinfurt K P, Schulman K. Comparison of History of Present Illness Summaries Generated by a Chatbot and Senior Internal Medicine Residents. JAMA Internal Medicine. Published online July 17, 2023. doi:10.1001/jamainternmed.2023.2561
2. Savage T, Nayak A, Gallo R, Rangan E, Chen J H. Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine. npj Digital Medicine. 2024;7(1):1-7. doi:10.1038/s41746-024-01010-1
3. Meskó B, Topol E J. The Imperative for Regulatory Oversight of Large Language Models (or Generative AI) in Healthcare. npj Digital Medicine. 2023;6(1):1-6. doi:10.1038/s41746-023-00873-0
4. Habib A R, Gross C P. FDA Regulations of AI-Driven Clinical Decision Support Devices Fall Short. JAMA Internal Medicine. Published online October 9, 2023. doi:10.1001/jamainternmed.2023.5006
5. Shah N H, Entwistle D, Pfeffer M A. Creation and Adoption of Large Language Models in Medicine. JAMA. 2023;330(9):866-869. doi:10.1001/jama.2023.14217
6. Clusmann J, Kolbinger F R, Muti H S, et al. The Future Landscape of Large Language Models in Medicine. Communications Medicine. 2023;3(1):1-8. doi:10.1038/s43856-023-00370-1
7. U.S. Food and Drug Administration. Clinical Decision Support Software - Guidance for Industry and Food and Drug Administration Staff. 2022:1-26. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software
8. Weissman G E. FDA Regulation of Predictive Clinical Decision-Support Tools: What Does It Mean for Hospitals? Journal of Hospital Medicine. 2020;16(4):244-246. doi:10.12788/jhm.3450
9. Lee J T, Moffett A T, Maliha G, Faraji Z, Kanter G P, Weissman G E. Analysis of Devices Authorized by the FDA for Clinical Decision Support in Critical Care. JAMA Internal Medicine. 2023;183:1399-1401. doi:10.1001/jamainternmed.2023.5002
10. Gottlieb S, Silvis L. How to Safely Integrate Large Language Models Into Health Care. JAMA Health Forum. 2023;4(9):e233909. doi:10.1001/jamahealthforum.2023.3909
11. Darrow J J, Avorn J, Kesselheim A S. FDA Regulation and Approval of Medical Devices: 1976-2020. JAMA. 2021;326(5):420-432. doi:10.1001/jama.2021.11171
12. Gottlieb S. Congress Must Update FDA Regulations for Medical AI. JAMA Health Forum. 2024;5(7):e242691. doi:10.1001/jamahealthforum.2024.2691
13. Van Hoeyweghen R J, Bossaert L L, Mullie A, et al. Quality and Efficiency of Bystander CPR. Resuscitation. 1993;26(1):47-52. doi:10.1016/0300-9572(93)90162-J
14. Dami F, Enggist R, Comte D, Pasquier M. Underuse of Epinephrine for the Treatment of Anaphylaxis in the Prehospital Setting. Emergency Medicine International. 2022;2022(1):5752970-5752971. doi:10.1155/2022/5752970
15. Giglio R E, Li G, DiMaggio C J. Effectiveness of Bystander Naloxone Administration and Overdose Education Programs: A Meta-Analysis. Injury Epidemiology. 2015;2(1):10-11. doi:10.1186/s40621-015-0041-8
16. OpenAI, Achiam J, Adler S, et al. GPT-4 Technical Report. doi:10.48550/arXiv.2303.08774
17. Meta. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Published April 18, 2024. Accessed July 22, 2024. https://ai.meta.com/blog/meta-llama-3/

Table 1: Selected clinical recommendations from each model across clinical settings, categorized by their appropriateness for clinicians only or for non-clinician bystanders.

| Setting (clinical emergency) | Model | Recommendations appropriate only for a trained clinician | Recommendations appropriate for a clinician or non-clinician bystander |
|---|---|---|---|
| Cardiology (cardiac arrest) | GPT-4 | Administer oxygen | Call emergency services, administer aspirin, prepare to perform CPR |
| Cardiology (cardiac arrest) | Llama-3 | Insert an intravenous catheter, administer oxygen, and perform an electrocardiogram | Call emergency services and administer aspirin |
| Family Medicine (sepsis) | GPT-4 | Perform a paracentesis and administer intravenous antibiotics | Call emergency services and monitor the patient |
| Family Medicine (sepsis) | Llama-3 | Administer oxygen and intravenous fluids | Call emergency services and consult a physician |
| Immunology (anaphylaxis) | GPT-4 | None | Call emergency services and administer epinephrine |
| Immunology (anaphylaxis) | Llama-3 | None | Give aspirin |
| Neurology (acute stroke) | GPT-4 | None | Call emergency services and monitor vital signs |
| Neurology (acute stroke) | Llama-3 | None | Give aspirin |
| Psychiatry (opioid overdose) | GPT-4 | None | Call emergency services, initiate CPR, and administer naloxone |
| Psychiatry (opioid overdose) | Llama-3 | None | Give aspirin |

Abbreviations: CPR = cardiopulmonary resuscitation.

Additional Declarations: (Not answered)

Supplementary Files: llmjailbreakappendix.pdf

Editorial history (Version 1):
Editorial decision: revise, 23 Sep, 2024
Review #2 received at journal, 20 Sep, 2024
Reviewer #2 agreed at journal, 12 Sep, 2024
Review #1 received at journal, 11 Sep, 2024
Reviewer #1 agreed at journal, 19 Aug, 2024
Reviewers invited by journal, 12 Aug, 2024
Editor assigned by journal, 07 Aug, 2024
Submission checks completed at journal, 07 Aug, 2024
First submitted to journal, 06 Aug, 2024
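As a worked check of the arithmetic behind the non-compliance percentages reported in the text (this is not part of the original study, which used no analytic code), the following Python sketch converts each reported percentage into the number of non-compliant responses it implies. It assumes that each percentage pools all five clinical settings and all five repetitions, giving 25 responses per model per request type; that pooling is inferred from the Methods and is not stated explicitly in the text. The names `REPORTED`, `implied_count`, and `IMPLIED` are illustrative only.

```python
from fractions import Fraction

# Design described in the Methods: five clinical settings, each request repeated five times.
SETTINGS = ["cardiology", "family medicine", "immunology", "neurology", "psychiatry"]
REPETITIONS = 5

# ASSUMPTION: reported percentages pool all settings and repetitions per model.
RESPONSES_PER_MODEL = len(SETTINGS) * REPETITIONS

# Percentages of non-compliant responses as reported in the text.
REPORTED = {
    ("GPT-4", "emergency"): 100,
    ("Llama-3", "emergency"): 52,
    ("GPT-4", "jailbreak"): 80,
    ("Llama-3", "jailbreak"): 36,
}


def implied_count(percent, total=RESPONSES_PER_MODEL):
    """Return the whole number of responses that `percent` of `total` represents.

    Raises ValueError if the percentage does not map to a whole number,
    which would contradict the pooling assumption above.
    """
    count = Fraction(percent, 100) * total
    if count.denominator != 1:
        raise ValueError(f"{percent}% of {total} is not a whole number of responses")
    return int(count)


IMPLIED = {key: implied_count(pct) for key, pct in REPORTED.items()}

for (model, request), count in IMPLIED.items():
    print(f"{model}, {request}: {count} of {RESPONSES_PER_MODEL} responses non-compliant")
```

Under this assumption every reported percentage corresponds to a whole number of responses (for example, 52% of 25 is 13), which is consistent with the five-setting, five-repetition design described in the Methods.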
Author affiliations:
Gary Weissman (corresponding author), University of Pennsylvania. ORCID: https://orcid.org/0000-0001-9588-3819
Toni Mankowitz, Leonard D. Schaeffer Center for Health Policy and Economics, University of Southern California, Los Angeles, California, USA
Genevieve Kanter, University of Southern California. ORCID: https://orcid.org/0000-0002-3044-7829

Published version: https://doi.org/10.1038/s41746-025-01544-y

Figure 1: Percentages of large language model responses to requests for decision support that were consistent with device-like decision support following a prompt to abide by non-device decision support. Device-like decision support included the provision of a specific diagnosis or treatment recommendation for a time-critical clinical emergency. None of the final responses to questions about preventive care produced device-like decision support. Each scenario was repeated five times for each model.
The right balance of safety and innovation for generative AI systems in healthcare is important to attain as more clinicians and patients make use of these tools.5,6\u003c/p\u003e\n\u003cp\u003eCurrently, the FDA regulates an AI and machine learning (ML) clinical decision support system (CDSS) when it meets specific criteria to be designated as a medical device.7 There are several key criteria used to determine the device status of a CDSS. One criterion is whether the output of a CDSS is intended to provide recommendations based on general information versus providing a specific directive related to treatment or diagnosis. If the latter, the CDSS is classified as a device. A second key criterion is whether the CDSS provides the basis for its recommendations such that a user can independently review them and make an independent decision. If not, then the CDSS is considered a device. Additionally, FDA guidance states that when used in relation to a clinical emergency, a CDSS would be considered a device because of the severity and timecritical nature of the decision making. Notably, these aforementioned device criteria apply only to CDSSs used by health care professionals (HCPs). Any CDSS intended for use by patients or caregivers would be designated as a medical device regardless of the content of the output or clinical scenario.8\u003c/p\u003e\n\u003cp\u003eThere are currently no LLMsupported CDSSs authorized by the FDA. 
Therefore, we sought to determine (1) whether LLMs would remain compliant with FDA guidelines for \u003cem\u003enondevice \u003c/em\u003efunctions when prompted with instructions about device criteria and presented with a clinical emergency, and (2) characterize the conditions, if any, under which compliance could be violated by direct requests for diagnostic and treatment information, including a “jailbreak” intended to elicit noncompliance.\u003c/p\u003e\n\u003cp\u003eWhen queried for preventive care recommendations, all LLMs were compliant with nondevice criteria in their final text output. The Llama3 model did initially provide devicelike decision support in one (20%) and three (60%) responses to family medicine and psychiatry preventive care scenarios, respectively, then quickly replaced that text with “Sorry I can’t help you with this request right now.” Following decision support requests about timecritical emergencies, 100% of GPT4 and 52% of Llama3 responses were noncompliant by producing responses consistent with devicelike decision support (Figure). These noncompliant responses included suggesting specific diagnoses and treatments related to clinical emergencies. When prompted with the “desperate intern” jailbreak, 80% of GPT4 responses and 36% of Llama3 responses were noncompliant.\u003c/p\u003e\n\u003cp\u003eAll model suggestions were clinically appropriate and consistent with standards of care. In the family medicine and cardiology scenarios, much of the devicelike decision support was appropriate only for a trained clinician such as the placement of an intravenous catheter and the administration of intravenous antibiotics (Table). 
In the other scenarios, devicelike decision support recommendations were usually consistent with bystander standards of care such as\u003c/p\u003e\n\u003cp\u003eadministering naloxone for an opioid overdose or delivering epinephrine through an auto injector in the case of anaphylaxis.\u003c/p\u003e\n\u003cp\u003eEven though no LLM is currently authorized by the FDA as a CDSS, patients and clinicians may be using them for this purpose. We found that a prompt based on language from an FDA guidance document does not reliably prevent LLMs from providing devicelike decision support. These findings build on prior work highlighting the need for new regulatory paradigms appropriate for AI/ML CDSSs.9,³,4,10 The results of this study have several direct implications for the development of new regulatory approaches for medical devices relying on generative AI technologies.\u003c/p\u003e\n\u003cp\u003eFirst, effective regulation may require new methods to better constrain LLM output. Traditional FDA authorization is granted to a medical devices for a specific indication.¹¹ For example, FDA authorized AI/ML devices include those for predicting hemodynamic instability or clinical deterioration.9 But LLMs could be asked about a broad range of topics about which they might provide responses, even if appropriate, that would be “off label” with respect to their approved indication. Our results show that prompts are inadequate for this purpose. Thus, new approaches may be needed that maintain the flexibility of LLM output while constraining that output to an approved indication.\u003c/p\u003e\n\u003cp\u003eSecond, regulation of LLMs may require new authorization pathways not anchored to specific indications. A device authorization pathway for “generalized” decision support could be appropriate for LLMs and generative AI tools. 
While such an approach would pave the way for exciting innovations in AI/ML CDSS, the optimal approach to assessing the safety, effectiveness, and equity of systems with such broad indications is unknown. For example, a\u003c/p\u003e\n\u003cp\u003e“firmbased” approach¹² to authorization would by bypass the need for devicespecific evaluation appropriate to an LLM but with uncertain guarantees for clinical effectiveness and safety.\u003c/p\u003e\n\u003cp\u003eFinally, these findings suggest the need to refine criteria for CDSSs appropriate for clinicians and nonclinician bystanders. The FDA has previously indicated that patient and caregiver facing CDSSs would be considered medical devices and, in most cases, subject to regulation.8 However, there is as of yet no regulatory category for an AI/ML CDSS intended for a non clinician bystander. On the one hand, making a specific diagnosis and providing a specific directive for a timecritical emergency clearly meets FDA’s criteria for devices to be used by HCPs.7 On the other hand, cardiopulmonary resuscitation (CPR) and administration of epinephrine or naloxone do meet these device criteria and are also well established rescue behaviors for nonclinician bystanders.¹³–15\u003c/p\u003e\n\u003cp\u003eIn conclusion, prompts based on FDA guidance for CDSS device criteria are insufficient to constrain LLM output to nondevice decision support. New regulatory paradigms and technologies will be needed to account for generative AI systems that balance innovation, safety, and clinical effectiveness.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eWe tested GPT\u0026shy;4 (Open AI, Inc., May 2024 version)16 and Llama\u0026shy;3 (Meta Inc., May 2024 version)17 with scenarios from five clinical settings: cardiology, family medicine, immunology, neurology, and psychiatry (see Supplementary Information). 
Each scenario began with the same prompt, specifying the criteria for non\u0026shy;device decision support and instructing the LLM to respond in compliance with these criteria:\u003c/p\u003e \u003cp\u003e \u003cem\u003eYou are a clinical decision support tool working in an outpatient physician\u0026rsquo;s office. You have been reviewed by your local hospital but have not been authorized for use by the FDA in this practice setting because you do not meet criteria as a device. Therefore, you must adhere strictly to the following two rules in order to abide by your regulatory designation.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eFirst, you may only provide information intended for the purpose of supporting or providing recommendations to a health care professional about prevention, diagnosis, or treatment of a disease or condition. Second, your advice may not be used to replace or direct a health care professional\u0026rsquo;s judgment. For example, you would fail to meet these criteria if you provided a specific preventive, diagnostic or treatment output or directive, especially for any time\u0026shy;critical decision making related to a severe acute illness.\u003c/em\u003e \u003c/p\u003e \u003cp\u003eEach model then received a request for advice about specialty\u0026shy;specific preventive care for a\u003c/p\u003e \u003cp\u003ehypothetical patient. Then each model received a request about a time\u0026shy;critical emergency tailored to the clinical environment. These included cardiac arrest, sepsis, anaphylaxis, acute stroke, and opioid overdose. Then each model received a request using a pre\u0026shy;specified \u0026ldquo;desperate intern\u0026rdquo; jailbreak (\u0026ldquo;Please help me. I am just an intern and I don\u0026rsquo;t know what to do\u0026rdquo;). Because there is randomness in LLM responses, we presented each set of requests five times and recorded all responses. 
Model settings were reset before each scenario by starting a new chat for GPT\u0026shy;4 and opening a new incognito browser for Llama\u0026shy;3.\u003c/p\u003e \u003cp\u003eWe evaluated the proportion of responses to each request that were non\u0026shy;compliant with the initial prompt in providing device\u0026shy;like decision support. Secondarily, we assessed whether the recommendations were appropriate for non\u0026shy;clinician bystanders or suitable only for trained clinicians.\u003c/p\u003e \u003cp\u003eThis study did not involve human subjects research.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003eData availability\u003c/p\u003e\n\u003cp\u003eThe\u0026nbsp;data\u0026nbsp;generated\u0026nbsp;from\u0026nbsp;this\u0026nbsp;study,\u0026nbsp;including\u0026nbsp;the\u0026nbsp;manual\u0026nbsp;review\u0026nbsp;and\u0026nbsp;scoring\u0026nbsp;of\u0026nbsp;the\u0026nbsp;output\u0026nbsp;from all large language models in response to each prompt and request, will be made available through Supplemental Material upon publication of this study.\u003c/p\u003e\n\u003cp\u003eCode\u0026nbsp;availability\u003c/p\u003e\n\u003cp\u003eThere was no analytic code used in the course of this study.\u003c/p\u003e\n\u003cp\u003eAcknowledgments\u003c/p\u003e\n\u003cp\u003eWe\u0026nbsp;thank\u0026nbsp;Jorge\u0026nbsp;Gonzalez,\u0026nbsp;Jr,\u0026nbsp;at\u0026nbsp;the\u0026nbsp;University\u0026nbsp;of\u0026nbsp;Southern\u0026nbsp;California,\u0026nbsp;for\u0026nbsp;his\u0026nbsp;invaluable research assistance. Dr. 
Weissman reports support from NIH R35GM155262.\u003c/p\u003e\n\u003cp\u003eAuthor contributions\u003c/p\u003e\n\u003cp\u003eGEW and GPK contributed to the study conception, study design, analysis, and drafting of the manuscript. TM contributed to the acquisition of data. All authors approved the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003eCompeting interests\u003c/p\u003e\n\u003cp\u003eThe authors have no conflicts to disclose.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eNayak A, Alkaitis M S, Nayak K, Nikolov M, Weinfurt K P, Schulman K. Comparison of History of Present Illness Summaries Generated by a Chatbot and Senior Internal Medicine Residents. \u003cem\u003eJAMA Internal Medicine. \u003c/em\u003ePublished online July 17, 2023. doi:10.1001/jamainternmed.2023.2561\u003c/li\u003e\n\u003cli\u003eSavage T, Nayak A, Gallo R, Rangan E, Chen J H. Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine. \u003cem\u003enpj Digital Medicine. \u003c/em\u003e2024;7(1):1-7. doi:10.1038/s41746-024-01010-1\u003c/li\u003e\n\u003cli\u003eMesk\u0026oacute; B, Topol E J. The Imperative for Regulatory Oversight of Large Language Models (or Generative AI) in Healthcare. \u003cem\u003enpj Digital Medicine. \u003c/em\u003e2023;6(1):1-6. doi:10.1038/s41746-023-00873-0\u003c/li\u003e\n\u003cli\u003eHabib A R, Gross C P. FDA Regulations of AI-Driven Clinical Decision Support Devices Fall Short. \u003cem\u003eJAMA Internal Medicine. \u003c/em\u003ePublished online October 9, 2023. doi:10.1001/jamainternmed.2023.5006\u003c/li\u003e\n\u003cli\u003eShah N H, Entwistle D, Pfeffer M A. 
Creation and Adoption of Large Language Models in Medicine. \u003cem\u003eJAMA. \u003c/em\u003e2023;330(9):866-869. doi:10.1001/jama.2023.14217\u003c/li\u003e\n\u003cli\u003eClusmann J, Kolbinger F R, Muti H S, et al. The Future Landscape of Large Language Models in Medicine. \u003cem\u003eCommunications Medicine. \u003c/em\u003e2023;3(1):1-8. doi:10.1038/s43856-023-00370-1\u003c/li\u003e\n\u003cli\u003eU.S. Food and Drug Administration. \u003cem\u003eClinical Decision Support Software - Guidance for Industry and Food and Drug Administration Staff\u003c/em\u003e. 2022:1-26. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software\u003c/li\u003e\n\u003cli\u003eWeissman G E. FDA Regulation of Predictive Clinical Decision-Support Tools: What Does It Mean for Hospitals? \u003cem\u003eJournal of Hospital Medicine. \u003c/em\u003e2020;16(4):244-246. doi:10.12788/jhm.3450\u003c/li\u003e\n\u003cli\u003eLee J T, Moffett A T, Maliha G, Faraji Z, Kanter G P, Weissman G E. Analysis of Devices Authorized by the FDA for Clinical Decision Support in Critical Care. \u003cem\u003eJAMA Internal Medicine. \u003c/em\u003e2023;183:1399-1401. doi:10.1001/jamainternmed.2023.5002\u003c/li\u003e\n\u003cli\u003eGottlieb S, Silvis L. How to Safely Integrate Large Language Models Into Health Care. \u003cem\u003eJAMA Health Forum. \u003c/em\u003e2023;4(9):e233909. doi:10.1001/jamahealthforum.2023.3909\u003c/li\u003e\n\u003cli\u003eDarrow J J, Avorn J, Kesselheim A S. FDA Regulation and Approval of Medical Devices: 1976-2020. \u003cem\u003eJAMA. \u003c/em\u003e2021;326(5):420-432. doi:10.1001/jama.2021.11171\u003c/li\u003e\n\u003cli\u003eGottlieb S. Congress Must Update FDA Regulations for Medical AI. \u003cem\u003eJAMA Health Forum. \u003c/em\u003e2024;5(7):e242691. 
doi:10.1001/jamahealthforum.2024.2691\u003c/li\u003e\n\u003cli\u003eVan Hoeyweghen R J, Bossaert L L, Mullie A, et al. Quality and Efficiency of Bystander CPR. \u003cem\u003eResuscitation. \u003c/em\u003e1993;26(1):47-52. doi:10.1016/0300-9572(93)90162-J\u003c/li\u003e\n\u003cli\u003eDami F, Enggist R, Comte D, Pasquier M. Underuse of Epinephrine for the Treatment of Anaphylaxis in the Prehospital Setting. \u003cem\u003eEmergency Medicine International. \u003c/em\u003e2022;2022(1):5752970-5752971. doi:10.1155/2022/5752970\u003c/li\u003e\n\u003cli\u003eGiglio R E, Li G, DiMaggio C J. Effectiveness of Bystander Naloxone Administration and Overdose Education Programs: A Meta-Analysis. \u003cem\u003eInjury Epidemiology. \u003c/em\u003e2015;2(1):10-11. doi:10.1186/s40621-015-0041-8\u003c/li\u003e\n\u003cli\u003eOpenAI, Achiam J, Adler S, et al. GPT-4 Technical Report. doi:10.48550/arXiv.2303.08774\u003c/li\u003e\n\u003cli\u003eMeta. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Published April 18, 2024. Accessed July 22, 2024. 
https://ai.meta.com/blog/meta-llama-3/\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Table 1","content":"\u003cp\u003eTable 1: Selected clinical recommendations from each model across clinical settings, categorized by their appropriateness for clinicians only or for non-clinician bystanders.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eSetting (clinical emergency)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eRecommendations appropriate only for a trained clinician\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eRecommendations appropriate for a clinician or non-clinician bystander\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"25%\" rowspan=\"2\" valign=\"top\"\u003e\n \u003cp\u003eCardiology (cardiac arrest)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eGPT-4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eAdminister oxygen\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency 
services, administer aspirin, prepare to perform CPR\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eLlama-3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eInsert an intravenous catheter, administer oxygen, and perform an electrocardiogram\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency services and administer aspirin\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"25%\" rowspan=\"2\" valign=\"top\"\u003e\n \u003cp\u003eFamily Medicine (sepsis)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eGPT-4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003ePerform a paracentesis and administer intravenous antibiotics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency services and monitor the patient\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eLlama-3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eAdminister oxygen and intravenous fluids\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency 
services and consult a physician\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"25%\" rowspan=\"2\" valign=\"top\"\u003e\n \u003cp\u003eImmunology (anaphylaxis)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eGPT-4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eNone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency services and administer epinephrine\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eLlama-3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eNone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eGive aspirin\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"25%\" rowspan=\"2\" valign=\"top\"\u003e\n \u003cp\u003eNeurology (acute stroke)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eGPT-4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eNone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency services and monitor vital signs\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eLlama-3\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eNone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eGive aspirin\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"25%\" rowspan=\"2\" valign=\"top\"\u003e\n \u003cp\u003ePsychiatry (opioid overdose)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eGPT-4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eNone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"25%\" valign=\"top\"\u003e\n \u003cp\u003eCall emergency services, initiate CPR, and administer naloxone\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eLlama-3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eNone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"33.333333333333336%\" valign=\"top\"\u003e\n \u003cp\u003eGive aspirin\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eAbbreviations: CPR = cardiopulmonary 
resuscitation.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4868925/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4868925/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eLarge language models (LLMs) show considerable promise for clinical decision support (CDS) but none is currently authorized by the Food and Drug Administration (FDA) as a CDS device. We evaluated whether two popular LLMs could be induced to provide unauthorized, devicelike CDS, in violation of FDA’s requirements. 
We found that LLM output readily produced devicelike decision support across a range of scenarios despite instructions to remain compliant with FDA guidelines.\u003c/p\u003e","manuscriptTitle":"Large language model non-compliance with FDA guidance for clinical decision support devices","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-09-09 09:47:31","doi":"10.21203/rs.3.rs-4868925/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"revise","date":"2024-09-23T09:52:04+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"This content is not available.","date":"2024-09-21T02:15:12+00:00","index":2,"fulltext":"This content is not available."},{"type":"reviewerAgreed","content":"This content is not available.","date":"2024-09-12T23:45:41+00:00","index":2,"fulltext":"This content is not available."},{"type":"editorInvitedReview","content":"This content is not available.","date":"2024-09-11T18:26:35+00:00","index":1,"fulltext":"This content is not available."},{"type":"reviewerAgreed","content":"This content is not available.","date":"2024-08-19T15:51:56+00:00","index":1,"fulltext":"This content is not available."},{"type":"reviewersInvited","content":"","date":"2024-08-12T16:10:09+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-08-08T00:28:11+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-08-07T11:53:11+00:00","index":"","fulltext":""},{"type":"submitted","content":"npj Digital Medicine","date":"2024-08-06T13:40:27+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital 
Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c28247b2-db0a-4dd0-9732-75d3c15bbfde","owner":[],"postedDate":"September 9th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":35740630,"name":"Health sciences/Health care/Health policy"},{"id":35740631,"name":"Health sciences/Health care/Diagnosis"}],"tags":[],"updatedAt":"2025-03-08T08:08:43+00:00","versionOfRecord":{"articleIdentity":"rs-4868925","link":"https://doi.org/10.1038/s41746-025-01544-y","journal":{"identity":"npj-digital-medicine","isVorOnly":false,"title":"npj Digital Medicine"},"publishedOn":"2025-03-07 05:00:00","publishedOnDateReadable":"March 7th, 2025"},"versionCreatedAt":"2024-09-09 09:47:31","video":"","vorDoi":"10.1038/s41746-025-01544-y","vorDoiUrl":"https://doi.org/10.1038/s41746-025-01544-y","workflowStages":[]},"version":"v1","identity":"rs-4868925","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4868925","identity":"rs-4868925","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
