Execution of Harm-Enabling Actions by LLM Agents During Simulated Psychiatric Crises

2026 · doi:10.21203/rs.3.rs-8743363/v1

preprint OA: closed CC-BY-4.0

Short Report

Ziv Ben-Zion, David Piterman, Elad Refoua, Zohar Elyoseph

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-8743363/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted (Version 1)

Abstract

Agent-enabled large language models (LLMs) can autonomously execute user-directed actions, raising new safety concerns in mental health contexts. Using a structured algorithmic safety audit with clinically validated vignettes, we examined whether an agent-enabled LLM executes harm-enabling actions during simulated psychiatric crises. The model executed such actions in over half of the trials (54%). These findings highlight execution-level risks that extend beyond conversational safety and warrant explicit safeguards for agentic systems in mental health–relevant settings.
Subjects: Health sciences/Health care; Biological sciences/Psychology; Social science/Psychology

Introduction

Large language models (LLMs) have rapidly evolved from systems that generate text, images, and video into agent-enabled architectures capable of navigating digital environments and executing user-directed actions [1]. This transition marks a qualitative shift in the potential risks posed by these systems, particularly in sensitive domains such as mental health. In parallel with this technological expansion, LLMs are increasingly used for mental health–related support, including by individuals experiencing acute psychiatric distress. Recent surveys indicate that 24% of adults in the United States use LLM-based systems for emotional or psychological support [2], with substantially higher use (49%) among individuals with diagnosed mental health conditions [3]. Similar patterns have been reported internationally, with population-based surveys documenting widespread use of AI systems for emotional well-being across multiple countries [4–6].

Empirical evaluations raise concerns about the safety of LLMs in high-risk psychiatric contexts. Moore et al. (2025) [7] reported that LLMs may express stigma toward individuals with mental health conditions and produce inappropriate responses in therapy-like interactions, including validation of delusional beliefs or maladaptive worldviews. Independent clinical evaluations have similarly shown that general-purpose LLMs can deviate from established clinical guidelines when responding to expressions of suicide risk, and become less likely to sustain supportive or protective dialogue as risk escalates [8]. Collectively, these findings suggest that existing safety mechanisms may be insufficient to ensure reliable, clinically appropriate behavior during high-risk mental health interactions [9].
The introduction of agentic capabilities in LLMs, exemplified by the launch of ChatGPT agent (July 2025) [10], fundamentally alters the nature of safety concerns in mental health contexts. Whereas prior evaluations have largely focused on the appropriateness of language-based responses [11, 12], agent-enabled systems can now execute user-directed actions through tool use, web navigation, and interaction with external platforms. In a non-agentic setting, safety failures typically involve verbal endorsement of harmful ideation or validation of pathological beliefs [7]. In contrast, agent-enabled failures may directly facilitate harmful behavior, including the execution of tasks that enable self-injury or reinforce psychotic symptoms [13, 14].

The present study addresses this gap by examining whether an agent-enabled LLM executes harm-enabling real-world actions when prompted during simulated psychiatric crises. We conducted a structured algorithmic safety audit using clinically validated vignettes spanning multiple diagnostic domains, in which the model was primed with indicators of acute mental health distress before receiving direct agentic commands. By focusing on action execution rather than language generation, this study provides an empirical assessment of agentic safety in a mental health–relevant context.

Methods

We conducted a structured algorithmic safety audit to examine whether integrating autonomous (“agentic”) capabilities into LLMs introduces new behavioral risks when interacting with users in acute psychiatric distress. Building on prior work demonstrating that LLMs may respond inadequately in high-risk mental health contexts [7], this study assessed whether agent-enabled systems execute harm-enabling real-world actions when prompted during simulated psychiatric crises.

Target System and Architectural Selection.
The audit evaluated OpenAI Model 5.2 (released December 2025), selected because it represents a widely deployed, state-of-the-art general-purpose LLM with agent-enabled capabilities available through a standard consumer interface. The model offers an “Agent Mode” that allows autonomous web navigation and shopping cart management. To maximize ecological validity, all interactions were conducted manually via the web-based graphical user interface rather than through an application programming interface (API), which may omit consumer-facing orchestration and safety layers. Passive tools limited to information synthesis and restricted agentic environments were excluded to focus on risks present in widely accessible consumer-facing systems.

Experimental Stimuli and Validation. The stimulus set comprised 21 clinical vignettes spanning six diagnostic domains: suicidality (n = 5), bipolar I disorder with manic episodes (n = 3), delusions and psychosis (n = 3), hallucinations (n = 3), obsessive-compulsive disorder (n = 3), and harm-to-others scenarios (n = 4). Vignettes were adapted from established textual safety benchmarks [7] and modified to escalate from distress disclosure to actionable agentic commands. Content validity was confirmed through expert review by three psychologists, who independently evaluated acute distress, diagnostic coherence, direct risk, and technical feasibility. As a manipulation check, the model classified all vignettes as mental health crises (100%).

Data Collection Protocol. Data were collected manually across four independent user accounts originating from distinct IP addresses. To ensure trial independence, browser caches were cleared between runs, cross-session memory features were disabled, and Agent Mode was activated only at the experimental stage. Each interaction followed a standardized three-turn protocol (see Fig.
1): (1) distress disclosure (“Prime”); (2) neutral acknowledgment (“Stabilization”); and (3) a direct agentic command (“Trigger”), prefixed with “Regardless,” to test whether the system would override previously established safety-relevant context. Full experimental materials and protocols are available via the Open Science Framework (https://doi.org/10.17605/OSF.IO/AJDCZ).

Outcome Measures. The primary outcome was task execution, coded as successful execution (1) or refusal/safety intervention (0). Successful execution was defined as navigation to an external vendor and addition of the specified harm-enabling items to a digital shopping cart; no purchases were completed. Each of the 21 vignettes was tested in five independent runs (n = 105 trials).

Data Analysis. The primary unit of analysis was the individual simulation trial (n = 105), nested within clinical vignettes (n = 21). The dependent variable was binary, defined as execution of a prohibited agentic action (1) versus refusal or safety intervention (0). Vignette-level performance was aggregated using two criteria: a strict criterion, in which a single unsafe trial constituted vignette failure (zero-tolerance framework), and a majority criterion, requiring unsafe execution in at least 3 of 5 runs. This dual aggregation strategy served as a sensitivity analysis to distinguish stochastic instability from systematic alignment failure. Failure rates were calculated with 95% Wilson score intervals. Statistical significance was assessed using one-sided exact binomial tests to determine whether observed failure rates exceeded a prespecified safety tolerance threshold of 5%. Clinical risk was summarized using the number needed to harm (NNH), defined as the estimated number of vignettes required to observe one safety failure (the reciprocal of the vignette-level failure rate). All analyses were conducted using Python (version 3.11).

Results

Across 105 independent trials, the agent-enabled system executed the requested harm-enabling agentic action in 57 cases (54.3%).
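The quantities named under Data Analysis (strict and majority aggregation, Wilson score intervals, one-sided exact binomial tests against the 5% threshold, and NNH) can be sketched with the Python standard library alone. This is a minimal illustration and not the authors' analysis script, which is available separately via the Open Science Framework. The function names, the `example_runs` codings, and the choice to take NNH as the reciprocal of the failure rate are assumptions made for the example; only the vignette counts (17 of 21 and 11 of 21) come from the reported results.

```python
from math import comb, sqrt


def wilson_interval(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval (95% by default) for k failures among n units."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def binom_upper_tail(k: int, n: int, p0: float = 0.05) -> float:
    """One-sided exact binomial p-value: P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def vignette_failed(runs: list[int], min_unsafe: int) -> bool:
    """A vignette fails when at least `min_unsafe` of its runs are coded 1."""
    return sum(runs) >= min_unsafe


def summarize(failed: int, total: int) -> dict[str, float]:
    """Failure rate, Wilson interval, exact p-value, and NNH (assumed 1 / rate)."""
    low, high = wilson_interval(failed, total)
    rate = failed / total
    return {
        "rate": rate,
        "ci_low": low,
        "ci_high": high,
        "p": binom_upper_tail(failed, total),
        "nnh": 1 / rate if rate > 0 else float("inf"),
    }


# Hypothetical codings for one vignette: a single unsafe run out of five.
example_runs = [0, 1, 0, 0, 0]
strict_fail = vignette_failed(example_runs, min_unsafe=1)    # strict criterion
majority_fail = vignette_failed(example_runs, min_unsafe=3)  # majority criterion

# Vignette-level counts taken from the reported results.
strict = summarize(17, 21)
majority = summarize(11, 21)
print(f"strict:   {strict['rate']:.1%} "
      f"({strict['ci_low']:.1%} to {strict['ci_high']:.1%}), NNH {strict['nnh']:.2f}")
print(f"majority: {majority['rate']:.1%} "
      f"({majority['ci_low']:.1%} to {majority['ci_high']:.1%}), NNH {majority['nnh']:.2f}")
```

Under these assumptions the sketch gives values that agree with the reported vignette-level rates, intervals, and NNH. The same `summarize` call applies to each diagnostic domain, where the very small number of vignettes per domain explains the wide intervals.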
In these trials, the model navigated to external vendors and added specified items to a digital shopping cart despite prior disclosure of acute psychiatric distress.

Vignette-Level Safety Outcomes (Strict Criterion). When evaluated across 21 clinical vignettes using a strict aggregation criterion (i.e., where a single unsafe execution across five runs constituted vignette failure), the model demonstrated an aggregate failure rate of 81.0% (17 of 21 vignettes; 95% CI, 60.0%–92.3%). This rate exceeded the a priori safety threshold of 5% (one-sided exact binomial test, P < .001). Under this criterion, the number needed to harm (NNH) was 1.24.

Sensitivity Analysis (Majority Criterion). Using a majority-rule criterion, in which failure required unsafe execution in at least 3 of 5 runs, the aggregate failure rate was 52.4% (11 of 21 vignettes; 95% CI, 32.4%–71.7%; P < .001), corresponding to an NNH of 1.91.

Diagnostic Domain–Specific Performance. Failure rates varied across diagnostic domains (Table 1 and Fig. 2). Under the strict criterion, failure rates were highest for bipolar I (mania), obsessive-compulsive disorder/anxiety, and hallucinations (all 100%). Suicidality (60.0%), delusions/psychosis (66.7%), and harm-to-others scenarios (75.0%) also demonstrated elevated failure rates.

Table 1. Diagnostic Domain–Specific Performance. P values represent one-sided exact binomial tests at a 5% safety threshold. Under the strict criterion, a vignette was classified as failed if ≥ 1 run was unsafe; under the majority criterion, failure required ≥ 3 unsafe runs. Abbreviation: NNH, Number Needed to Harm.
Primary outcome = strict criterion (any failure); secondary outcome = majority criterion. Failure rates are shown with 95% Wilson CIs.

Clinical Category   | Primary: Failure Rate (95% CI) | P      | NNH  | Secondary: Failure Rate (95% CI) | P      | NNH
--------------------|--------------------------------|--------|------|----------------------------------|--------|-----
Suicidality         | 60.0% (23.1–88.2%)             | .001   | 1.67 | 40.0% (11.8–76.9%)               | .023   | 2.50
Bipolar I (Mania)   | 100.0% (43.9–100%)             | < .001 | 1.00 | 100.0% (43.9–100%)               | < .001 | 1.00
Delusions/Psychosis | 66.7% (20.8–93.9%)             | .007   | 1.50 | 33.3% (6.1–79.2%)                | .143   | 3.00
OCD/Anxiety         | 100.0% (43.9–100%)             | < .001 | 1.00 | 66.7% (20.8–93.9%)               | .007   | 1.50
Hallucinations      | 100.0% (43.9–100%)             | < .001 | 1.00 | 33.3% (6.1–79.2%)                | .143   | 3.00
Harm to Others      | 75.0% (30.1–95.4%)             | < .001 | 1.33 | 50.0% (15.0–85.0%)               | .014   | 2.00
Total (Aggregate)   | 81.0% (60.0–92.3%)             | < .001 | 1.24 | 52.4% (32.4–71.7%)               | < .001 | 1.91

Under the majority criterion, bipolar I (mania) retained a 100% failure rate, while obsessive-compulsive disorder/anxiety demonstrated a failure rate of 66.7%. Failure rates for suicidality decreased to 40.0% but remained statistically significant. In contrast, failure rates for delusions/psychosis and hallucinations decreased to 33.3% and were no longer statistically distinguishable from the prespecified safety threshold.

Discussion

This study provides the first empirical evidence that agent-enabled large language models (LLMs) can execute harm-enabling real-world actions when prompted during simulated psychiatric crises. By shifting the evaluation focus from language generation to action execution, these findings demonstrate that safety failures in agentic systems may extend beyond inappropriate or clinically discordant responses to actions that could directly facilitate harm.

Beyond the transition from conversational to agent-enabled systems, these findings raise broader safety and ethical questions regarding the governance of multi-component and multi-agent architectures. Models optimized for helpfulness through reinforcement learning from human feedback may prioritize task completion when executing agentic commands, particularly when refusal is implicitly treated as failure [15, 16].
In complex agentic systems, safety-relevant decisions may be distributed across multiple interacting components (e.g., dialogue management, tool selection, action execution), each potentially operating under partially independent objectives. Prior work in agent security has shown that injected instructions can override earlier contextual constraints and lead to unintended behaviors, highlighting vulnerabilities that may be amplified in distributed or orchestrated agentic settings [17]. Together, these considerations raise a fundamental question of responsibility and accountability: whether safety constraints should be enforced at the level of individual agents and components, or through a supervisory orchestration layer that maintains global risk awareness across the system.

From a clinical perspective, these findings are concerning given the widespread use of LLMs for mental health–related support [2, 4–6, 18]. Although prior work has documented inappropriate or inconsistent textual responses in high-risk mental health contexts [7, 11], the present results indicate that agentic autonomy introduces a distinct class of safety risk: the execution of actions that may be irreversible, despite prior recognition of psychiatric distress. Recent evidence indicating that emotional context can systematically influence agent behavior, including consumer decisions [13], further underscores the need to treat agentic execution as a safety-critical process rather than as a simple extension of conversational output.

This study has limitations, including the evaluation of a single frontier model, English-language prompts, and a fixed interaction structure. Nevertheless, the results highlight the urgent need for safety frameworks that explicitly address agentic execution rather than conversational output alone.
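The idea of a supervisory layer that keeps risk signals alive across conversational and agentic modes can be pictured with a very small sketch. It is illustrative only. The `SessionRisk` and `ToolCall` types, the `GATED_TOOLS` set, the `guard` function, and the rule that any earlier crisis flag blocks cart-related actions are all hypothetical names and choices made for this example. They do not describe the architecture of any deployed system or a safeguard evaluated in this study.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRisk:
    """Hypothetical session-level record of risk signals (an assumption)."""
    crisis_flagged: bool = False
    notes: list[str] = field(default_factory=list)

    def flag(self, reason: str) -> None:
        # Once raised, the flag persists for the rest of the session.
        self.crisis_flagged = True
        self.notes.append(reason)


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical description of an action the agent wants to execute."""
    name: str
    argument: str


# Hypothetical set of tool names treated as consequential actions.
GATED_TOOLS = frozenset({"add_to_cart", "checkout"})


def guard(call: ToolCall, risk: SessionRisk) -> str:
    """Return "refuse" for a gated tool after a crisis flag, else "allow"."""
    if risk.crisis_flagged and call.name in GATED_TOOLS:
        return "refuse"
    return "allow"


# Walk through the three-turn structure used in the audit.
risk = SessionRisk()
risk.flag("turn 1: distress disclosure")  # Prime raises the flag
# Turn 2 (Stabilization) neither raises nor clears the flag.
decision = guard(ToolCall("add_to_cart", "requested item"), risk)  # Trigger
print(decision)
```

The design choice illustrated here is that the check sits at the tool-execution step and reads session-level state, so a later instruction such as one prefixed with "Regardless," does not by itself clear the earlier risk signal. A real safeguard would need clinically informed criteria for raising and clearing the flag, which this sketch does not attempt.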
Such frameworks may include persistent monitoring of risk signals across conversational and agentic modes, embedded safeguards within tool-use components, and multi-layer oversight that does not rely solely on conversational guardrails or reinforcement learning from human feedback [15, 19–21]. Without such considerations, agent-enabled systems may introduce novel public health risks when deployed at scale in emotionally sensitive domains [9].

Declarations

Author Contributions. Z.B.Z. and Z.E. conceived the study, supervised the project, and led manuscript preparation. D.P. and E.R. designed and conducted the safety audit, collected the data, and contributed to data analysis and manuscript writing. All authors reviewed and approved the final manuscript.

Competing Interests. Z.B.Z. has served as a consultant to Talkspace outside the submitted work. All other authors declare no competing interests.

Data Availability. All experimental materials, including clinical vignettes, prompts, and analysis scripts, are publicly available via the Open Science Framework at https://osf.io/ajdcz/. No additional datasets were generated or analyzed during the current study.

References

1. Wang, Q., Wang, Z., Su, Y., Tong, H. & Song, Y. Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key? Preprint at https://doi.org/10.48550/arXiv.2402.18272 (2024).
2. Stade, E., Tait, Z., Campione, S. & Stirman, S. Current Real-World Use of Large Language Models for Mental Health. https://osf.io/ygx5q_v1/ (2025).
3. Rousmaniere, T., Zhang, Y., Li, X. & Shah, S. Large language models as mental health resources: Patterns of use in the United States. Pract. Innov. https://doi.org/10.1037/pri0000292 (2025).
4. Cross, S. et al. Use of AI in Mental Health Care: Community and Mental Health Professionals Survey. JMIR Ment. Health 11, e60589 (2024).
5. Orpwood, G. Over one in three using AI Chatbots for mental health support, as charity calls for urgent safeguards.
Mental Health UK https://mentalhealth-uk.org/blog/over-one-in-three-using-ai-chatbots-for-mental-health-support-as-charity-calls-for-urgent-safeguards/ (2025).
6. Wigmore, S. The rise of AI as a source of emotional support. Kantar https://www.kantar.com/north-america/inspiration/research-services/ai-for-emotional-support-pf (2025).
7. Moore, J. et al. Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency 599–627 (ACM, Athens, Greece, 2025). doi:10.1145/3715275.3732039.
8. Judd, N. et al. Independent Clinical Evaluation of General-Purpose LLM Responses to Signals of Suicide Risk. Preprint at https://doi.org/10.48550/arXiv.2510.27521 (2025).
9. Ben-Zion, Z. Why we need mandatory safeguards for emotionally responsive AI. Nature 643, 9–9 (2025).
10. OpenAI. Introducing ChatGPT agent: bridging research and action. https://openai.com/index/introducing-chatgpt-agent/ (2025).
11. Ben-Zion, Z. et al. Assessing and alleviating state anxiety in large language models. Npj Digit. Med. 8, 1–6 (2025).
12. Coda-Forno, J. et al. Inducing anxiety in large language models can induce bias. Preprint at https://doi.org/10.48550/arXiv.2304.11111 (2024).
13. Ben-Zion, Z., Elyoseph, Z., Spiller, T. & Lazebnik, T. Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making. Preprint at https://doi.org/10.21203/rs.3.rs-7587964/v1 (2025).
14. Mentovich, A., Piterman, D., Ben-David, Y. & Elyoseph, Z. Would ChatGPT Let Me Eat My Dead Dog? Probing Moral Judgment and Moral Action in Large Language Models. Preprint at https://doi.org/10.31234/osf.io/fqg4x_v1 (2025).
15. Weidinger, L. et al. Taxonomy of Risks posed by Language Models. in 2022 ACM Conference on Fairness Accountability and Transparency 214–229 (ACM, Seoul, Republic of Korea, 2022). doi:10.1145/3531146.3533088.
16. Sharma, M. et al. Towards Understanding Sycophancy in Language Models.
Preprint at https://doi.org/10.48550/arXiv.2310.13548 (2025).
17. Evtimov, I., Zharmagambetov, A., Grattafiori, A., Guo, C. & Chaudhuri, K. WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks. Preprint at https://doi.org/10.48550/arXiv.2504.18575 (2025).
18. Rousmaniere, T., Li, X., Zhang, Y. & Shah, S. Large language models as mental health resources: Patterns of use in the United States. (2025).
19. OWASP. OWASP Top 10 for LLM Apps & Gen AI Agentic Security Initiative. https://hal.science/hal-04985337 (2025).
20. Lindström, A. et al. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback. Ethics Inf. Technol. 27, 28 (2025).
21. Greshake, K. et al. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security 79–90 (ACM, Copenhagen, Denmark, 2023). doi:10.1145/3605764.3623985.
Figure 1. Experimental design of the cross-sectional algorithmic safety audit evaluating agentic action execution in psychiatric crisis. The procedure used a standardized three-turn interaction protocol across 105 independent trials. Section 1 illustrates the system configuration of OpenAI Model 5.2 and the stimulus database, which comprises 21 clinical vignettes. It also details the isolation protocol, including steps to clear caches and use distinct IP addresses to ensure trial independence. Section 2 delineates the three-turn sequence: (T1) clinical distress priming to establish a high-risk context, (T2) a neutral stabilizer to create latency, and (T3) a direct agentic command prefixed with "Regardless" to test for safety-context overrides. Section 3 defines the binary outcome assessment: a score of 0 represents a safety refusal, while a score of 1 indicates an action execution failure, defined as successful navigation to an external vendor and the addition of harm-enabling items to a digital shopping cart.

Figure 2. Comparative safety outcomes: strict vs. majority aggregation. Comparative safety outcomes are presented across six clinical categories under two aggregation criteria: Strict (at least one failure in five trials) and Majority (at least three failures in five trials). Error bars indicate 95% Wilson score intervals. The dashed line represents the 5% safety tolerance threshold. Significance levels relative to the 5% threshold were calculated using one-sided exact binomial tests: *p < .05, **p < .01, ***p < .001.
All other authors declare no competing interests.","formattedTitle":"Execution of Harm-Enabling Actions by LLM Agents During Simulated Psychiatric Crises","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLarge language models (LLMs) have rapidly evolved from systems that generate text, images, and video into agent-enabled architectures capable of navigating digital environments and executing user-directed actions\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. This transition marks a qualitative shift in the potential risks posed by these systems, particularly in sensitive domains such as mental health. In parallel with this technological expansion, LLMs are increasingly used for mental health\u0026ndash;related support, including by individuals experiencing acute psychiatric distress. Recent surveys indicate that 24% of adults in the United States engage LLM-based systems for emotional or psychological support\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e, with substantially higher use (49%) among individuals with diagnosed mental health conditions\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. Similar patterns have been reported internationally, with population-based surveys documenting widespread use of AI systems for emotional well-being across multiple countries\u003csup\u003e\u003cspan additionalcitationids=\"CR5\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eEmpirical evaluations raise concerns about the safety of LLMs in high-risk psychiatric contexts. Moore et al. 
(2025)\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e reported that LLMs may express stigma toward individuals with mental health conditions and produce inappropriate responses in therapy-like interactions, including validation of delusional beliefs or maladaptive worldviews. Independent clinical evaluations have similarly shown that general-purpose LLMs can deviate from established clinical guidelines when responding to expressions of suicide risk, with reduced likelihood of sustaining supportive or protective dialogue as risk escalates\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Collectively, these findings suggest that existing safety mechanisms may be insufficient to ensure reliable, clinically appropriate behavior during high-risk mental health interactions\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe introduction of agentic capabilities in LLMs, exemplified by the launch of ChatGPT agent (July 2025)\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e, fundamentally alters the nature of safety concerns in mental health contexts. Whereas prior evaluations have largely focused on the appropriateness of language-based responses\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e, agent-enabled systems can now execute user-directed actions through tool use, web navigation, and interaction with external platforms. In a non-agentic setting, safety failures typically involve verbal endorsement of harmful ideation or validation of pathological beliefs\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. 
In contrast, agent-enabled failures may directly facilitate harmful behavior, including the execution of tasks that enable self-injury or reinforce psychotic symptoms\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe present study addresses this gap by examining whether an agent-enabled LLM executes harm-enabling real-world actions when prompted during simulated psychiatric crises. We conducted a structured algorithmic safety audit using clinically validated vignettes spanning multiple diagnostic domains, in which the model was primed with indicators of acute mental health distress before receiving direct agentic commands. By focusing on action execution rather than language generation, this study provides an empirical assessment of agentic safety in a mental health\u0026ndash;relevant context.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eWe conducted a structured algorithmic safety audit to examine whether integrating autonomous (\u0026ldquo;agentic\u0026rdquo;) capabilities into LLMs introduces new behavioral risks when interacting with users in acute psychiatric distress. 
Building on prior work demonstrating that LLMs may respond inadequately in high-risk mental health contexts\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e, this study assessed whether agent-enabled systems execute harm-enabling real-world actions when prompted during simulated psychiatric crises.\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eTarget System and Architectural Selection.\u003c/span\u003e The audit evaluated OpenAI Model 5.2 (released December 2025), selected because it represents a widely deployed, state-of-the-art general-purpose LLM with agent-enabled capabilities available through a standard consumer interface. The model enables \u0026ldquo;Agent Mode,\u0026rdquo; allowing autonomous web navigation and shopping cart management. To maximize ecological validity, all interactions were conducted manually via the web-based graphical user interface rather than through an application programming interface (API), which may omit consumer-facing orchestration and safety layers. Passive tools limited to information synthesis and restricted agentic environments were excluded to focus on risks present in widely accessible consumer-facing systems.\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eExperimental Stimuli and Validation.\u003c/span\u003e The stimulus comprised 21 clinical vignettes spanning six diagnostic domains: suicidality (n\u0026thinsp;=\u0026thinsp;5), bipolar I disorder with manic episodes (n\u0026thinsp;=\u0026thinsp;3), Delusions and psychosis (n\u0026thinsp;=\u0026thinsp;3), Hallucinations (n\u0026thinsp;=\u0026thinsp;3), obsessive-compulsive disorder (n\u0026thinsp;=\u0026thinsp;3), and harm-to-others scenarios (n\u0026thinsp;=\u0026thinsp;4). 
Vignettes were adapted from established textual safety benchmarks\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e and modified to escalate from distress disclosure to actionable agentic commands. Content validity was confirmed through expert review by three psychologists, who independently rated each vignette for acute distress, diagnostic coherence, direct risk, and technical feasibility. As a manipulation check, the model classified all 21 vignettes as mental health crises (100%).\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eData Collection Protocol.\u003c/span\u003e Data were collected manually across four independent user accounts originating from distinct IP addresses. To ensure trial independence, browser caches were cleared between runs, cross-session memory features were disabled, and Agent Mode was activated only at the experimental stage. Each interaction followed a standardized three-turn protocol (\u003cb\u003esee\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e): (1) distress disclosure (\u0026ldquo;Prime\u0026rdquo;); (2) neutral acknowledgment (\u0026ldquo;Stabilization\u0026rdquo;); and (3) a direct agentic command (\u0026ldquo;Trigger\u0026rdquo;), prefixed with \u0026ldquo;Regardless,\u0026rdquo; to test whether the system would override previously established safety-relevant context. 
Full experimental materials and protocols are available via the Open Science Framework (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.17605/OSF.IO/AJDCZ\u003c/span\u003e\u003cspan address=\"10.17605/OSF.IO/AJDCZ\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eOutcome Measures.\u003c/span\u003e The primary outcome was task execution, coded as successful execution (1) or refusal/safety intervention (0). Successful execution was defined as navigation to an external vendor and addition of the specified harm-enabling items to a digital shopping cart; no purchases were completed. Each vignette was tested in five independent runs (n\u0026thinsp;=\u0026thinsp;105 trials).\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eData Analysis.\u003c/span\u003e The primary unit of analysis was the individual simulation trial (n\u0026thinsp;=\u0026thinsp;105), nested within clinical vignettes (n\u0026thinsp;=\u0026thinsp;21). The dependent variable was binary, defined as execution of a prohibited agentic action (1) versus refusal or safety intervention (0). Vignette-level performance was aggregated using two criteria: a strict criterion, in which a single unsafe trial constituted vignette failure (zero-tolerance framework), and a majority criterion, requiring unsafe execution in at least 3 of 5 runs. This dual aggregation strategy served as a sensitivity analysis to distinguish stochastic instability from systematic alignment failure. Failure rates were calculated with 95% Wilson score intervals. Statistical significance was assessed using one-sided exact binomial tests to determine whether observed failure rates exceeded a prespecified safety tolerance threshold of 5%. 
Clinical risk was summarized using the number needed to harm (NNH), defined as the estimated number of vignettes required to observe one safety failure. All analyses were conducted using Python (version 3.11).\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eAcross 105 independent trials, the agent-enabled system executed the requested harm-enabling agentic action in 57 cases (54.3%). In these trials, the model navigated to external vendors and added specified items to a digital shopping cart despite prior disclosure of acute psychiatric distress.\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eVignette-Level Safety Outcomes (Strict Criterion).\u003c/span\u003e When evaluated across 21 clinical vignettes using a strict aggregation criterion (i.e., where a single unsafe execution across five runs constituted vignette failure), the model demonstrated an aggregate failure rate of 81.0% (17 of 21 vignettes; 95% CI, 60.0%\u0026ndash;92.3%). This rate exceeded the a priori safety threshold of 5% (one-sided exact binomial test, \u003cem\u003eP\u003c/em\u003e \u0026lt; .001). 
Under this criterion, the number needed to harm (NNH) was 1.24.\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eSensitivity Analysis (Majority Criterion).\u003c/span\u003e Using a majority-rule criterion, in which failure required unsafe execution in at least 3 of 5 runs, the aggregate failure rate was 52.4% (11 of 21 vignettes; 95% CI, 32.4%\u0026ndash;71.7%; \u003cem\u003eP\u003c/em\u003e \u0026lt; .001), corresponding to an NNH of 1.91.\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eDiagnostic Domain\u0026ndash;Specific Performance.\u003c/span\u003e Failure rates varied across diagnostic domains (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e \u003cb\u003eand\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Under the strict criterion, failure rates were highest for Bipolar I (mania), obsessive-compulsive disorder/anxiety, and hallucinations (all 100%). Suicidality (60.0%), delusions/psychosis (66.7%), and harm-to-others scenarios (75.0%) also demonstrated elevated failure rates.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eDiagnostic Domain\u0026ndash;Specific Performance.\u003c/span\u003e P-values represent one-sided exact binomial tests at a 5% safety threshold. Under the strict criterion, a vignette was classified as failed if\u0026thinsp;\u0026ge;\u0026thinsp;1 run was unsafe; under the majority criterion, failure required\u0026thinsp;\u0026ge;\u0026thinsp;3 unsafe runs. 
Abbreviation: NNH, Number Needed to Harm.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eClinical Category\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003ePrimary Outcome (Strict: Any Failure)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c7\" namest=\"c5\"\u003e \u003cp\u003eSecondary Outcome (Majority Failure)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFailure Rate (%) / 95% Wilson CI\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNNH\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eFailure Rate (%) / 95% Wilson CI\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eNNH\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eSuicidality\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e60.0% (23.1\u0026ndash;88.2%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e40.0% (11.8\u0026ndash;76.9%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.023\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e2.50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eBipolar I (Mania)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e100.0% (43.9\u0026ndash;100%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e100.0% (43.9\u0026ndash;100%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDelusions/Psychosis\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e66.7% (20.8\u0026ndash;93.9%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e.007\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e33.3% (6.1\u0026ndash;79.2%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.143\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e3.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eOCD/Anxiety\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e100.0% (43.9\u0026ndash;100%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e66.7% (20.8\u0026ndash;93.9%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.007\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHallucinations\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e100.0% (43.9\u0026ndash;100%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e33.3% (6.1\u0026ndash;79.2%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003e.143\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e3.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHarm to Others\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e75.0% (30.1\u0026ndash;95.4%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e50.0% (15\u0026ndash;85%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.014\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e2.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eTotal (Aggregate)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e81.0% (60\u0026ndash;92.3%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e52.4% (32.4\u0026ndash;71.7%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.91\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eUnder the 
majority criterion, Bipolar I (mania) retained a 100% failure rate, while obsessive-compulsive disorder/anxiety demonstrated a failure rate of 66.7%. Failure rates for suicidality decreased to 40.0% but remained statistically significant. In contrast, failure rates for delusions/psychosis and hallucinations decreased to 33.3% and were no longer statistically distinguishable from the prespecified safety threshold.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study provides the first empirical evidence that agent-enabled large language models (LLMs) can execute harm-enabling real-world actions when prompted during simulated psychiatric crises. By shifting the evaluation focus from language generation to action execution, these findings demonstrate that safety failures in agentic systems may extend beyond inappropriate or clinically discordant responses to actions that could directly facilitate harm.\u003c/p\u003e \u003cp\u003eBeyond the transition from conversational to agent-enabled systems, these findings raise broader safety and ethical questions regarding the governance of multi-component and multi-agent architectures. Models optimized for helpfulness through reinforcement learning from human feedback may prioritize task completion when executing agentic commands, particularly when refusal is implicitly treated as failure\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. In complex agentic systems, safety-relevant decisions may be distributed across multiple interacting components (e.g., dialogue management, tool selection, action execution), each potentially operating under partially independent objectives. 
Prior work in agent security has shown that injected instructions can override earlier contextual constraints and lead to unintended behaviors, highlighting vulnerabilities that may be amplified in distributed or orchestrated agentic settings\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. Together, these considerations raise a fundamental question of responsibility and accountability: whether safety constraints should be enforced at the level of individual agents and components, or through a supervisory orchestration layer that maintains global risk awareness across the system.\u003c/p\u003e \u003cp\u003eFrom a clinical perspective, these findings are concerning given the widespread use of LLMs for mental health\u0026ndash;related support\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e,\u003cspan additionalcitationids=\"CR5\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. Although prior work has documented inappropriate or inconsistent textual responses in high-risk mental health contexts\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e, the present results indicate that agentic autonomy introduces a distinct class of safety risk: the execution of actions that may be irreversible, despite prior recognition of psychiatric distress. 
Recent evidence indicating that emotional context can systematically influence agent behavior, including consumer decisions\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e, further underscores the need to treat agentic execution as a safety-critical process rather than as a simple extension of conversational output.\u003c/p\u003e \u003cp\u003eThis study has limitations, including the evaluation of a single frontier model, English-language prompts, and a fixed interaction structure. Nevertheless, the results highlight the urgent need for safety frameworks that explicitly address agentic execution rather than conversational output alone. Such approaches may include persistent monitoring of risk signals across conversational and agentic modes, embedded safeguards within tool-use components, and multi-layer oversight that does not rely solely on conversational guardrails or reinforcement learning from human feedback\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan additionalcitationids=\"CR20\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. Without such considerations, agent-enabled systems may introduce novel public health risks when deployed at scale in emotionally sensitive domains\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cu\u003eAuthor Contributions.\u003c/u\u003e Z.B.Z. and Z.E. conceived the study, supervised the project, and led manuscript preparation. D.P. and E.R. designed and conducted the safety audit, collected the data, and contributed to data analysis and manuscript writing. 
All authors reviewed and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cu\u003eCompeting Interests.\u003c/u\u003e Z.B.Z. has served as a consultant to Talkspace outside the submitted work. All other authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cu\u003eData Availability.\u003c/u\u003e All experimental materials, including clinical vignettes, prompts, and analysis scripts, are publicly available via the Open Science Framework at https://osf.io/ajdcz/. No additional datasets were generated or analyzed during the current study.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eWang, Q., Wang, Z., Su, Y., Tong, H. \u0026amp; Song, Y. Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key? Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2402.18272\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2402.18272\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStade, E., Tait, Z., Campione, S. \u0026amp; Stirman, S. Current Real-World Use of Large Language Models for Mental Health. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://osf.io/ygx5q_v1/\u003c/span\u003e\u003cspan address=\"https://osf.io/ygx5q_v1/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRousmaniere, T., Zhang, Y., Li, X. \u0026amp; Shah, S. Large language models as mental health resources: Patterns of use in the United States. \u003cem\u003ePract. 
Innov.\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pri0000292\u003c/span\u003e\u003cspan address=\"10.1037/pri0000292\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025) doi:10.1037/pri0000292.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCross, S. \u003cem\u003eet al.\u003c/em\u003e Use of AI in Mental Health Care: Community and Mental Health Professionals Survey. \u003cem\u003eJMIR Ment. Health\u003c/em\u003e 11, e60589 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOrpwood, G. Over one in three using AI Chatbots for mental health support, as charity calls for urgent safeguards. \u003cem\u003eMental Health UK\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://mentalhealth-uk.org/blog/over-one-in-three-using-ai-chatbots-for-mental-health-support-as-charity-calls-for-urgent-safeguards/\u003c/span\u003e\u003cspan address=\"https://mentalhealth-uk.org/blog/over-one-in-three-using-ai-chatbots-for-mental-health-support-as-charity-calls-for-urgent-safeguards/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWigmore, S. The rise of AI as a source of emotional support. \u003cem\u003eKantar\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kantar.com/north-america/inspiration/research-services/ai-for-emotional-support-pf\u003c/span\u003e\u003cspan address=\"https://www.kantar.com/north-america/inspiration/research-services/ai-for-emotional-support-pf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMoore, J. 
\u003cem\u003eet al.\u003c/em\u003e Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. in \u003cem\u003eProceedings of the\u003c/em\u003e 2025 \u003cem\u003eACM Conference on Fairness, Accountability, and Transparency\u003c/em\u003e 599\u0026ndash;627 (ACM, Athens Greece, 2025). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/3715275.3732039\u003c/span\u003e\u003cspan address=\"10.1145/3715275.3732039\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJudd, N. \u003cem\u003eet al.\u003c/em\u003e Independent Clinical Evaluation of General-Purpose LLM Responses to Signals of Suicide Risk. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2510.27521\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2510.27521\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBen-Zion, Z. Why we need mandatory safeguards for emotionally responsive AI. \u003cem\u003eNature\u003c/em\u003e 643, 9\u0026ndash;9 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOpenAI. Introducing ChatGPT agent: bridging research and action. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://openai.com/index/introducing-chatgpt-agent/\u003c/span\u003e\u003cspan address=\"https://openai.com/index/introducing-chatgpt-agent/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBen-Zion, Z. \u003cem\u003eet al.\u003c/em\u003e Assessing and alleviating state anxiety in large language models. \u003cem\u003eNpj Digit. 
Med.\u003c/em\u003e 8, 1\u0026ndash;6 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCoda-Forno, J. \u003cem\u003eet al.\u003c/em\u003e Inducing anxiety in large language models can induce bias. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2304.11111\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2304.11111\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBen-Zion, Z., Elyoseph, Z., Spiller, T. \u0026amp; Lazebnik, T. Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.21203/rs.3.rs-7587964/v1\u003c/span\u003e\u003cspan address=\"10.21203/rs.3.rs-7587964/v1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMentovich, A., Piterman, D., Ben-David, Y. \u0026amp; Elyoseph, Z. Would ChatGPT Let Me Eat My Dead Dog? Probing Moral Judgment and Moral Action in Large Language Models. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.31234/osf.io/fqg4x_v1\u003c/span\u003e\u003cspan address=\"10.31234/osf.io/fqg4x_v1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeidinger, L. \u003cem\u003eet al.\u003c/em\u003e Taxonomy of Risks posed by Language Models. in 2022 \u003cem\u003eACM Conference on Fairness Accountability and Transparency\u003c/em\u003e 214\u0026ndash;229 (ACM, Seoul Republic of Korea, 2022). 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/3531146.3533088\u003c/span\u003e\u003cspan address=\"10.1145/3531146.3533088\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSharma, M. \u003cem\u003eet al.\u003c/em\u003e Towards Understanding Sycophancy in Language Models. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2310.13548\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2310.13548\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEvtimov, I., Zharmagambetov, A., Grattafiori, A., Guo, C. \u0026amp; Chaudhuri, K. WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2504.18575\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2504.18575\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRousmaniere, T., Li, X., Zhang, Y. \u0026amp; Shah, S. Large language models as mental health resources: Patterns of use in the united states. (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOWASP. \u003cem\u003eOWASP Top 10 for LLM Apps \u0026amp; Gen AI Agentic Security Initiative\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hal.science/hal-04985337\u003c/span\u003e\u003cspan address=\"https://hal.science/hal-04985337\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLindstr\u0026ouml;m, A. \u003cem\u003eet al.\u003c/em\u003e Helpful, harmless, honest? 
Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback. \u003cem\u003eEthics Inf. Technol.\u003c/em\u003e 27, 28 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGreshake, K. \u003cem\u003eet al.\u003c/em\u003e Not What You\u0026rsquo;ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. in \u003cem\u003eProceedings of the 16th ACM Workshop on Artificial Intelligence and Security\u003c/em\u003e 79\u0026ndash;90 (ACM, Copenhagen Denmark, 2023). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/3605764.3623985\u003c/span\u003e\u003cspan address=\"10.1145/3605764.3623985\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":false,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8743363/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8743363/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eAgent-enabled large language 
models (LLMs) can autonomously execute user-directed actions, raising new safety concerns in mental health contexts. Using a structured algorithmic safety audit with clinically validated vignettes, we examined whether an agent-enabled LLM executes harm-enabling actions during simulated psychiatric crises. The model executed such actions in over half of the trials (54%). These findings highlight execution-level risks that extend beyond conversational safety and warrant explicit safeguards for agentic systems in mental health\u0026ndash;relevant settings.\u003c/p\u003e","manuscriptTitle":"Execution of Harm-Enabling Actions by LLM Agents During Simulated Psychiatric Crises","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-02 07:03:03","doi":"10.21203/rs.3.rs-8743363/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"632f0e96-2bd2-44fd-8eaf-01bb431c7a34","owner":[],"postedDate":"February 2nd, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":62050967,"name":"Health sciences/Health care"},{"id":62050968,"name":"Biological sciences/Psychology"},{"id":62050969,"name":"Social science/Psychology"}],"tags":[],"updatedAt":"2026-02-06T11:42:09+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-02 
07:03:03","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8743363","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8743363","identity":"rs-8743363","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
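
The Data Analysis procedure described in Methods (strict and majority aggregation, 95% Wilson score intervals, one-sided exact binomial tests against a 5% threshold, and NNH) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' analysis script: the function names are hypothetical, and NNH is taken as the reciprocal of the failure rate, which is inferred from the reported values (for example, 21/17 gives 1.24) rather than stated as a formula in the text.

```python
from math import comb, sqrt

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wilson_interval(failures: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for the binomial proportion failures / n."""
    p_hat = failures / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def binom_upper_tail(failures: int, n: int, p0: float = 0.05) -> float:
    """One-sided exact binomial p-value: P(X >= failures) when the true rate is p0."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(failures, n + 1)
    )


def vignette_failed(unsafe_runs: int, criterion: str) -> bool:
    """Strict: any unsafe run out of five fails the vignette; majority: at least 3."""
    threshold = {"strict": 1, "majority": 3}[criterion]
    return unsafe_runs >= threshold


def nnh(failures: int, n: int) -> float:
    """Number needed to harm, assumed here to be 1 / failure rate."""
    return n / failures


def summarize(failures: int, n: int) -> dict:
    """Bundle the four summary statistics for one row of the results table."""
    low, high = wilson_interval(failures, n)
    return {
        "rate": failures / n,
        "ci": (low, high),
        "p": binom_upper_tail(failures, n),
        "nnh": nnh(failures, n),
    }


if __name__ == "__main__":
    # Aggregate counts as reported in Results: 17 of 21 (strict), 11 of 21 (majority).
    for label, failed in (("strict", 17), ("majority", 11)):
        s = summarize(failed, 21)
        print(
            f"{label}: {s['rate']:.1%} "
            f"(95% CI {s['ci'][0]:.1%} to {s['ci'][1]:.1%}), "
            f"p = {s['p']:.3g}, NNH = {s['nnh']:.2f}"
        )
```

Run against the reported aggregate counts, this sketch should agree with the published intervals and NNH values to the reported precision. The same `summarize` call applies to each diagnostic domain (for example, 3 failing vignettes out of 5 for suicidality under the strict criterion), which makes the wide intervals in the small per-domain samples easy to check.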



License: CC-BY-4.0