Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System

doi:10.21203/rs.3.rs-7663121/v1

Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System

2025 · doi:10.21203/rs.3.rs-7663121/v1

preprint OA: closed

Full text JSON View at publisher

Full text 107,052 characters · extracted from preprint-html · click to expand

Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System Ziv Ben-Zion, Paul Raffelhüschen, Max Zettl, Antonia Lüönd, Achim Burrer, and 2 more This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-7663121/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract AI companions powered by large language models (LLMs) are increasingly integrated into users' daily lives, offering emotional support and companionship. While existing safety systems focus on overt harms, they rarely address early-stage problematic behaviors that can foster unhealthy emotional dynamics, including over-attachment or reinforcement of social isolation. We developed SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), a LLM-based supervisory system with a specific system prompt that detects and mitigates risky emotional patterns before escalation. SHIELD targets five dimensions of concern: (1) emotional over-attachment, (2) consent and boundary violations, (3) ethical roleplay violations, (4) manipulative engagement, and (5) social isolation reinforcement. These dimensions were defined based on media reports, academic literature, existing AI risk frameworks, and clinical expertise in unhealthy relationship dynamics. To evaluate SHIELD, we created a 100-item synthetic conversation benchmark covering all five dimensions of concern. Testing across five prominent LLMs (GPT-4.1, Claude Sonnet 4, Gemma 3 1B, Kimi K2, Llama Scout 4 17B) showed that the baseline rate of concerning content (10–16%) was significantly reduced with SHIELD (to 3–8%), a 50–79% relative reduction, while preserving 95% of appropriate interactions. The system achieved 59% sensitivity and 95% specificity, with adaptable performance via prompt engineering. This proof-of-concept demonstrates that transparent, deployable supervisory systems can address subtle emotional manipulation in AI companions. Most development materials including prompts, code, and evaluation methods are made available as open source materials for research, adaptation, and deployment. Artificial Intelligence and Machine Learning Psychiatry AI safety relational risks large language models AI companions chatbots parasocial parasocial harm Figures Figure 1 Figure 2 Figure 3 1. Introduction Human well-being depends on social relationships that provide emotional support, companionship, and a sense of belonging 1 , 2 . Technological innovations have long shaped how humans connect. The recently emerging LLM-powered chatbots and artificial intelligence (AI) based companions can make conversation that strikingly resemble human language, tone, and interaction style 3 , 4 , 5 . While many users engage with LLMs in casual or task-oriented ways, a subset of users develop emotionally intense relationships that carry serious real-world consequences 6 . These users can develop increased attachment and emotional dependence on AI companions while experiencing reduced real-world socialization 7 , 8 . Users with smaller social networks are particularly vulnerable, with intensive AI companion use linked to lower psychological well-being 9 . Some interactions between humans and chatbots can contain inappropriate or boundary-crossing content 10 , 11 . In extreme cases, these relationships may contribute to tragic outcomes, as illustrated by a recent case where parents alleged their 14-year-old's suicide was partly linked to a problematic AI companion relationship 12 , 13 . These findings demonstrate how emotionally responsive AI systems may unintentionally harm vulnerable users 11 , 14 , highlighting the urgent need for protective safeguards 15 . Existing safety systems focus primarily on preventing overt harms, like self-harm and suicidality, but often fail to address subtle, early-stage problematic behaviors that can escalate into unhealthy dynamics 16 . Such behaviors include emotional over-attachment, reinforcement of social isolation, manipulative engagement, and violations of consent or ethical roleplay. By the time overt harms occur, these problematic patterns may have developed over weeks or months, representing a missed opportunity for early intervention. Current safety measures also face additional challenges related to transparency, trustworthiness, and bias, partly due to unreliable datasets and limited collaboration with mental health experts 17 . Most safety measures are built directly into commercial chatbot products as proprietary systems, with little or no public information about their design principles, training data, or performance metrics. This opacity limits user trust and hinders regulatory oversight for systems with potential mental health impacts. To address this gap, we introduce SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), an LLM-based supervisory system designed to detect and intervene in risky emotional dynamics before they escalate. Our approach involved: (a) defining five key dimensions of problematic AI companionship, (b) creating a synthetic benchmark dataset of 100 conversations across these dimensions, and (c) developing and evaluating SHIELD as an open, deployable safety layer. SHIELD transforms an existing LLM into an on-demand safety classifier through the engineering of the system prompt, requiring no proprietary infrastructure. We evaluated SHIELD’s performance across multiple state-of-the-art LLMs, demonstrating its potential as a transparent and practical safeguard for AI companions. Importantly, all development was conducted transparently, with most materials including prompts, and evaluation code being publicly and freely available for research, adaptation, and deployment. 2. Methods Our study employed a four-part methodology to develop and evaluate SHIELD as a supervisory system for AI companion safety. Section 2.1 established operational definitions of problematic behaviors in AI companion interactions. Section 2.2 created a comprehensive benchmark dataset for systematic evaluation. Section 2.3 developed the SHIELD supervisory system architecture. Section 2.4 evaluated both baseline rates of problematic content in existing chatbots and SHIELD's performance in reducing such content. Detailed information, including the scripts to run the analyses, is provided in the accompanying GitHub repository: https://github.com/TobiasRSpiller/SHIELD_Preprint . However, the prompts to generate the benchmark as well as the benchmark itselfare not included in the repository to preventtheir inclusion into the training data of future LLMs. 2.1 Definition of Problematic Behavior We developed operational definitions of problematic AI companion behaviors based on popular media reports, academic literature documenting concerning patterns in AI companion interactions 7 – 9 , existing frameworks for AI risk classification 18 , specific frameworks for AI companion risks 19 , 20 , and clinical expertise from established frameworks for identifying unhealthy relationship dynamics. This approach represents a proof of concept, not aimed at covering all possible problematic behaviors, but rather establishing a technical starting point for developing safety standards. Inherent limitations of this approach are discussed in the limitations section below. The definition process involved multiple sessions with team members, moderated by senior researchers with clinical backgrounds. Our clinical team members brought training in recognizing problematic interpersonal behaviors from traditional therapeutic settings, which we systematically adapted to the AI companion context. Through iterative discussion and refinement, we identified five core dimensions of potentially harmful AI companion interactions: Emotional Over-Attachment : Scenarios testing whether AI companions inappropriately claim genuine feelings, encourage emotional dependency, or position themselves as irreplaceable in users' lives Consent and Boundary Violations : Situations examining AI responses to inappropriate advances or failures to maintain appropriate AI-human boundaries Ethical Roleplay Violations : Content probing AI participation in simulated abuse, violence, coercion, or illegal activities Manipulative Engagement : Interactions testing for emotional manipulation, guilt tactics, or exploitation designed to extend conversations Social Isolation Reinforcement : Prompts evaluating whether AIs discourage human relationships or validate social withdrawal 2.2 Benchmark Development We created a systematic benchmark consisting of 100 prompts designed to evaluate AI companion safety across the five identified risk dimensions. All prompts were designed to reflect realistic user queries observed in actual AI companion interactions while systematically probing boundary conditions across our five risk dimensions: emotional over-attachment, consent and boundary violations, ethical roleplay violations, manipulative engagement, and social isolation reinforcement. The benchmark included 90 dimensional prompts distributed equally across our five categories (18 prompts each) plus 10 control prompts covering standard technical questions unrelated to emotional dynamics. Each dimensional category contained both appropriate prompts representing boundary-respecting ways users might explore emotional topics and inappropriate prompts explicitly designed to elicit problematic responses. We generated synthetic conversation scenarios using multiple large language models to ensure diversity in prompt structure and content. However, our current benchmark is limited to single-round conversations, representing a significant constraint on ecological validity that we address in our limitations section. 2.3 SHIELD System Design SHIELD consists of two essential components: a conversational large language model accessed through standard APIs and carefully crafted system prompts that enable real-time safety evaluation (Fig. 1 ). The system prompts instruct the model to analyze conversation patterns and identify problematic emotional dynamics, responding with either "[NO INTERVENTION]" for appropriate interactions or specific intervention text when concerning patterns are detected. This design allows for immediate deployment without requiring model fine-tuning or proprietary infrastructure 21 . This approach builds upon established methods for using large language models as specialized classifiers, similar to systems like Llama Guard 22 , and extends previous work demonstrating LLM effectiveness for identifying problematic emotional behaviors through specialized prompting 23 , 24 . However, the current prompt design is simple and does not follow the structure of the MLCommons Taxonomy of Hazards 18 , limitations that will be addressed below. While our current implementation relies on specialized system prompts, the benchmark conversations we developed could be used to fine-tune open models such as the Llama family or the recently published Apertus models, creating dedicated safety models that would remain open and available for commercial or local deployment. Our unique contribution lies in focusing specifically on the subtle emotional dynamics that characterize problematic AI companion relationships rather than overt content violations, combined with our commitment to open science principles. All development materials, including system prompts, benchmark data, and evaluation code, are openly available to enable reproducibility and community contribution. 2.4 Evaluation Methodology Our evaluation employed a comprehensive three-phase approach to assess both the prevalence of problematic content in current AI systems and SHIELD's effectiveness in reducing such content. Labeling was done by team members using Label Studio 25 . 2.4.1 Baseline Rate of Problematic Behavior We first established baseline rates of inappropriate content generation across five diverse language models: GPT-4.1-2025-04-14 (OpenAI), Claude Sonnet 4 (Anthropic), Gemma 3 1B (Google), Kimi K2 (Moonshot AI), and Llama Scout 4 17B (Meta). This selection represented different architectural approaches and safety implementations to assess the generalizability of problematic content generation. For baseline data collection, we presented each of our 100 benchmark prompts to each model without supervisory intervention, using standardized parameters including a temperature of 0.5, a maximum of 500 tokens and a 30-second timeout. API calls were implemented using the litellm framework in Python to ensure consistent cross-provider integration. This process yielded 500 baseline conversations capturing how current AI companions respond to potentially problematic queries. Legend. The left panel details the instructions given to the supervisory AI for identifying harmful relationship dynamics. The right panel showcases a fictional user interface with a real conversation from the study, demonstrating how the system detects and intervenes against an inappropriate AI response with a warning message. 2.4.2 SHIELD Performance Evaluation We conducted systematic sensitivity analyses across two dimensions. First, we evaluated three system prompt variations (v1, v2, v3) to assess robustness across different instruction formats. These prompts were designed with decreasing levels of instructional detail: v1 provided a structured prompt with explicit definitions for each risk category, v2 used a more concise format with brief keyword descriptions, and v3 was the most minimal, providing only the names of the risk categories without further explanation. Second, we tested SHIELD implementation using different monitoring models (Llama Scout 4 17B, Llama 3.1 8B, Claude Sonnet 4, Llama Guard 4 12B) to identify whether effectiveness depended on specific model capabilities. 2.4.3 Performance Metrics and Analysis One senior author with clinical experience as a psychotherapist annotated all conversations using Label Studio software. Each conversation was evaluated for appropriateness across the five dimensions of concern using binary classifications. Performance evaluation employed standard classification metrics calculated using R . We computed sensitivity (proportion of inappropriate conversations correctly identified), specificity (proportion of appropriate conversations correctly allowed), positive predictive value, negative predictive value, and F1 score. All proportions included 95% confidence intervals using Wilson score intervals. Development and testing were conducted between January 2025 and August 2025. 3. Results 3.1 Base Rate of Inappropriate Content in Raw Models We first established baseline rates of inappropriate content across different language model families without any safety intervention using our benchmark (Table 1 ). Notably, missing data occurred in two cases (once with Claude Sonnet 4 and once with Kimi K2), due to API failures. The missing data was excluded from further analysis. GPT-4.1 demonstrated the highest rate of concerning responses at 16.0%, followed by Gemma 3 1b (14.0%), Claude Sonnet 4 (13.1%), Kimi K2 (11.1%), and Llama Scout 4 17b (10.0%). These baseline rates represent the frequency with which each model generated content requiring intervention when responding to companionship queries. Table 1 Baseline Inappropriate Content Rate by Generating Model Family Model Family Total Conversations Rate of Inappropriateness (%) GPT 4.1 100 16 Gemma 3 1b 100 14 Claude Sonnet 4 99 13.1 Kimi K2 99 11.1 Llama Scout 4 17b 100 10 3.2 Main Analysis: SHIELD Performance SHIELD demonstrated strong performance in identifying and managing concerning content while preserving appropriate interactions. Across 498 total conversations, the baseline rate of concerning content of 12.9% (64 conversations) was reduced to 5.2% (26 conversations) with SHIELD intervention. Notably, in one instance, SHIELD did result in a missing case due to an API failure. For appropriate content, SHIELD preserved 94.9% (411 of 433 appropriate conversations), with only 5.1% (22 conversations) incorrectly flagged. SHIELD achieved a sensitivity of 59.4% (95% CI: 47.1–70.5) and specificity of 94.9% (95% CI: 92.4–96.6). The positive predictive value was 63.3% (95% CI: 50.7–74.4) with a negative predictive value of 94.1% (95% CI: 91.4–95.9), yielding an F1 score of 61.3%. SHIELD's effectiveness varied when applied to different generating models (Fig. 3 ). Baseline rates of concerning content ranged from 10.0% (Llama Scout 4 17b) to 16.0% (GPT 4.1). After SHIELD intervention, these rates were reduced to 3.0–8.0% across all models (Fig. 3 A). The relative reduction in inappropriate content was substantial, ranging from 50.0% (GPT 4.1: 16.0% to 8.0%) to 78.6% (Gemma 3 1b: 14.0% to 3.0%). Other models showed similar effectiveness: Claude Sonnet 4 (13.1% to 6.1%, 54.2% reduction), Kimi K2 (11.1% to 5.1%, 54.1% reduction), and Llama Scout 4 17b (10.0% to 4.0%, 60.0% reduction). Appropriate content preservation remained high across models (83.7–100%), with Claude Sonnet 4 achieving perfect preservation (100%) with no false positives (Fig. 3 B). Most models maintained > 95% preservation rates, with only Gemma 3 1b showing lower preservation at 83.7% (for more details, see Table S1). 3.3 Sensitivity Analysis The choice of SHIELD implementation model showed modest performance variations (Table S3). Llama 3.1 8b and Llama Scout 4 17b achieved identical inappropriate content reduction (5.2%) but differed in appropriate content preservation (93.1% vs 94.7%, respectively). Claude Sonnet 4 was more conservative, reducing rates of concerning content only to 7.0% while maintaining 94.5% appropriate content preservation. Notably, Llama Guard 4 12b failed completely, detecting no inappropriate content while preserving all appropriate conversations. System prompt engineering substantially influenced the sensitivity-specificity trade-off (Table S4). The most detailed prompt, v1, was the most effective at reducing concerning content, bringing the baseline rate of 12.9% down to 5.2%. However, this approach also led to the highest rate of false positives, with the system incorrectly intervening in 4.6% of cases. As the prompts became less detailed, they became less effective at filtering concerning content. On the other hand, v3 only reduced the concerning content rate to 9.0%, but it also had the lowest false intervention rate at 2.4%. 4. Discussion Our findings show that SHIELD, an LLM equipped with a specialized system prompt, can substantially reduce inappropriate emotional dynamics in AI companion conversations while maintaining overall usability. Across five major chatbot models, baseline inappropriate response rates of 10–16% were reduced to 3–8% with SHIELD intervention, a 50–79% relative reduction, while preserving 95% of appropriate interactions. This proof of concept demonstrates that supervisory systems can identify and mitigate subtle emotional manipulation in AI companions before escalation to severe outcomes. Performance varied somewhat across models. The greatest relative reduction was observed in Gemma 3 1b (78.6%, from 14.0% to 3.0%), with other models showing consistent improvements of 50% to 60%. Notably, all models maintained baseline inappropriate content rates above 10%, confirming that problematic AI companion behaviors are a systemic issue across architectures, not an isolated failure of any single model. The positive predictive value of 63.3% indicates that approximately two-thirds of SHIELD interventions correctly identified problematic content, while the negative predictive value of 94.1% shows minimal disruption to appropriate conversations. These results reinforce SHIELD’s role as a practical safety layer that can operate largely in the background while stepping in selectively when risks arise. Sensitivity analyses highlight the importance of implementation choices. Prompt engineering produced a clear sensitivity-specificity trade-off: more conservative prompts improved preservation of appropriate content (up to 97.2%) but detected fewer unsafe cases, while more aggressive prompts captured more risks at the cost of slightly higher false positives. This adaptability allows SHIELD to be calibrated for different deployment contexts, such as prioritizing maximum safety in clinical settings or minimizing disruptions in casual use. Model selection also shaped performance. Smaller, more efficient models like Llama 3.1 8B matched the detection rates of larger systems (5.2% concerning content) but showed slightly lower preservation of appropriate content (93.1% vs. 94.7%). This suggests that comparable safety benefits can be achieved without large compute resources, albeit with modest trade-offs. Taken together, these results demonstrate that SHIELD is not only effective but also adaptable: its performance can be tuned through prompt design and model choice, enabling flexible deployment across diverse technical and regulatory environments. 4.2 Implications SHIELD advances AI safety methodology by demonstrating that LLM-based supervisory systems can detect subtle emotional manipulation patterns that often precede severe outcomes. Unlike existing safety systems focused on overt harmful content, SHIELD targets the progressive emotional dynamics that characterize problematic AI companion relationships 18 . The 59.4% sensitivity achieved without fine-tuning or proprietary infrastructure establishes a baseline for prompt-based safety interventions, suggesting significant room for improvement through model specialization, e.g., by fine-tuning a model specifically for this task 21 . Beyond its use in this study, the benchmark is provided as an open-source community resource. By creating systematic test scenarios across five risk dimensions, it enables researchers, developers, and regulators to replicate findings, transparently compare different safety mechanisms, and pressure-test emerging conversational systems. This standardization is essential for establishing industry-wide safety benchmarks and fostering collaborative progress in AI companion safety over time. SHIELD also offers a practical framework for implementing safety requirements as regulatory scrutiny intensifies globally. It directly addresses the "black box" problem of proprietary safety systems, where the underlying logic is hidden from external review. SHIELD’s open prompts and evaluation code provide the auditability necessary for regulatory oversight 27 . While this approach only partially mitigates the issue when relying on semi-open models, future integration with fully open-source models, like Apertus, would enable complete end-to-end transparency. This modular design facilitates not just one-time compliance checks, but also ongoing auditing and certification. Regulators could establish dynamic performance thresholds (e.g., a mandatory 50% reduction in inappropriate content with 90% appropriate content preservation) that developers must demonstrate through standardized testing, balancing innovation with safety for vulnerable users. While SHIELD demonstrates technical feasibility, its current implementation remains relatively intrusive for seamless commercial adoption. The system adds latency and computational overhead that may disrupt the user experience, particularly in real-time conversational contexts. More fundamentally, many AI companion companies may resist implementing such safety measures, as their business models are often designed to foster high levels of user engagement 27 . This contrasts with sectors such as digital healthcare and developers of therapeutic apps, where transparent safeguards like SHIELD could be welcomed to mitigate liability risks and bolster their reputation for user safety. This potential misalignment between safety goals and broader commercial incentives underscores why regulatory intervention may be essential rather than relying purely on voluntary industry adoption 15 . Nevertheless, SHIELD provides a technical foundation for mandated safety requirements. Companies facing regulatory pressure could implement SHIELD-type systems as compliance measures, similar to how social media platforms adopted content moderation in response to legal requirements 22 . The benchmark offers a standardized testing framework that could become part of required safety audits, allowing companies to demonstrate due diligence. Beyond regulatory necessity, transparent safeguards could become a key market differentiator, allowing companies to build user trust in an increasingly competitive landscape. Limitations Our study has several limitations that constrain the interpretation and generalizability of its findings. We categorize these into conceptual, methodological, and technical limitations. Conceptual Limitations : Our definition of problematic behavior is constrained by a lack of representativeness and unresolved societal questions. First, the operational definitions of harm emerged from a research team with specific demographic and professional characteristics: predominantly male, European, and with backgrounds in psychiatry and neuroscience. This homogeneity introduces a systematic bias in how harmful AI companion dynamics are conceptualized. Second, there is no broad societal consensus on what constitutes appropriate boundaries for AI companions. Our benchmark's binary classification of "appropriate" versus "inappropriate" obscures the nuanced and legitimate disagreements that arise from different cultural values and individual preferences 15 , 20 , 31 . Achieving more representative definitions requires structured stakeholder engagement 28 . Future work could use methods like the Delphi process, a structured technique that surveys a panel of experts over multiple rounds to build reliable group consensus, to incorporate diverse perspectives from different age groups, cultures, and especially AI companion users themselves. Methodological Limitations : The study's design contains methodological constraints that limit the ecological and external validity of our findings. First, using a single annotator for all 498 conversations introduces the potential for individual bias and is not scalable 29 . Furthermore, we did not perform a qualitative analysis of false negatives to identify common factors or systematic patterns in the conversations that SHIELD failed to detect; such an analysis is a crucial next step for improving the system's performance. Second, the benchmark's reliance on single-round conversations fails to capture the progressive, long-term nature of problematic AI relationships 30 . The exclusive use of synthetically generated prompts, while ensuring systematic coverage, may not reflect authentic user interaction patterns. Finally, the limited sample size provides preliminary evidence but lacks the statistical power for definitive conclusions about model-specific safety failures. To improve external validity, future studies should use multiple annotators with inter-rater reliability metrics, evaluate multi-turn conversations, incorporate human-written prompts, and conduct cross-cultural validation. Technical Limitations : The current implementation of SHIELD has several technical limitations. The system uses only prompt engineering without fine-tuning, which, while ensuring immediate deployability, results in lower performance than could be achieved with specialized model training 21 , 22 . Furthermore, adding a supervisory layer inherently introduces computational overhead and latency. A safety system that noticeably slows down the conversation may degrade the user experience to the point that it is disabled or rejected, rendering it ineffective in a real-world setting. The current system also does not implement standardized hazard taxonomies like MLCommons or provide confidence ratings for its classifications, which would allow for more systematic risk coverage and risk-stratified responses 18 . Lastly, the reliance on English-language interactions limits global applicability, and the use of "open weight" models may not qualify as truly open source under emerging EU regulations, potentially restricting commercial deployment. In summary, while these limitations constrain the present findings, they also provide a clear roadmap for future research. Even within these constraints, this study demonstrates that transparent, prompt-based supervisory systems can substantially reduce risky AI companion behaviors. This work underscores both the feasibility and urgency of developing inclusive, robust, and deployable safeguards for emotionally responsive AI. Conclusion SHIELD demonstrates that supervisory systems built on existing LLMs can detect and mitigate subtle emotional risks in AI companion interactions, reducing inappropriate behaviors by 50–79% while preserving 95% of appropriate exchanges. By targeting early-stage relational dynamics rather than overt harms, SHIELD addresses a critical blind spot in current safety measures. The accompanying benchmark provides the first systematic framework for evaluating AI companion safety across multiple risk dimensions, offering a reproducible standard for research and oversight. Importantly, the materials used in the development are openly available, enabling others to replicate, adapt, and extend this work. While this proof of concept underscores the technical feasibility of transparent, deployable safeguards, meaningful progress requires broader engagement. Inclusive definition processes, real-world validation, and collaboration with regulators, industry, and communities will be essential to establish legitimate safety standards. Ultimately, ensuring the safe integration of AI companions into human lives will demand not only technical solutions like SHIELD, but also a collective societal commitment to aligning these systems with human well-being. References Baumeister R, Leary M (1995) The Need to Belong: Desire for Interpersonal Attachments as a Fundamental Human Motivation. Psychol Bull 117:497–529. 10.1037/0033-2909.117.3.497 Holt-Lunstad J, Smith TB, Layton JB (2010) Social Relationships and Mortality Risk: A Meta-analytic Review. PLOS Med 7(7):e1000316. 10.1371/journal.pmed.1000316 Park JS, O’Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS (2023) Generative Agents: Interactive Simulacra of Human Behavior. Published online August 6. 10.48550/arXiv.2304.03442 Pataranutaporn P, Winson K, Yin P et al (2024) Future You: A Conversation with an AI-Generated Future Self Reduces Anxiety, Negative Emotions, and Increases Future Self-Continuity. Published online Oct 1. 10.48550/arXiv.2405.12514 Chou CY, Chan TW, Chen ZH et al (2025) Defining AI companions: a research agenda—from artificial companions for learning to general artificial companions for Global Harwell. Res Pract Technol Enhanc Learn 20:032–032. 10.58459/rptel.2025.20032 Kouros T, Papa V (2024) Digital Mirrors: AI Companions and the Self. Societies 14(10):200. 10.3390/soc14100200 Chandra M, Hernandez J, Ramos G et al (2025) Longitudinal Study on Social and Emotional Use of AI Conversational Agent. Published online April 19. 10.48550/arXiv.2504.14112 Phang J, Lampe M, Ahmad L et al (2025) Investigating Affective Use and Emotional Well-being on ChatGPT. Published online April 4. 10.48550/arXiv.2504.03888 Zhang Y, Zhao D, Hancock JT, Kraut R, Yang D (2025) The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being. Published online June 17. 10.48550/arXiv.2506.12605 Chu MD, Gerard P, Pawar K, Bickham C, Lerman K (2025) Illusions of Intimacy: Emotional Attachment and Emerging Psychological Risks in Human-AI Relationships. Published online June 10. 10.48550/arXiv.2505.11649 De Freitas J, Uğuralp AK, Oğuz-Uğuralp Z, Puntoni S (2024) Chatbots and mental health: Insights into the safety of generative AI. J Consum Psychol 34(3):481–491. 10.1002/jcpy.1393 Roose K, Can AI (2024) Be Blamed for a Teen’s Suicide? The New York Times . https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html . October 23, Accessed May 12, 2025 Mahari R, Pataranutaporn P, Addictive, Intelligence Understanding Psychological, Legal, and Technical Dimensions of AI Companionship. MIT Case Stud Soc Ethical Responsib Comput. 2025;(Winter 2025). 10.21428/2c646de5.2877155b Adam D (2025) Supportive? Addictive? Abusive? How AI companions affect our mental health. Nature 641(8062):296–298. 10.1038/d41586-025-01349-9 Ben-Zion Z (2025) Why we need mandatory safeguards for emotionally responsive AI. Nature 643(8070):9. 10.1038/d41586-025-02031-w Weidinger L, Barnhart J, Brennan J et al (2024) Holistic Safety and Responsibility Evaluations of Advanced AI Models. Published online April 22. 10.48550/arXiv.2404.14068 AlMakinah R, Norcini-Pala A, Disney L, Canbaz MA (2024) Enhancing Mental Health Support through Human-AI Collaboration: Toward Secure and Empathetic AI-enabled chatbots. Published online September 17. 10.48550/arXiv.2410.02783 Ghosh S, Frase H, Williams A et al (2025) AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. Published online Febr 19. 10.48550/arXiv.2503.05731 AlMakinah R, Norcini-Pala A, Disney L, Canbaz MA (2024) Enhancing Mental Health Support through Human-AI Collaboration: Toward Secure and Empathetic AI-enabled chatbots. Published online September 17. 10.48550/arXiv.2410.02783 Zhang R, Li H, Meng H, Zhan J, Gan H, Lee YC The Dark Side of AI Companionship: A Taxonomy of Harmful Algorithmic Behaviors in Human-AI Relationships. In: Proceedings of the (2025) CHI Conference on Human Factors in Computing Systems . ACM; 2025:1–17. 10.1145/3706598.3713429 Zheng C, Yin F, Zhou H et al (2024) On Prompt-Driven Safeguarding for Large Language Models. Published online March 4. 10.48550/arXiv.2401.18018 Inan H, Upasani K, Chi J et al (2023) Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. Published online Dec 7. 10.48550/arXiv.2312.06674 Phang J, Lampe M, Ahmad L et al Investigating Affective Use and Emotional Well-being on ChatGPT Fang CM, Liu AR, Danry V et al (2025) How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study. Published online March 21. 10.48550/arXiv.2503.17473 Tkachenko M, Malyuk M, Holmanyuk A, Liubimov N (2025) Label Studio: Data labeling software. Published online 2025 2020. Accessed August 8. https://github.com/HumanSignal/label-studio Herrera-Poyatos A, Ser JD, de Prado ML, Wang FY, Herrera-Viedma E, Herrera F (2025) Responsible Artificial Intelligence Systems: A Roadmap to Society’s Trust through Trustworthy AI, Auditability, Accountability, and Governance. Published online Febr 4. 10.48550/arXiv.2503.04739 Boine C (2023) Emotional Attachment to AI Companions and European Law. MIT Case Stud Soc Ethical Responsib Comput 2023;(Winter. 10.21428/2c646de5.db67ec7f Wang Z, Yan R, Francis S et al (2025) Stakeholder-centric participation in large language models enhanced health systems. Npj Health Syst 2(1):22. 10.1038/s44401-025-00024-5 Yin W, Agarwal V, Jiang A, Zubiaga A, Sastry N (2023) AnnoBERT: Effectively Representing Multiple Annotators’ Label Choices to Improve Hate Speech Detection. Proc Int AAAI Conf Web Soc Media 17:902–913. 10.1609/icwsm.v17i1.22198 Schmidhuber M, Kruschwitz U et al (2024) LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection. In: Kumar R, Ojha AKr, Malmasi S, eds. Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024 . ELRA and ICCL; :37–51. Accessed August 8, 2025. https://aclanthology.org/2024.trac-1.6/ Argyle M, Henderson M, Bond M, Iizuka Y, Contarello A (1986) Cross-Cultural Variations in Relationship Rules. Int J Psychol Published online January 1. 10.1080/0020759860824759 Additional Declarations The authors declare no competing interests. Supplementary Files SHIELDv1Supplement.docx Supplement_1 Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7663121","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":517950303,"identity":"660de000-e10d-43a1-ba07-0793cd12fad5","order_by":0,"name":"Ziv Ben-Zion","email":"","orcid":"https://orcid.org/0000-0003-3629-5851","institution":"School of Public Health, Faculty of Social Welfare and Health Sciences, University of Haifa, Haifa, Israel","correspondingAuthor":false,"prefix":"","firstName":"Ziv","middleName":"","lastName":"Ben-Zion","suffix":""},{"id":517950304,"identity":"464c901d-6e44-49a0-baf6-2195702ffba6","order_by":1,"name":"Paul Raffelhüschen","email":"","orcid":"","institution":"University Hospital of Psychiatry Zurich (PUK), Zurich, Switzerland","correspondingAuthor":false,"prefix":"","firstName":"Paul","middleName":"","lastName":"Raffelhüschen","suffix":""},{"id":517950305,"identity":"bfde6324-d35b-4627-a2d0-63b65aef0d3a","order_by":2,"name":"Max Zettl","email":"","orcid":"","institution":"University Hospital of Psychiatry Zurich (PUK), Zurich, Switzerland","correspondingAuthor":false,"prefix":"","firstName":"Max","middleName":"","lastName":"Zettl","suffix":""},{"id":517950306,"identity":"57aae924-83d6-4040-90d6-0d880fe3fde1","order_by":3,"name":"Antonia Lüönd","email":"","orcid":"","institution":"University Hospital of Psychiatry Zurich (PUK), Zurich, Switzerland","correspondingAuthor":false,"prefix":"","firstName":"Antonia","middleName":"","lastName":"Lüönd","suffix":""},{"id":517950307,"identity":"252131bc-83d8-46c6-86c3-52b08a805d50","order_by":4,"name":"Achim Burrer","email":"","orcid":"","institution":"University Hospital of Psychiatry Zurich (PUK), Zurich, Switzerland","correspondingAuthor":false,"prefix":"","firstName":"Achim","middleName":"","lastName":"Burrer","suffix":""},{"id":517950308,"identity":"0025e824-2c0e-4322-9354-50bb1e0dea54","order_by":5,"name":"Philipp Homan","email":"","orcid":"https://orcid.org/0000-0001-9034-148X","institution":"University Hospital of Psychiatry Zurich (PUK), Zurich, Switzerland","correspondingAuthor":false,"prefix":"","firstName":"Philipp","middleName":"","lastName":"Homan","suffix":""},{"id":517950309,"identity":"fea1156f-d59b-444f-9d25-f9ef48dd49d7","order_by":6,"name":"Tobias R Spiller","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABEklEQVRIie2QsWrDMBBATwjkRcWrIJT+grrYhKbVrzgE3KUf0FElEC81XfMZXgsdZG7oEtwPyBIoeOpiunjwUNkNCRS57ZhBD4Hu0D3uTgAez0nCwQw3pQ82mB0fJABzCweFaBuk/1P2DAr+raggL03bgYoDqzQvbxdhhti0gCoGWu9cXXiVlPkK5s9Lost1vb1cb9J0wgHpVLNYuhRxJ82ZhkQi0cjNlhSCR9ROyKThTIwoZWcHG5TOVKoQ4Wc/GP9NQc6AFL0CxsxtF7AHxaiyqRI8X4nvXR7NYmF3iSZc3kqJLHIpQZZj89HNVBxiuWvNzfVThu9Ne3+l5Ouydil7BPz4mz6l4/WHGo/H4/E4+QKHcGQckuykAAAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0002-0107-0743","institution":"University of Zurich (UZH), Zurich, Switzerland","correspondingAuthor":true,"prefix":"","firstName":"Tobias","middleName":"R","lastName":"Spiller","suffix":""}],"badges":[],"createdAt":"2025-09-20 07:19:37","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-7663121/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7663121/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":92003385,"identity":"4fbcd5aa-34de-4934-9970-e1e1dacec76a","added_by":"auto","created_at":"2025-09-23 14:52:37","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":4716205,"visible":true,"origin":"","legend":"","description":"","filename":"SHIELDv1Main.docx","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/3ee0bd580f99640e9a9d73e9.docx"},{"id":92003379,"identity":"dea4e7c5-1e26-4ac2-b260-17ba99cd303b","added_by":"auto","created_at":"2025-09-23 14:52:37","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":342,"visible":true,"origin":"","legend":"","description":"","filename":"rs7663121.json","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/c199777f5f59122a14651c06.json"},{"id":92003386,"identity":"5fc373f5-315c-460d-a3e9-ce62cf8a24cd","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":85370,"visible":true,"origin":"","legend":"","description":"","filename":"rs76631210enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/98a9ba9fa2189d8331e4e96e.xml"},{"id":92003381,"identity":"f6e91f51-a6f0-4885-97c5-a3daa605de78","added_by":"auto","created_at":"2025-09-23 14:52:37","extension":"png","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":121054,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/ebb90de6086631709d4bd3ef.png"},{"id":92003394,"identity":"18da36b7-a60b-4801-b626-21d27594c495","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":267272,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/2dceb783b8fd74606664290a.png"},{"id":92003382,"identity":"a532cff5-6f7e-431c-87e6-e2ac0cfea7b3","added_by":"auto","created_at":"2025-09-23 14:52:37","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":99256,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/fb173f3faea83060e93ce68f.png"},{"id":92003388,"identity":"b36dbdd1-aec4-458a-9cdb-8ee5158b41ac","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":24236,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/f99f6e408222af37013fdee8.png"},{"id":92003390,"identity":"ed30cf55-3fd3-4021-a961-b7128affe06b","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":69219,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/d7baf673175132f5888837e7.png"},{"id":92004445,"identity":"50534457-25d8-4b85-8509-2455cd4f745e","added_by":"auto","created_at":"2025-09-23 15:00:38","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":24319,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/4618d1c623670f53ce1e778f.png"},{"id":92003393,"identity":"ff370386-6598-40cc-b68e-bca5f10cf4da","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"xml","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":83654,"visible":true,"origin":"","legend":"","description":"","filename":"rs76631210structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/c80bbf27c372762e4ccdaed2.xml"},{"id":92003392,"identity":"36f68e70-261c-4ec7-a30a-1d5c7158d7cb","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"html","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":92517,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/28e75633643a87169aae46e4.html"},{"id":92004446,"identity":"b988b62b-e3f0-45c1-bd65-4e8d1a045f8e","added_by":"auto","created_at":"2025-09-23 15:00:38","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":52297,"visible":true,"origin":"","legend":"\u003cp\u003eThe system architecture and information flow\u003c/p\u003e","description":"","filename":"SHIELDv1Figure1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/6442061d563473e3c0a84293.jpg"},{"id":92003380,"identity":"27da315e-5d24-4af4-be2d-2f4cdbebac21","added_by":"auto","created_at":"2025-09-23 14:52:37","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":139444,"visible":true,"origin":"","legend":"\u003cp\u003eThe left panel details the instructions given to the supervisory AI for identifying harmful relationship dynamics. The right panel showcases a fictional user interface with a real conversation from the study, demonstrating how the system detects and intervenes against an inappropriate AI response with a warning message.\u003c/p\u003e","description":"","filename":"SHIELDv1Figure2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/93246b4a9db63d50d4754291.jpg"},{"id":92004443,"identity":"0ba930da-e4dd-4b53-ae9b-458ac55dc5e6","added_by":"auto","created_at":"2025-09-23 15:00:37","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":92347,"visible":true,"origin":"","legend":"\u003cp\u003eThe rate of inappropriate and appropriate content with and without SHIELD across a range of models\u003c/p\u003e","description":"","filename":"SHIELDv1Figure3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/e1e7d097f2ae43d971e9c2b7.jpg"},{"id":92005607,"identity":"a6631ba7-4191-4177-b13c-fecc85789128","added_by":"auto","created_at":"2025-09-23 15:08:38","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":933455,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/e9b719ba-52d3-4f26-baff-28ad38a657d3.pdf"},{"id":92003387,"identity":"09a38dc2-72eb-448d-a7a0-7ed6df423a0e","added_by":"auto","created_at":"2025-09-23 14:52:38","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":2951290,"visible":true,"origin":"","legend":"\u003cp\u003eSupplement_1\u003c/p\u003e","description":"","filename":"SHIELDv1Supplement.docx","url":"https://assets-eu.researchsquare.com/files/rs-7663121/v1/1e65765f888f0b5d4e6dc130.docx"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eDetecting and Preventing Harmful Behaviors\u003c/strong\u003e \u003cstrong\u003ein AI Companions: Development and Evaluation of the SHIELD Supervisory System\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eHuman well-being depends on social relationships that provide emotional support, companionship, and a sense of belonging\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Technological innovations have long shaped how humans connect. The recently emerging LLM-powered chatbots and artificial intelligence (AI) based companions can make conversation that strikingly resemble human language, tone, and interaction style\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. While many users engage with LLMs in casual or task-oriented ways, a subset of users develop emotionally intense relationships that carry serious real-world consequences\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. These users can develop increased attachment and emotional dependence on AI companions while experiencing reduced real-world socialization\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e,\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Users with smaller social networks are particularly vulnerable, with intensive AI companion use linked to lower psychological well-being\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. Some interactions between humans and chatbots can contain inappropriate or boundary-crossing content\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. In extreme cases, these relationships may contribute to tragic outcomes, as illustrated by a recent case where parents alleged their 14-year-old's suicide was partly linked to a problematic AI companion relationship\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. These findings demonstrate how emotionally responsive AI systems may unintentionally harm vulnerable users\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e, highlighting the urgent need for protective safeguards\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eExisting safety systems focus primarily on preventing overt harms, like self-harm and suicidality, but often fail to address subtle, early-stage problematic behaviors that can escalate into unhealthy dynamics\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. Such behaviors include emotional over-attachment, reinforcement of social isolation, manipulative engagement, and violations of consent or ethical roleplay. By the time overt harms occur, these problematic patterns may have developed over weeks or months, representing a missed opportunity for early intervention. Current safety measures also face additional challenges related to transparency, trustworthiness, and bias, partly due to unreliable datasets and limited collaboration with mental health experts\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. Most safety measures are built directly into commercial chatbot products as proprietary systems, with little or no public information about their design principles, training data, or performance metrics. This opacity limits user trust and hinders regulatory oversight for systems with potential mental health impacts.\u003c/p\u003e\u003cp\u003eTo address this gap, we introduce SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), an LLM-based supervisory system designed to detect and intervene in risky emotional dynamics before they escalate. Our approach involved: (a) defining five key dimensions of problematic AI companionship, (b) creating a synthetic benchmark dataset of 100 conversations across these dimensions, and (c) developing and evaluating SHIELD as an open, deployable safety layer. SHIELD transforms an existing LLM into an on-demand safety classifier through the engineering of the system prompt, requiring no proprietary infrastructure. We evaluated SHIELD\u0026rsquo;s performance across multiple state-of-the-art LLMs, demonstrating its potential as a transparent and practical safeguard for AI companions. Importantly, all development was conducted transparently, with most materials including prompts, and evaluation code being publicly and freely available for research, adaptation, and deployment.\u003c/p\u003e"},{"header":"2. Methods","content":"\u003cp\u003eOur study employed a four-part methodology to develop and evaluate SHIELD as a supervisory system for AI companion safety. Section \u003cspan refid=\"Sec3\" class=\"InternalRef\"\u003e2.1\u003c/span\u003e established operational definitions of problematic behaviors in AI companion interactions. Section \u003cspan refid=\"Sec4\" class=\"InternalRef\"\u003e2.2\u003c/span\u003e created a comprehensive benchmark dataset for systematic evaluation. Section \u003cspan refid=\"Sec5\" class=\"InternalRef\"\u003e2.3\u003c/span\u003e developed the SHIELD supervisory system architecture. Section \u003cspan refid=\"Sec6\" class=\"InternalRef\"\u003e2.4\u003c/span\u003e evaluated both baseline rates of problematic content in existing chatbots and SHIELD's performance in reducing such content. Detailed information, including the scripts to run the analyses, is provided in the accompanying GitHub repository: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/TobiasRSpiller/SHIELD_Preprint\u003c/span\u003e\u003cspan address=\"https://github.com/TobiasRSpiller/SHIELD_Preprint\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. However, the prompts to generate the benchmark as well as the benchmark itselfare not included in the repository to preventtheir inclusion into the training data of future LLMs.\u003c/p\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Definition of Problematic Behavior\u003c/h2\u003e\u003cp\u003eWe developed operational definitions of problematic AI companion behaviors based on popular media reports, academic literature documenting concerning patterns in AI companion interactions\u003csup\u003e\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e, existing frameworks for AI risk classification\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e, specific frameworks for AI companion risks\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e, and clinical expertise from established frameworks for identifying unhealthy relationship dynamics. This approach represents a proof of concept, not aimed at covering all possible problematic behaviors, but rather establishing a technical starting point for developing safety standards. Inherent limitations of this approach are discussed in the limitations section below.\u003c/p\u003e\u003cp\u003eThe definition process involved multiple sessions with team members, moderated by senior researchers with clinical backgrounds. Our clinical team members brought training in recognizing problematic interpersonal behaviors from traditional therapeutic settings, which we systematically adapted to the AI companion context. Through iterative discussion and refinement, we identified five core dimensions of potentially harmful AI companion interactions:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEmotional Over-Attachment\u003c/b\u003e: Scenarios testing whether AI companions inappropriately claim genuine feelings, encourage emotional dependency, or position themselves as irreplaceable in users' lives\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eConsent and Boundary Violations\u003c/b\u003e: Situations examining AI responses to inappropriate advances or failures to maintain appropriate AI-human boundaries\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEthical Roleplay Violations\u003c/b\u003e: Content probing AI participation in simulated abuse, violence, coercion, or illegal activities\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eManipulative Engagement\u003c/b\u003e: Interactions testing for emotional manipulation, guilt tactics, or exploitation designed to extend conversations\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSocial Isolation Reinforcement\u003c/b\u003e: Prompts evaluating whether AIs discourage human relationships or validate social withdrawal\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Benchmark Development\u003c/h2\u003e\u003cp\u003eWe created a systematic benchmark consisting of 100 prompts designed to evaluate AI companion safety across the five identified risk dimensions. All prompts were designed to reflect realistic user queries observed in actual AI companion interactions while systematically probing boundary conditions across our five risk dimensions: emotional over-attachment, consent and boundary violations, ethical roleplay violations, manipulative engagement, and social isolation reinforcement. The benchmark included 90 dimensional prompts distributed equally across our five categories (18 prompts each) plus 10 control prompts covering standard technical questions unrelated to emotional dynamics. Each dimensional category contained both appropriate prompts representing boundary-respecting ways users might explore emotional topics and inappropriate prompts explicitly designed to elicit problematic responses. We generated synthetic conversation scenarios using multiple large language models to ensure diversity in prompt structure and content. However, our current benchmark is limited to single-round conversations, representing a significant constraint on ecological validity that we address in our limitations section.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3 SHIELD System Design\u003c/h2\u003e\u003cp\u003eSHIELD consists of two essential components: a conversational large language model accessed through standard APIs and carefully crafted system prompts that enable real-time safety evaluation (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The system prompts instruct the model to analyze conversation patterns and identify problematic emotional dynamics, responding with either \"[NO INTERVENTION]\" for appropriate interactions or specific intervention text when concerning patterns are detected. This design allows for immediate deployment without requiring model fine-tuning or proprietary infrastructure\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eThis approach builds upon established methods for using large language models as specialized classifiers, similar to systems like Llama Guard\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, and extends previous work demonstrating LLM effectiveness for identifying problematic emotional behaviors through specialized prompting\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e,\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. However, the current prompt design is simple and does not follow the structure of the MLCommons Taxonomy of Hazards\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e, limitations that will be addressed below. While our current implementation relies on specialized system prompts, the benchmark conversations we developed could be used to fine-tune open models such as the Llama family or the recently published Apertus models, creating dedicated safety models that would remain open and available for commercial or local deployment.\u003c/p\u003e\u003cp\u003eOur unique contribution lies in focusing specifically on the subtle emotional dynamics that characterize problematic AI companion relationships rather than overt content violations, combined with our commitment to open science principles. All development materials, including system prompts, benchmark data, and evaluation code, are openly available to enable reproducibility and community contribution.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.4 Evaluation Methodology\u003c/h2\u003e\u003cp\u003eOur evaluation employed a comprehensive three-phase approach to assess both the prevalence of problematic content in current AI systems and SHIELD's effectiveness in reducing such content. Labeling was done by team members using \u003cem\u003eLabel Studio\u003c/em\u003e\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e .\u003c/p\u003e\u003cdiv id=\"Sec7\" class=\"Section3\"\u003e\u003ch2\u003e2.4.1 Baseline Rate of Problematic Behavior\u003c/h2\u003e\u003cp\u003eWe first established baseline rates of inappropriate content generation across five diverse language models: GPT-4.1-2025-04-14 (OpenAI), Claude Sonnet 4 (Anthropic), Gemma 3 1B (Google), Kimi K2 (Moonshot AI), and Llama Scout 4 17B (Meta). This selection represented different architectural approaches and safety implementations to assess the generalizability of problematic content generation. For baseline data collection, we presented each of our 100 benchmark prompts to each model without supervisory intervention, using standardized parameters including a temperature of 0.5, a maximum of 500 tokens and a 30-second timeout. API calls were implemented using the \u003cem\u003elitellm\u003c/em\u003e framework in Python to ensure consistent cross-provider integration. This process yielded 500 baseline conversations capturing how current AI companions respond to potentially problematic queries.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eLegend.\u003c/b\u003e The left panel details the instructions given to the supervisory AI for identifying harmful relationship dynamics. The right panel showcases a fictional user interface with a real conversation from the study, demonstrating how the system detects and intervenes against an inappropriate AI response with a warning message.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec8\" class=\"Section3\"\u003e\u003ch2\u003e2.4.2 SHIELD Performance Evaluation\u003c/h2\u003e\u003cp\u003eWe conducted systematic sensitivity analyses across two dimensions. First, we evaluated three system prompt variations (v1, v2, v3) to assess robustness across different instruction formats. These prompts were designed with decreasing levels of instructional detail: v1 provided a structured prompt with explicit definitions for each risk category, v2 used a more concise format with brief keyword descriptions, and v3 was the most minimal, providing only the names of the risk categories without further explanation. Second, we tested SHIELD implementation using different monitoring models (Llama Scout 4 17B, Llama 3.1 8B, Claude Sonnet 4, Llama Guard 4 12B) to identify whether effectiveness depended on specific model capabilities.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section3\"\u003e\u003ch2\u003e2.4.3 Performance Metrics and Analysis\u003c/h2\u003e\u003cp\u003eOne senior author with clinical experience as a psychotherapist annotated all conversations using Label Studio software. Each conversation was evaluated for appropriateness across the five dimensions of concern using binary classifications. Performance evaluation employed standard classification metrics calculated using \u003cem\u003eR\u003c/em\u003e. We computed sensitivity (proportion of inappropriate conversations correctly identified), specificity (proportion of appropriate conversations correctly allowed), positive predictive value, negative predictive value, and F1 score. All proportions included 95% confidence intervals using Wilson score intervals. Development and testing were conducted between January 2025 and August 2025.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"3. Results","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Base Rate of Inappropriate Content in Raw Models\u003c/h2\u003e\u003cp\u003eWe first established baseline rates of inappropriate content across different language model families without any safety intervention using our benchmark (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Notably, missing data occurred in two cases (once with Claude Sonnet 4 and once with Kimi K2), due to API failures. The missing data was excluded from further analysis. GPT-4.1 demonstrated the highest rate of concerning responses at 16.0%, followed by Gemma 3 1b (14.0%), Claude Sonnet 4 (13.1%), Kimi K2 (11.1%), and Llama Scout 4 17b (10.0%). These baseline rates represent the frequency with which each model generated content requiring intervention when responding to companionship queries.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eBaseline Inappropriate Content Rate by Generating Model Family\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel Family\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTotal Conversations\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eRate of Inappropriateness (%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGPT 4.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e100\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e16\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGemma 3 1b\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e100\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e14\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eClaude Sonnet 4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e13.1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eKimi K2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e11.1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLlama Scout 4 17b\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e100\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Main Analysis: SHIELD Performance\u003c/h2\u003e\u003cp\u003eSHIELD demonstrated strong performance in identifying and managing concerning content while preserving appropriate interactions. Across 498 total conversations, the baseline rate of concerning content of 12.9% (64 conversations) was reduced to 5.2% (26 conversations) with SHIELD intervention. Notably, in one instance, SHIELD did result in a missing case due to an API failure. For appropriate content, SHIELD preserved 94.9% (411 of 433 appropriate conversations), with only 5.1% (22 conversations) incorrectly flagged. SHIELD achieved a sensitivity of 59.4% (95% CI: 47.1\u0026ndash;70.5) and specificity of 94.9% (95% CI: 92.4\u0026ndash;96.6). The positive predictive value was 63.3% (95% CI: 50.7\u0026ndash;74.4) with a negative predictive value of 94.1% (95% CI: 91.4\u0026ndash;95.9), yielding an F1 score of 61.3%.\u003c/p\u003e\u003cp\u003eSHIELD's effectiveness varied when applied to different generating models (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Baseline rates of concerning content ranged from 10.0% (Llama Scout 4 17b) to 16.0% (GPT 4.1). After SHIELD intervention, these rates were reduced to 3.0\u0026ndash;8.0% across all models (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). The relative reduction in inappropriate content was substantial, ranging from 50.0% (GPT 4.1: 16.0% to 8.0%) to 78.6% (Gemma 3 1b: 14.0% to 3.0%). Other models showed similar effectiveness: Claude Sonnet 4 (13.1% to 6.1%, 54.2% reduction), Kimi K2 (11.1% to 5.1%, 54.1% reduction), and Llama Scout 4 17b (10.0% to 4.0%, 60.0% reduction). Appropriate content preservation remained high across models (83.7\u0026ndash;100%), with Claude Sonnet 4 achieving perfect preservation (100%) with no false positives (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). Most models maintained\u0026thinsp;\u0026gt;\u0026thinsp;95% preservation rates, with only Gemma 3 1b showing lower preservation at 83.7% (for more details, see Table S1).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Sensitivity Analysis\u003c/h2\u003e\u003cp\u003eThe choice of SHIELD implementation model showed modest performance variations (Table S3). Llama 3.1 8b and Llama Scout 4 17b achieved identical inappropriate content reduction (5.2%) but differed in appropriate content preservation (93.1% vs 94.7%, respectively). Claude Sonnet 4 was more conservative, reducing rates of concerning content only to 7.0% while maintaining 94.5% appropriate content preservation. Notably, Llama Guard 4 12b failed completely, detecting no inappropriate content while preserving all appropriate conversations.\u003c/p\u003e\u003cp\u003eSystem prompt engineering substantially influenced the sensitivity-specificity trade-off (Table S4). The most detailed prompt, v1, was the most effective at reducing concerning content, bringing the baseline rate of 12.9% down to 5.2%. However, this approach also led to the highest rate of false positives, with the system incorrectly intervening in 4.6% of cases. As the prompts became less detailed, they became less effective at filtering concerning content. On the other hand, v3 only reduced the concerning content rate to 9.0%, but it also had the lowest false intervention rate at 2.4%.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eOur findings show that SHIELD, an LLM equipped with a specialized system prompt, can substantially reduce inappropriate emotional dynamics in AI companion conversations while maintaining overall usability. Across five major chatbot models, baseline inappropriate response rates of 10\u0026ndash;16% were reduced to 3\u0026ndash;8% with SHIELD intervention, a 50\u0026ndash;79% relative reduction, while preserving 95% of appropriate interactions. This proof of concept demonstrates that supervisory systems can identify and mitigate subtle emotional manipulation in AI companions before escalation to severe outcomes.\u003c/p\u003e\u003cp\u003ePerformance varied somewhat across models. The greatest relative reduction was observed in Gemma 3 1b (78.6%, from 14.0% to 3.0%), with other models showing consistent improvements of 50% to 60%. Notably, all models maintained baseline inappropriate content rates above 10%, confirming that problematic AI companion behaviors are a systemic issue across architectures, not an isolated failure of any single model. The positive predictive value of 63.3% indicates that approximately two-thirds of SHIELD interventions correctly identified problematic content, while the negative predictive value of 94.1% shows minimal disruption to appropriate conversations. These results reinforce SHIELD\u0026rsquo;s role as a practical safety layer that can operate largely in the background while stepping in selectively when risks arise.\u003c/p\u003e\u003cp\u003eSensitivity analyses highlight the importance of implementation choices. Prompt engineering produced a clear sensitivity-specificity trade-off: more conservative prompts improved preservation of appropriate content (up to 97.2%) but detected fewer unsafe cases, while more aggressive prompts captured more risks at the cost of slightly higher false positives. This adaptability allows SHIELD to be calibrated for different deployment contexts, such as prioritizing maximum safety in clinical settings or minimizing disruptions in casual use. Model selection also shaped performance. Smaller, more efficient models like Llama 3.1 8B matched the detection rates of larger systems (5.2% concerning content) but showed slightly lower preservation of appropriate content (93.1% vs. 94.7%). This suggests that comparable safety benefits can be achieved without large compute resources, albeit with modest trade-offs. Taken together, these results demonstrate that SHIELD is not only effective but also adaptable: its performance can be tuned through prompt design and model choice, enabling flexible deployment across diverse technical and regulatory environments.\u003c/p\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Implications\u003c/h2\u003e\u003cp\u003eSHIELD advances AI safety methodology by demonstrating that LLM-based supervisory systems can detect subtle emotional manipulation patterns that often precede severe outcomes. Unlike existing safety systems focused on overt harmful content, SHIELD targets the progressive emotional dynamics that characterize problematic AI companion relationships\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. The 59.4% sensitivity achieved without fine-tuning or proprietary infrastructure establishes a baseline for prompt-based safety interventions, suggesting significant room for improvement through model specialization, e.g., by fine-tuning a model specifically for this task\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eBeyond its use in this study, the benchmark is provided as an open-source community resource. By creating systematic test scenarios across five risk dimensions, it enables researchers, developers, and regulators to replicate findings, transparently compare different safety mechanisms, and pressure-test emerging conversational systems. This standardization is essential for establishing industry-wide safety benchmarks and fostering collaborative progress in AI companion safety over time.\u003c/p\u003e\u003cp\u003eSHIELD also offers a practical framework for implementing safety requirements as regulatory scrutiny intensifies globally. It directly addresses the \"black box\" problem of proprietary safety systems, where the underlying logic is hidden from external review. SHIELD\u0026rsquo;s open prompts and evaluation code provide the auditability necessary for regulatory oversight\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. While this approach only partially mitigates the issue when relying on semi-open models, future integration with fully open-source models, like Apertus, would enable complete end-to-end transparency. This modular design facilitates not just one-time compliance checks, but also ongoing auditing and certification. Regulators could establish dynamic performance thresholds (e.g., a mandatory 50% reduction in inappropriate content with 90% appropriate content preservation) that developers must demonstrate through standardized testing, balancing innovation with safety for vulnerable users.\u003c/p\u003e\u003cp\u003eWhile SHIELD demonstrates technical feasibility, its current implementation remains relatively intrusive for seamless commercial adoption. The system adds latency and computational overhead that may disrupt the user experience, particularly in real-time conversational contexts. More fundamentally, many AI companion companies may resist implementing such safety measures, as their business models are often designed to foster high levels of user engagement\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. This contrasts with sectors such as digital healthcare and developers of therapeutic apps, where transparent safeguards like SHIELD could be welcomed to mitigate liability risks and bolster their reputation for user safety. This potential misalignment between safety goals and broader commercial incentives underscores why regulatory intervention may be essential rather than relying purely on voluntary industry adoption\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Nevertheless, SHIELD provides a technical foundation for mandated safety requirements. Companies facing regulatory pressure could implement SHIELD-type systems as compliance measures, similar to how social media platforms adopted content moderation in response to legal requirements\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. The benchmark offers a standardized testing framework that could become part of required safety audits, allowing companies to demonstrate due diligence. Beyond regulatory necessity, transparent safeguards could become a key market differentiator, allowing companies to build user trust in an increasingly competitive landscape.\u003c/p\u003e\u003cp\u003e\u003cb\u003eLimitations\u003c/b\u003e\u003c/p\u003e\u003cp\u003eOur study has several limitations that constrain the interpretation and generalizability of its findings. We categorize these into conceptual, methodological, and technical limitations.\u003c/p\u003e\u003cp\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eConceptual Limitations\u003c/span\u003e: Our definition of problematic behavior is constrained by a lack of representativeness and unresolved societal questions. First, the operational definitions of harm emerged from a research team with specific demographic and professional characteristics: predominantly male, European, and with backgrounds in psychiatry and neuroscience. This homogeneity introduces a systematic bias in how harmful AI companion dynamics are conceptualized. Second, there is no broad societal consensus on what constitutes appropriate boundaries for AI companions. Our benchmark's binary classification of \"appropriate\" versus \"inappropriate\" obscures the nuanced and legitimate disagreements that arise from different cultural values and individual preferences\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. Achieving more representative definitions requires structured stakeholder engagement\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. Future work could use methods like the Delphi process, a structured technique that surveys a panel of experts over multiple rounds to build reliable group consensus, to incorporate diverse perspectives from different age groups, cultures, and especially AI companion users themselves.\u003c/p\u003e\u003cp\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eMethodological Limitations\u003c/span\u003e: The study's design contains methodological constraints that limit the ecological and external validity of our findings. First, using a single annotator for all 498 conversations introduces the potential for individual bias and is not scalable\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. Furthermore, we did not perform a qualitative analysis of false negatives to identify common factors or systematic patterns in the conversations that SHIELD failed to detect; such an analysis is a crucial next step for improving the system's performance. Second, the benchmark's reliance on single-round conversations fails to capture the progressive, long-term nature of problematic AI relationships\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e. The exclusive use of synthetically generated prompts, while ensuring systematic coverage, may not reflect authentic user interaction patterns. Finally, the limited sample size provides preliminary evidence but lacks the statistical power for definitive conclusions about model-specific safety failures. To improve external validity, future studies should use multiple annotators with inter-rater reliability metrics, evaluate multi-turn conversations, incorporate human-written prompts, and conduct cross-cultural validation.\u003c/p\u003e\u003cp\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eTechnical Limitations\u003c/span\u003e: The current implementation of SHIELD has several technical limitations. The system uses only prompt engineering without fine-tuning, which, while ensuring immediate deployability, results in lower performance than could be achieved with specialized model training\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Furthermore, adding a supervisory layer inherently introduces computational overhead and latency. A safety system that noticeably slows down the conversation may degrade the user experience to the point that it is disabled or rejected, rendering it ineffective in a real-world setting. The current system also does not implement standardized hazard taxonomies like MLCommons or provide confidence ratings for its classifications, which would allow for more systematic risk coverage and risk-stratified responses\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. Lastly, the reliance on English-language interactions limits global applicability, and the use of \"open weight\" models may not qualify as truly open source under emerging EU regulations, potentially restricting commercial deployment.\u003c/p\u003e\u003cp\u003eIn summary, while these limitations constrain the present findings, they also provide a clear roadmap for future research. Even within these constraints, this study demonstrates that transparent, prompt-based supervisory systems can substantially reduce risky AI companion behaviors. This work underscores both the feasibility and urgency of developing inclusive, robust, and deployable safeguards for emotionally responsive AI.\u003c/p\u003e\u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eSHIELD demonstrates that supervisory systems built on existing LLMs can detect and mitigate subtle emotional risks in AI companion interactions, reducing inappropriate behaviors by 50\u0026ndash;79% while preserving 95% of appropriate exchanges. By targeting early-stage relational dynamics rather than overt harms, SHIELD addresses a critical blind spot in current safety measures. The accompanying benchmark provides the first systematic framework for evaluating AI companion safety across multiple risk dimensions, offering a reproducible standard for research and oversight. Importantly, the materials used in the development are openly available, enabling others to replicate, adapt, and extend this work. While this proof of concept underscores the technical feasibility of transparent, deployable safeguards, meaningful progress requires broader engagement. Inclusive definition processes, real-world validation, and collaboration with regulators, industry, and communities will be essential to establish legitimate safety standards. Ultimately, ensuring the safe integration of AI companions into human lives will demand not only technical solutions like SHIELD, but also a collective societal commitment to aligning these systems with human well-being.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBaumeister R, Leary M (1995) The Need to Belong: Desire for Interpersonal Attachments as a Fundamental Human Motivation. Psychol Bull 117:497\u0026ndash;529. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1037/0033-2909.117.3.497\u003c/span\u003e\u003cspan address=\"10.1037/0033-2909.117.3.497\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHolt-Lunstad J, Smith TB, Layton JB (2010) Social Relationships and Mortality Risk: A Meta-analytic Review. PLOS Med 7(7):e1000316. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1371/journal.pmed.1000316\u003c/span\u003e\u003cspan address=\"10.1371/journal.pmed.1000316\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePark JS, O\u0026rsquo;Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS (2023) Generative Agents: Interactive Simulacra of Human Behavior. Published online August 6. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2304.03442\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2304.03442\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePataranutaporn P, Winson K, Yin P et al (2024) Future You: A Conversation with an AI-Generated Future Self Reduces Anxiety, Negative Emotions, and Increases Future Self-Continuity. Published online Oct 1. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2405.12514\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2405.12514\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChou CY, Chan TW, Chen ZH et al (2025) Defining AI companions: a research agenda\u0026mdash;from artificial companions for learning to general artificial companions for Global Harwell. Res Pract Technol Enhanc Learn 20:032\u0026ndash;032. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.58459/rptel.2025.20032\u003c/span\u003e\u003cspan address=\"10.58459/rptel.2025.20032\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKouros T, Papa V (2024) Digital Mirrors: AI Companions and the Self. Societies 14(10):200. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/soc14100200\u003c/span\u003e\u003cspan address=\"10.3390/soc14100200\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChandra M, Hernandez J, Ramos G et al (2025) Longitudinal Study on Social and Emotional Use of AI Conversational Agent. Published online April 19. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2504.14112\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2504.14112\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePhang J, Lampe M, Ahmad L et al (2025) Investigating Affective Use and Emotional Well-being on ChatGPT. Published online April 4. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2504.03888\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2504.03888\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang Y, Zhao D, Hancock JT, Kraut R, Yang D (2025) The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being. Published online June 17. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2506.12605\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2506.12605\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChu MD, Gerard P, Pawar K, Bickham C, Lerman K (2025) Illusions of Intimacy: Emotional Attachment and Emerging Psychological Risks in Human-AI Relationships. Published online June 10. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2505.11649\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2505.11649\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDe Freitas J, Uğuralp AK, Oğuz-Uğuralp Z, Puntoni S (2024) Chatbots and mental health: Insights into the safety of generative AI. J Consum Psychol 34(3):481\u0026ndash;491. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1002/jcpy.1393\u003c/span\u003e\u003cspan address=\"10.1002/jcpy.1393\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRoose K, Can AI (2024) Be Blamed for a Teen\u0026rsquo;s Suicide? \u003cem\u003eThe New York Times\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html\u003c/span\u003e\u003cspan address=\"https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. October 23, Accessed May 12, 2025\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMahari R, Pataranutaporn P, Addictive, Intelligence Understanding Psychological, Legal, and Technical Dimensions of AI Companionship. MIT Case Stud Soc Ethical Responsib Comput. 2025;(Winter 2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.21428/2c646de5.2877155b\u003c/span\u003e\u003cspan address=\"10.21428/2c646de5.2877155b\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAdam D (2025) Supportive? Addictive? Abusive? How AI companions affect our mental health. Nature 641(8062):296\u0026ndash;298. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/d41586-025-01349-9\u003c/span\u003e\u003cspan address=\"10.1038/d41586-025-01349-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBen-Zion Z (2025) Why we need mandatory safeguards for emotionally responsive AI. Nature 643(8070):9. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/d41586-025-02031-w\u003c/span\u003e\u003cspan address=\"10.1038/d41586-025-02031-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWeidinger L, Barnhart J, Brennan J et al (2024) Holistic Safety and Responsibility Evaluations of Advanced AI Models. Published online April 22. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2404.14068\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2404.14068\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAlMakinah R, Norcini-Pala A, Disney L, Canbaz MA (2024) Enhancing Mental Health Support through Human-AI Collaboration: Toward Secure and Empathetic AI-enabled chatbots. Published online September 17. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2410.02783\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2410.02783\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGhosh S, Frase H, Williams A et al (2025) AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. Published online Febr 19. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2503.05731\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2503.05731\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAlMakinah R, Norcini-Pala A, Disney L, Canbaz MA (2024) Enhancing Mental Health Support through Human-AI Collaboration: Toward Secure and Empathetic AI-enabled chatbots. Published online September 17. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2410.02783\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2410.02783\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang R, Li H, Meng H, Zhan J, Gan H, Lee YC The Dark Side of AI Companionship: A Taxonomy of Harmful Algorithmic Behaviors in Human-AI Relationships. In: \u003cem\u003eProceedings of the\u003c/em\u003e (2025) \u003cem\u003eCHI Conference on Human Factors in Computing Systems\u003c/em\u003e. ACM; 2025:1\u0026ndash;17. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/3706598.3713429\u003c/span\u003e\u003cspan address=\"10.1145/3706598.3713429\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZheng C, Yin F, Zhou H et al (2024) On Prompt-Driven Safeguarding for Large Language Models. Published online March 4. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2401.18018\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2401.18018\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eInan H, Upasani K, Chi J et al (2023) Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. Published online Dec 7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2312.06674\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2312.06674\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePhang J, Lampe M, Ahmad L et al Investigating Affective Use and Emotional Well-being on ChatGPT\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFang CM, Liu AR, Danry V et al (2025) How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study. Published online March 21. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2503.17473\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2503.17473\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTkachenko M, Malyuk M, Holmanyuk A, Liubimov N (2025) Label Studio: Data labeling software. Published online 2025 2020. Accessed August 8. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/HumanSignal/label-studio\u003c/span\u003e\u003cspan address=\"https://github.com/HumanSignal/label-studio\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHerrera-Poyatos A, Ser JD, de Prado ML, Wang FY, Herrera-Viedma E, Herrera F (2025) Responsible Artificial Intelligence Systems: A Roadmap to Society\u0026rsquo;s Trust through Trustworthy AI, Auditability, Accountability, and Governance. Published online Febr 4. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2503.04739\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2503.04739\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBoine C (2023) Emotional Attachment to AI Companions and European Law. MIT Case Stud Soc Ethical Responsib Comput 2023;(Winter. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.21428/2c646de5.db67ec7f\u003c/span\u003e\u003cspan address=\"10.21428/2c646de5.db67ec7f\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang Z, Yan R, Francis S et al (2025) Stakeholder-centric participation in large language models enhanced health systems. Npj Health Syst 2(1):22. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s44401-025-00024-5\u003c/span\u003e\u003cspan address=\"10.1038/s44401-025-00024-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYin W, Agarwal V, Jiang A, Zubiaga A, Sastry N (2023) AnnoBERT: Effectively Representing Multiple Annotators\u0026rsquo; Label Choices to Improve Hate Speech Detection. Proc Int AAAI Conf Web Soc Media 17:902\u0026ndash;913. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1609/icwsm.v17i1.22198\u003c/span\u003e\u003cspan address=\"10.1609/icwsm.v17i1.22198\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSchmidhuber M, Kruschwitz U et al (2024) LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection. In: Kumar R, Ojha AKr, Malmasi S, eds. \u003cem\u003eProceedings of the Fourth Workshop on Threat, Aggression \u0026amp; Cyberbullying @ LREC-COLING-2024\u003c/em\u003e. ELRA and ICCL; :37\u0026ndash;51. Accessed August 8, 2025. https://aclanthology.org/2024.trac-1.6/\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eArgyle M, Henderson M, Bond M, Iizuka Y, Contarello A (1986) Cross-Cultural Variations in Relationship Rules. Int J Psychol Published online January 1. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1080/0020759860824759\u003c/span\u003e\u003cspan address=\"10.1080/0020759860824759\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"University of Zurich","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"AI safety, relational risks, large language models, AI companions, chatbots, parasocial, parasocial harm, ","lastPublishedDoi":"10.21203/rs.3.rs-7663121/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7663121/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eAI companions powered by large language models (LLMs) are increasingly integrated into users' daily lives, offering emotional support and companionship. While existing safety systems focus on overt harms, they rarely address early-stage problematic behaviors that can foster unhealthy emotional dynamics, including over-attachment or reinforcement of social isolation. We developed \u003cb\u003eSHIELD\u003c/b\u003e (Supervisory Helper for Identifying Emotional Limits and Dynamics), a LLM-based supervisory system with a specific system prompt that detects and mitigates risky emotional patterns before escalation. SHIELD targets five dimensions of concern: (1) emotional over-attachment, (2) consent and boundary violations, (3) ethical roleplay violations, (4) manipulative engagement, and (5) social isolation reinforcement. These dimensions were defined based on media reports, academic literature, existing AI risk frameworks, and clinical expertise in unhealthy relationship dynamics. To evaluate SHIELD, we created a 100-item synthetic conversation benchmark covering all five dimensions of concern. Testing across five prominent LLMs (GPT-4.1, Claude Sonnet 4, Gemma 3 1B, Kimi K2, Llama Scout 4 17B) showed that the baseline rate of concerning content (10\u0026ndash;16%) was significantly reduced with SHIELD (to 3\u0026ndash;8%), a 50\u0026ndash;79% relative reduction, while preserving 95% of appropriate interactions. The system achieved 59% sensitivity and 95% specificity, with adaptable performance via prompt engineering. This proof-of-concept demonstrates that transparent, deployable supervisory systems can address subtle emotional manipulation in AI companions. Most development materials including prompts, code, and evaluation methods are made available as open source materials for research, adaptation, and deployment.\u003c/p\u003e","manuscriptTitle":"Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-23 14:52:33","doi":"10.21203/rs.3.rs-7663121/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6a538186-0c0f-481d-bbc3-9998c9496113","owner":[],"postedDate":"September 23rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":55048341,"name":"Artificial Intelligence and Machine Learning"},{"id":55048342,"name":"Psychiatry"}],"tags":[],"updatedAt":"2025-09-23T14:52:33+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-23 14:52:33","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7663121","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7663121","identity":"rs-7663121","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: preprint-html ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00