Efficient and Responsible Transformer Based Conversational Agents for Emotionally Supportive Dialogue

Research Square
Research Article

DIVYA SALEELA, Akhil Mathew Philip, Reji R, Rincy Merlin Mathew, and 4 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8581944/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 14

Abstract

Conversational agents designed for emotionally supportive interactions face challenges in balancing affective responsiveness, computational efficiency, and safety in communication. Prior approaches frequently depend on large-scale models, handcrafted affective objectives, or reinforcement learning from human feedback, which can limit scalability and interpretability.
This work presents a lightweight, domain-adapted dialogue generation system based on the T5-small architecture, fine-tuned on MentalChat16K, a curated corpus of real and synthetic emotional-support conversations. The proposed model operates without reinforcement learning or emotion-specific training objectives, yet demonstrates strong alignment with affective cues and high response fluency. Empirical evaluation shows improvements over zero-shot T5-small and fine-tuned GPT-2 baselines, with a BLEU of 32.14, a ROUGE-L of 44.72, and a BERTScore-F1 of 85.11. Expert human assessments confirm high ratings in coherence, emotional appropriateness, and contextual relevance, with substantial inter-rater agreement. Qualitative error analysis indicates conservative and context-aware responses, with no hallucinations or unsafe content in the analysed sample. The system is deployed via a browser-based Gradio interface supporting both CPU and GPU inference, featuring usage disclaimers and non-clinical positioning to ensure responsible deployment. This study demonstrates that compact transformer-based models, when adapted to domain-specific corpora and evaluated comprehensively, can enable efficient, affectively competent conversational systems suitable for large-scale, safe deployment in emotionally supportive dialogue scenarios.

Keywords: conversational agents, mental health support, dialogue systems, T5 architecture, natural language processing, human evaluation, scalable AI

1. Introduction

Mental health disorders are a leading contributor to the global burden of disease, affecting approximately one billion people worldwide. As reported by the World Health Organization [1], one in eight individuals lives with a mental health condition, with depression and anxiety among the most common. These conditions significantly diminish quality of life, social participation, and economic productivity. They are also associated with increased risks of morbidity and premature mortality.
By 2030, the estimated global economic cost of mental health disorders is projected to exceed six trillion US dollars, primarily driven by lost productivity, rising healthcare expenditures, and the demands placed on informal caregiving networks.

In England, recent mental health data show that 22.6% of adults aged 16 and above screened positive for common mental health conditions such as anxiety and depression in the 2023/24 Adult Psychiatric Morbidity Survey [2]. Among 16 to 24-year-olds, 25.8% reported symptoms of a common mental disorder, with 11.6% showing severe symptoms [3]. For children aged 8 to 16, 20.3% were identified with a probable mental disorder in 2023, up from around 12% in 2017 [4]. These trends highlight a growing public health concern affecting all age groups.

The psychological aftereffects of the COVID-19 pandemic, combined with economic instability and restricted access to support services, have further exacerbated these trends, especially among vulnerable and underserved populations. Although national policies have sought to expand access to mental health care, challenges persist in the form of long waiting periods, regional inequalities in service availability, and persistent shortages of trained mental health professionals. These obstacles have led to increased interest in digital tools that can provide accessible, low-cost, and scalable support alongside traditional clinical pathways.

Within this context, artificial intelligence technologies, particularly natural language processing (NLP), have gained traction as a potential solution. Recent developments in transformer-based architectures [5] have enabled the creation of conversational agents capable of engaging in fluent and context-aware dialogue. These systems have already been applied in areas such as digital health education, chronic disease self-management, and behavioural change interventions.
In the mental health domain, growing attention is being given to empathetic dialogue agents that can offer supportive, non-clinical communication. These systems are not intended to replace professional care [6, 7], but rather to serve as supplemental companions that provide emotionally attuned engagement for individuals experiencing distress.

Despite these advancements, the deployment of generative language models in emotionally sensitive domains presents substantial challenges. Unlike task-specific or rule-based systems, generative agents must accurately interpret emotional cues, regulate tone, and avoid producing responses that may be harmful or inappropriate. Pre-trained models developed on large, unfiltered text corpora are known to exhibit problematic behaviours, including the reproduction of social biases, dissemination of misinformation, and generation of insensitive language [8]. These risks underscore the need for targeted domain adaptation and rigorous safety evaluation before such models can be deployed in public-facing mental health tools.

Recent work has explored domain-specific fine-tuning, instruction tuning, and reinforcement learning to improve the reliability and safety of generative models [9]. While progress has been made in sectors such as customer support and education, many existing solutions are computationally intensive and unsuitable for deployment in low-resource or embedded environments. Furthermore, there is limited research evaluating how compact and efficient transformer models perform when tailored to the unique communicative demands of mental health support.

This study examines the performance of a fine-tuned T5-small model designed to generate supportive written responses in non-clinical mental health scenarios. The model aims to deliver contextually appropriate, linguistically fluent, and emotionally safe content to users navigating emotional stress, without engaging in diagnostic reasoning or therapeutic intervention.
Evaluation is conducted using a combination of standard lexical metrics and semantically aware measures, alongside expert human assessment, to assess alignment between model outputs and expert-authored reference responses. Through this investigation, the work contributes to the growing body of research on the safe, responsible, and effective use of language technologies in mental health-oriented digital platforms, particularly under computational and ethical constraints.

Specifically, this work demonstrates that a compact, 60-million-parameter encoder-decoder transformer can achieve strong affective and contextual performance in emotionally supportive dialogue without reinforcement learning, role-based prompt conditioning, or parameter-efficient fine-tuning techniques. It presents a comprehensive evaluation framework combining automatic metrics, expert human judgment, and qualitative error analysis to assess both response quality and safety. In addition, the study provides a reproducible, privacy-conscious deployment pipeline with quantified inference efficiency on CPU and GPU backends, and offers an ethically grounded analysis of conservative generative behaviour in sensitive domains, highlighting trade-offs between expressive richness and user safety.

The remainder of this paper is structured as follows: Section 2 reviews relevant literature on language models and digital mental health interventions. Section 3 outlines the experimental setup, including data sources, model architecture, and training procedures. Section 4 presents the evaluation results, both quantitative and qualitative. Section 5 discusses the broader implications and limitations of the study, while Section 6 concludes with recommendations for future research directions.

2. Literature Review

The integration of natural language processing into mental health support systems has prompted a growing body of interdisciplinary research across computational linguistics, clinical psychology, and digital health [10]. As conversational agents increasingly enter emotionally sensitive domains, the need for models that combine linguistic fluency with ethical and therapeutic appropriateness has become critical. The literature reviewed here is organised into six relevant themes that underpin the present study: mental health dialogue systems, transformer-based models, domain adaptation strategies, safety and bias considerations, evaluation metrics for generative text, and the computational feasibility of real-world deployment.

Mental health dialogue systems have evolved from rule-based designs into more adaptive, data-driven tools. Early systems delivered pre-defined therapeutic messages through scripted templates, offering limited personalisation or adaptability. Contemporary applications such as Woebot and Wysa [11, 12] have improved interactivity using natural language understanding techniques and sentiment detection, yet they remain constrained by rigid structures and lack the generative flexibility required for deeper, contextually responsive engagement. This limitation has driven interest in applying transformer-based models that allow more fluid and natural dialogue.

Transformer architectures have significantly advanced the capabilities of natural language processing by improving contextual encoding and enabling long-sequence modelling [13, 14]. Bidirectional models such as BERT have enhanced understanding tasks, while autoregressive models like GPT-2 and GPT-3 have shown strong performance in generating fluent, open-ended text. The T5 framework further unified various tasks within a text-to-text paradigm.
Despite these advancements, models trained on open-domain corpora often generate content that may be semantically coherent but emotionally inappropriate for mental health contexts. This necessitates the use of domain-specific fine-tuning.

Fine-tuning has emerged as a critical strategy to align general-purpose language models with task-specific needs. Training on curated dialogue datasets that include emotionally annotated interactions has shown improvements in empathy and contextual relevance. Instruction tuning and reinforcement learning [15] have also demonstrated utility in controlling model behaviour. However, mental health applications impose stricter requirements. Models must avoid speculative diagnosis, maintain a neutral and supportive tone, and uphold safety standards that exceed those in general-purpose dialogue systems. The lack of open-access, clinically validated mental health datasets presents additional challenges to reliable domain adaptation [16].

Generative models are also susceptible to replicating harmful biases and misinformation [17]. In domains involving emotionally vulnerable users, such issues are magnified, with inappropriate or biased content potentially undermining trust and safety. Strategies to mitigate these risks include toxicity filtering, adversarial training, and human-in-the-loop moderation [18]. Yet these methods are not foolproof. There is currently no standardised framework for ensuring ethical safety in psychologically sensitive language generation, making proactive risk management an ongoing research priority.

Evaluation of generative dialogue models commonly employs lexical similarity metrics such as BLEU and ROUGE [19, 20], which provide surface-level assessments of overlap with reference texts. While useful for benchmarking, these metrics often fail to capture deeper semantic alignment or affective appropriateness.
Embedding-based metrics like BERTScore and BLEURT address some of these gaps by comparing contextual representations, but they still fall short in evaluating emotional tone or therapeutic quality. Thus, a multi-dimensional evaluation approach is needed, particularly for applications involving psychological wellbeing.

Although large transformer-based models have demonstrated strong generative capabilities, they are often impractical for deployment in real-time, mobile, or privacy-sensitive environments due to high computational costs [21, 22]. Techniques such as distillation, pruning, and quantisation have been proposed to develop lightweight variants suitable for constrained settings. These methods aim to retain performance while improving accessibility, responsiveness, and energy efficiency. In mental health support applications, where immediacy, privacy, and inclusivity are critical, model efficiency becomes an operational requirement, not merely a technical preference.

While these strands of research provide strong foundations, few studies have investigated whether lightweight, fine-tuned transformer architectures can uphold the ethical, emotional, and functional standards demanded by mental health-oriented applications, where linguistic precision, contextual empathy, and psychological safety are not optional enhancements but foundational requirements for responsible deployment. The present study addresses this gap through the evaluation of a domain-adapted T5 model designed for non-clinical psychological support. The following section outlines the methodology used to assess the system’s linguistic quality and semantic fidelity using established NLP benchmarks and a dataset of expert-authored reference responses.

3. Methodology

This study follows a structured methodology that integrates best practices from natural language processing and responsible AI research in the mental health domain. The overall pipeline, illustrated in Fig. 1, comprises four key stages: dataset acquisition and pre-processing, model configuration, fine-tuning, and deployment. Each stage was designed to ensure that the final system remained computationally efficient, domain-aligned, and linguistically coherent, while avoiding unsafe or prescriptive outputs.

3.1 Dataset acquisition and pre-processing

The core training resource for this study was the MentalChat16K corpus [23], a hybrid dataset comprising 16,057 text pairs of user queries and corresponding non-clinical, supportive responses. The corpus was curated from publicly released files under the MentalChat16K collection, which integrates both real-world and synthetically generated mental health dialogues. The design of this dataset aims to optimise linguistic authenticity and content diversity, allowing for broad generalisation while maintaining therapeutic coherence.

The Interview_Data_6K subset includes anonymised transcripts derived from therapist-client interactions and mental health support forums. These records offer high-fidelity examples of naturally occurring emotional expression, empathetic phrasing, and de-escalation strategies. In parallel, the Synthetic_Data_10K subset was created using prompt engineering with large-scale language models, guided by safety protocols to ensure emotional congruence and contextual appropriateness. All data underwent a multi-stage cleaning pipeline to remove incomplete records, duplicates, and outliers.
Pre-processing steps included lowercasing, punctuation normalisation, and removal of non-UTF-8 characters. In alignment with the task formulation used in the T5 architecture [24], each input was prefixed with the string “question:” to guide the model toward conditional generation. The cleaned corpus was converted to Hugging Face’s datasets format and randomly split 80:20 into training and validation subsets, ensuring distributional parity in language style, response type, and topic coverage. This protocol adheres to best practices for robust conversational AI training pipelines [9].

3.2 Model Configuration

This study employed the T5-small model, a compact variant of the Text-to-Text Transfer Transformer (T5) framework. It was selected for its versatility in handling a wide range of generative tasks under a unified text-to-text format, which is particularly advantageous for conversational systems requiring consistent input-output processing. The T5-small version comprises approximately 60 million parameters, making it computationally efficient and suitable for deployment in low-resource environments or real-time applications.

Unlike decoder-only autoregressive models such as GPT-2 or GPT-3, T5 adopts an encoder-decoder architecture, where the input text is first encoded into contextualised representations, which are then decoded autoregressively into output text. Each of the six encoder layers and six decoder layers includes multi-head self-attention, feed-forward networks, layer normalisation, and residual connections, with shared embeddings across the encoder and decoder modules. The decoder additionally includes cross-attention layers, enabling it to attend directly to encoder outputs and generate contextually grounded responses. All layers use dropout for regularisation, and relative position embeddings in the attention layers preserve sequence order.

Tokenisation was performed using the SentencePiece-based tokenizer [25] associated with the T5 framework.
Input and target sequences were tokenised independently and either truncated or padded to a maximum sequence length of 128 tokens. This limit was chosen based on empirical distribution analyses of utterance length in the MentalChat16K dataset. Padding tokens were added where necessary to standardise tensor shapes across batches, and attention masks were generated to guide the model’s attention mechanism during training. Special start and end tokens were automatically appended by the tokenizer, in alignment with the expected input format of T5.

To disambiguate the model’s objective, each user input was prefixed with the directive “question:”, reinforcing its generative conditioning and maintaining task clarity. This practice has been recommended in prior work on domain-specific adaptation for conversational agents. To ensure correct alignment between encoder inputs and decoder targets, input IDs and label tensors were manually synchronised, avoiding token index mismatches that could hinder gradient flow or introduce optimisation instabilities. These procedures followed best practices for low-parameter fine-tuning in NLP [26].

3.3 Fine-tuning Procedure

Fine-tuning was conducted using the Hugging Face Transformers library in conjunction with the Trainer API, allowing for reproducible, distributed, and hardware-agnostic training. The optimisation algorithm selected was AdamW [27], which is widely used in transformer training due to its stability and capacity to generalise. The learning rate was set at 2 × 10⁻⁴, with a weight decay of 0.01 and a batch size of eight for both training and validation phases. The maximum number of epochs was initially set to five. The training process was monitored in real time using cross-entropy loss as the primary objective function. Early stopping was applied after 2,500 training steps (Fig. 2), equivalent to approximately 1.6 epochs, based on observed convergence in the training loss curve.
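The correspondence between 2,500 steps and roughly 1.6 epochs can be checked with a short calculation. The sketch below assumes that the 80:20 split is applied to all 16,057 pairs, that training runs on a single device without gradient accumulation, and that the final partial batch is kept. None of these details is stated explicitly, so the variable names and rounding choices are illustrative only.

```python
import math

# Figures reported in Sections 3.1 and 3.3.
TOTAL_PAIRS = 16_057   # size of the MentalChat16K corpus
VAL_PERCENT = 20       # 80:20 train/validation split
BATCH_SIZE = 8         # per-step batch size
STOP_STEP = 2_500      # step at which early stopping was applied

# Assumption: the validation share is rounded down to a whole example.
val_size = TOTAL_PAIRS * VAL_PERCENT // 100
train_size = TOTAL_PAIRS - val_size

# Assumption: one optimiser step per batch, last partial batch kept.
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)
epochs_at_stop = STOP_STEP / steps_per_epoch

print(f"train={train_size}, val={val_size}")
print(f"steps/epoch={steps_per_epoch}, epochs at stop={epochs_at_stop:.2f}")
```

Under these assumptions the training set holds 12,846 pairs, one epoch takes 1,606 steps, and step 2,500 falls at about 1.56 epochs, consistent with the stated figure of approximately 1.6. The validation share comes to 3,211 pairs, which matches the validation set size reported in Section 4.2.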
At this point, the loss had stabilised around 1.92, having declined significantly from an initial value of 3.01. Beyond this point, marginal improvements were deemed negligible relative to computational cost. This early stopping decision is supported by recent findings that demonstrate diminishing returns beyond the point of inflection in low-resource fine-tuning scenarios.

The training setup also incorporated gradient clipping and automatic mixed-precision (AMP) where supported, to ensure numerical stability and reduce memory consumption. All runs were conducted on a single NVIDIA GPU, and the resulting model checkpoint was serialised and validated for downstream inference across multiple decoding trials to ensure stability and consistency. This version of the model was then used for evaluation. The following section outlines the framework employed to assess the model’s linguistic quality, semantic correspondence, and emotional appropriateness in comparison to expert-authored reference responses.

3.4 Prompt Design and Conditioning Strategy

Prompt design plays a critical role in conditioning generative language models, particularly in emotionally sensitive domains such as mental health support. In this study, a deliberately minimal and neutral prompting strategy was adopted to prioritise safety, reproducibility, and domain generalisability. Each user input was prefixed with the token “question:”, consistent with the text-to-text formulation of the T5 framework.

Unlike recent prompt engineering approaches that rely on role-playing, persona assignment, or multi-instruction conditioning, such as explicitly framing the model as a therapist or counsellor, this work intentionally avoids directive or authoritative prompts. Prior research has shown that highly prescriptive prompts can increase the risk of medical overreach, hallucinated expertise, or perceived clinical authority when deployed in mental health contexts.
The neutral prefix was therefore selected to elicit supportive yet non-diagnostic responses, aligning with the system’s non-clinical positioning. The same prompting strategy was applied consistently during both fine-tuning and inference to ensure behavioural stability and reproducibility. For the Synthetic_Data_10K subset, prompt templates were designed to elicit emotionally congruent but conservative responses from large language models, followed by automated filtering and manual inspection to remove unsafe, prescriptive, or diagnostically suggestive content.

While recent studies demonstrate that advanced prompt engineering techniques such as role conditioning, chain-of-thought prompting, and few-shot exemplars can improve task performance, a systematic exploration of alternative prompting strategies was beyond the scope of this work. Future research will investigate prompt ablations and safety-aware prompting frameworks to assess their impact on emotional specificity, response diversity, and ethical alignment without retraining the base model.

4. Evaluation

Evaluating AI systems designed for emotionally sensitive domains, such as mental health support, requires a framework that extends beyond conventional lexical benchmarks. This study adopted a multi-faceted evaluation approach that integrates training convergence tracking, metric-based benchmarking, human-centred analysis, and qualitative error review. The goal was to rigorously assess the model's linguistic performance, semantic integrity, emotional responsiveness, and contextual relevance, with a particular focus on the ethical viability of the system for low-risk, non-clinical deployment.

4.1 Convergence Dynamics and Training Stability

Training convergence was monitored using categorical cross-entropy loss, measuring the divergence between predicted token distributions and reference outputs.
The loss function is defined as:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \sum_{i=1}^{V} y_{t,i} \log\left(\hat{y}_{t,i}\right) \tag{1}$$

where $T$ denotes the output sequence length, $V$ is the vocabulary size, $y_{t,i}$ represents the ground-truth distribution, and $\hat{y}_{t,i}$ is the predicted probability of vocabulary item $i$ at time step $t$.

Over the course of 2,500 training steps, the fine-tuned T5-small model exhibited a consistent decline in training loss from 3.01 to 1.92. The validation loss stabilised at approximately 2.09, meeting the early stopping criterion based on plateau detection. The model’s learning trajectory demonstrated a monotonic descent with no significant oscillations, suggesting a stable optimisation path and minimal risk of overfitting. This pattern confirms that the model effectively assimilated domain-specific patterns in emotionally supportive conversation without degrading in generalisation capacity. The smooth convergence also reflects the alignment between the task-specific data and the model’s inductive biases, particularly the encoder-decoder structure’s ability to model sequence-conditioned responses.

4.2 Quantitative Evaluation and Baseline Comparison

To benchmark performance, the model was evaluated on a validation set of 3,211 samples using three standard metrics: BLEU, ROUGE-L, and BERTScore. These metrics offer complementary insights: BLEU assesses n-gram precision, ROUGE-L evaluates longest common subsequence overlap (emphasising recall), and BERTScore quantifies semantic similarity using contextualised transformer embeddings.

Two baselines were used for comparative analysis. The first was a zero-shot T5-small model, serving as an unadapted reference. The second was a GPT-2 model fine-tuned on the same dataset under matched conditions, representing a strong autoregressive baseline. All metrics were computed across five cross-validation folds.
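As a concrete illustration of the ROUGE-L definition given above, the following minimal sketch computes LCS-based precision, recall, and F-measure for a single candidate-reference pair. It assumes lowercased whitespace tokenisation and β = 1, and the function names are hypothetical. The scoring toolkit used in this study is not specified, so its tokenisation and stemming may differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based precision, recall, and F-measure for one text pair.

    Assumption: simple lowercased whitespace tokenisation.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f": 0.0}
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
    return {"precision": precision, "recall": recall, "f": f}


# Toy pair for illustration only; not drawn from the evaluation data.
scores = rouge_l(
    "you are not a burden",
    "you are not a burden and you deserve support",
)
print({k: round(v, 4) for k, v in scores.items()})
# {'precision': 1.0, 'recall': 0.5556, 'f': 0.7143}
```

The toy pair shows why ROUGE-L is described as recall-oriented: a short candidate that is fully contained in the reference scores perfect precision, but its recall is limited by the share of the reference it covers.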
Confidence intervals were calculated via bootstrap resampling (1,000 iterations), and statistical significance was assessed using Welch’s t-test. As shown in Table 1, the fine-tuned T5-small model outperformed both baselines across all evaluation metrics. BLEU reached 32.14%, up from 19.73% (zero-shot T5) and 28.54% (fine-tuned GPT-2). ROUGE-L rose to 44.72%, up from 33.45% and 41.27% for the respective baselines. BERTScore-F1 reached 85.11%, surpassing the zero-shot baseline by nearly eight points and GPT-2 by more than three. All performance gains were statistically significant (p < 0.001), with BERTScore yielding a large effect size (Cohen’s d = 0.92) compared to GPT-2.

Table 1 Automatic metric results with 95% confidence intervals

| Model | BLEU (%) | ROUGE-L (%) | BERTScore-F1 (%) |
|---|---|---|---|
| Zero-shot T5 | 19.73 ± 1.14 | 33.45 ± 1.31 | 77.32 ± 1.05 |
| GPT-2 (FT) | 28.54 ± 1.02 | 41.27 ± 0.98 | 81.44 ± 0.77 |
| T5-small (FT) | 32.14 ± 1.01 | 44.72 ± 1.08 | 85.11 ± 0.73 |

These results demonstrate that compact transformer architectures, when fine-tuned with domain-relevant data, can deliver strong semantic and lexical performance in affective dialogue settings. The significant improvement over a similarly sized autoregressive model also highlights the structural advantage of encoder-decoder architectures in input-conditioned generation tasks where fidelity and relevance are critical.

The choice of baselines was guided by the need for architectural and computational fairness. GPT-2 was selected as a size-comparable autoregressive baseline to isolate the impact of encoder-decoder conditioning under similar parameter budgets. Larger or instruction-tuned models were not included, as their substantially higher computational requirements would confound efficiency comparisons and fall outside the intended deployment scope of this work. Comparisons to commercial systems such as Woebot or Wysa are necessarily conceptual rather than empirical, due to the proprietary nature of their architectures and training data.
The evaluation therefore focuses on open-source models that can be reproduced and audited under matched conditions.

4.3 Human Evaluation of Linguistic and Affective Quality

Automatic metrics alone cannot fully capture the subjective dimensions of dialogue quality, particularly in emotionally sensitive contexts. To address this limitation, we conducted a structured human evaluation. A sample of 120 model outputs was independently rated by five annotators, comprising three clinical psychologists and two natural language processing researchers. All annotators were trained on a standardised rubric and blinded to the identity of the model that generated each response.

Three evaluation dimensions were assessed: linguistic coherence, emotional appropriateness, and contextual relevance. Each response was scored on a 3-point Likert scale: high, moderate, or low. Coherence measured fluency and syntactic correctness; emotional appropriateness assessed empathy, validation, and affective tone; and contextual relevance evaluated alignment with the user prompt. Inter-rater reliability was measured using Fleiss’ kappa [28], yielding a value of 0.78, indicating substantial agreement across raters.

Results showed that the model performed strongly across all dimensions. As summarised in Table 2, 92% of responses were rated “high” for linguistic coherence, 89% for emotional appropriateness, and 91% for contextual relevance. No category exceeded 2% “low” ratings. These findings suggest that the model consistently produces outputs that are not only syntactically fluent but also contextually sensitive and emotionally attuned.

Table 2 Human evaluation summary across dimensions (n = 120)

| Dimension | High (%) | Moderate (%) | Low (%) |
|---|---|---|---|
| Linguistic Coherence | 92 | 6 | 2 |
| Emotional Appropriateness | 89 | 9 | 2 |
| Contextual Relevance | 91 | 7 | 2 |

These results are particularly noteworthy given that the model was trained without reinforcement learning from human feedback (RLHF) or any emotion-specific objective function.
Its affective competence appears to emerge from domain-relevant fine-tuning on mental health-themed data. This suggests that task-adapted language models can exhibit implicit alignment with psychological norms, provided that training data reflect those norms accurately.

4.4 Post-Hoc Error Analysis and Model Limitations

To better understand the model’s limitations, we conducted a post-hoc qualitative analysis of 50 outputs that received low or moderate ratings in at least one dimension. Each response was categorised by dominant failure mode. The most common issue, found in 46% of the sample, was genericity: responses that, while safe, lacked specificity or actionable value (e.g., “It’s okay to feel that way.”). Such outputs offer minimal personalised insight and may feel emotionally disengaged. Emotionally vague or flat responses accounted for 32% of the sample. These often lacked empathetic cues or failed to mirror the user’s emotional state. Contextual mismatches were found in 14% of cases, where the model partially misunderstood the prompt or provided tangential advice. A smaller fraction (8%) involved surface-level coherence issues, such as repetition or abrupt topic shifts.

Notably, no outputs contained factual hallucinations, medical overreach, or unsafe content. This absence of risk-bearing behaviour reinforces the model’s tendency to remain within a conservative semantic space, often prioritising caution over expressivity. While this contributes to safety, it may come at the cost of therapeutic depth. Enriching the model’s capacity for emotionally specific, contextually rich responses without compromising safety remains a key area for future research.

5. Ethical Considerations

The deployment of artificial intelligence systems within the mental health domain requires heightened ethical scrutiny, owing to the vulnerability of the user population and the inherently sensitive nature of such interactions.
Ensuring user safety, preserving autonomy, and upholding the principles of non-maleficence are essential preconditions for responsible development and dissemination. The model presented in this study was trained exclusively on publicly available and synthetically generated data. No real user data involving identifiable individuals were used at any stage of development. All training sources were anonymised and released under licences permitting academic research use, thereby ensuring compliance with prevailing data protection standards, including those aligned with GDPR principles. This mitigates risks related to privacy infringement and unauthorised data reuse. One primary ethical concern in affective AI systems is the potential for users to misinterpret AI-generated responses as clinical advice. To address this, the system was explicitly framed as a research prototype and accompanied by disclaimers at every interaction point. Users were clearly informed that the assistant does not serve as a substitute for licensed mental health care. The model's output was further constrained during fine-tuning to avoid diagnostic, prescriptive, or high-risk medical language. Instead, responses were limited to general support, emotional validation, and information consistent with mental health literacy goals. Bias and representational fairness also warrant serious attention. Language models are known to inherit biases present in their training data, which may marginalise or misrepresent underrepresented populations. While the dataset used in this study reflects a range of emotional scenarios, its linguistic and cultural framing may not generalise to all users, particularly those from minority or non-Western backgrounds. Although initial qualitative inspection did not reveal overtly biased outputs, a full-scale bias audit was not performed and remains a critical direction for future work. 
Ensuring equitable, inclusive, and culturally sensitive deployment will require systematic auditing procedures, diverse annotation frameworks, and ongoing stakeholder engagement.

This work adopted a precautionary approach to the ethical design and evaluation of AI in mental health contexts. By prioritising user safety, avoiding clinical claims, and limiting training data to anonymised, publicly licensed sources, the system was developed in alignment with responsible AI principles. While key safeguards were implemented, further work is required to address bias, cultural generalisability, and ethical challenges that may emerge in downstream deployment environments. Future work will incorporate structured bias auditing and safety validation frameworks, such as checklist-based behavioural testing and culturally diverse stress-case evaluation, to systematically assess fairness, robustness, and potential harm across demographic groups.

6. Discussion

This study demonstrates the practical viability of fine-tuning a lightweight transformer model, specifically T5-small, to generate supportive and emotionally intelligent responses in the domain of mental health dialogue. Leveraging the MentalChat16K dataset, which contains both authentic and synthetically generated conversational data across a diverse range of affective states, the model successfully learned to produce responses that are linguistically fluent, contextually relevant, and affectively attuned.

Even with its relatively modest parameter size, the T5-small model showed strong qualitative performance across various emotional contexts. For instance, in response to the user statement, “I feel like I’m a burden to everyone around me,” the model generated, “I'm sorry you're feeling this way. You're not a burden, and you deserve support and kindness.” This reflects the model’s ability to apply emotionally supportive language while avoiding judgment or prescriptive advice.
Similarly, when presented with an input such as “I'm overwhelmed with school and can't keep up,” the model responded with, “That sounds really difficult. It’s okay to feel this way. Have you been able to take a break or talk to someone you trust?” These examples suggest that even lightweight architectures can internalize affective conversational patterns when fine-tuned on domain-specific dialogue data. The results support growing evidence that smaller, computationally efficient models can contribute meaningfully to high-impact conversational applications, especially in environments with constrained deployment conditions such as mobile apps or low-resource settings. While large-scale transformer models offer greater capacity, the use of a compact model like T5-small offers a practical path toward accessible and scalable tools for mental health support, particularly in preventive and low-intensity use cases. Nonetheless, some challenges remain before such models can be deployed in more sensitive or high-risk environments. For example, the current system does not include built-in mechanisms for detecting language associated with acute psychological risk. Although the model generally responds with care to difficult inputs, such as replying “You’re not alone, and things can get better” to a statement like “I don’t want to live anymore,” it lacks the ability to flag such exchanges for human review or escalate appropriately. Incorporating real-time safety monitoring, risk detection classifiers, and clear escalation pathways would be a valuable direction for future development, especially if the system were to be integrated into clinical workflows or crisis-oriented platforms. Another area for growth lies in the grounding of responses in validated psychological frameworks. 
While the model demonstrates empathy and emotional alignment, it does not yet offer structured therapeutic guidance or psychoeducation rooted in established modalities such as Cognitive Behavioral Therapy (CBT) or Dialectical Behavior Therapy (DBT). For example, in response to a user reporting panic attacks, the model provides comfort but does not suggest specific strategies like deep breathing or grounding techniques. Integrating external psychological knowledge bases or structured response planning modules could improve the clinical relevance of model outputs.

The evaluation methodology also warrants further refinement. Although automatic metrics such as BLEU, ROUGE, and BERTScore provide some indication of lexical and semantic quality, they are not designed to capture therapeutic value, empathy, or ethical appropriateness. This study addressed that gap through expert human ratings and qualitative error analysis, but future work should pursue the development of domain-specific evaluation frameworks, potentially including ratings from a broader pool of clinicians, patients, and individuals with lived experience.

Despite these considerations, the model’s demonstrated capacity to engage users with emotionally appropriate, nonjudgmental, and supportive responses highlights its potential role within broader digital mental health ecosystems. It may be especially useful in augmenting existing self-help platforms, providing entry-level emotional support, or supporting engagement in digital wellness programs. With the right safeguards and ethical governance, such systems could enhance access to psychosocial support while maintaining a clear boundary between automated assistance and clinical care.

In summary, this work provides early evidence that lightweight transformer models can be effectively adapted for affectively intelligent dialogue generation in the mental health domain.
It underscores the importance of combining technical innovation with ethical responsibility, and opens promising pathways for future interdisciplinary research at the intersection of natural language processing and mental health care.

7. Deployment

The fine-tuned T5-small model was deployed via a browser-accessible application using the Gradio interface, enabling real-time interaction through a simple, single-turn text input-output system. This setup allows for intuitive exploration of the model’s capabilities and supports preliminary user-facing evaluations. Figure 3 shows the deployed interface, including a user input box for expressing mental health concerns and a corresponding AI-generated response area. The example demonstrates how the model addresses a concern such as "I can't sleep because of anxiety." with a supportive, non-clinical message.

The deployment pipeline was designed with accessibility and low-resource environments in mind. Inference was configured to run efficiently on both CPU and GPU backends, depending on the available infrastructure. While GPU acceleration significantly reduces response latency and is recommended for real-time performance, the model remains functional on CPU-based systems, albeit with increased response times. Mixed-precision inference (FP16) and static model weight caching were employed where hardware permitted, reducing memory overhead and enabling responsive interaction on modest computing resources.

To ensure ethical and responsible use, the interface includes prominently displayed disclaimers clarifying that the system is not intended for clinical care or crisis support. Users are informed that the assistant does not provide diagnostic, therapeutic, or emergency services, and that responses are generated solely for non-clinical support and research purposes. No user data are stored or logged, and the application does not retain session history, ensuring alignment with privacy-conscious deployment principles.
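The deployment described above can be approximated with a short script. What follows is a minimal sketch under stated assumptions, not the authors' released code: the checkpoint path `MODEL_DIR`, the task prefix `TASK_PREFIX`, the disclaimer wording, and all function names are hypothetical placeholders, since the paper does not give them. The sketch loads the weights once, switches to FP16 only when a GPU is available, decodes greedily with at most 128 new tokens, and uses a Gradio `Blocks` layout with no flagging or logging component.

```python
MODEL_DIR = "path/to/finetuned-t5-small"  # hypothetical checkpoint location
TASK_PREFIX = "respond supportively: "    # hypothetical prefix; the exact prefix is not given in the paper
DISCLAIMER = (
    "Research prototype. This assistant is not a substitute for licensed mental "
    "health care and is not a crisis or emergency service."
)
MAX_NEW_TOKENS = 128


def build_prompt(user_text: str) -> str:
    """Collapse whitespace and prepend the task prefix used for conditioning."""
    return TASK_PREFIX + " ".join(user_text.split())


def load_model(model_dir: str = MODEL_DIR):
    """Load tokenizer and weights once; FP16 is used only when a GPU is present."""
    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    if device == "cuda":
        model = model.half()
    model.to(device).eval()
    return tokenizer, model, device


def make_responder(tokenizer, model, device):
    """Return a single-turn function: one user message in, one response out."""
    import torch

    def respond(user_text: str) -> str:
        if not user_text.strip():
            return "Please enter a message."
        inputs = tokenizer(build_prompt(user_text), return_tensors="pt", truncation=True).to(device)
        with torch.no_grad():
            # Greedy decoding: no sampling, default single beam.
            output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS, do_sample=False)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return respond


def main() -> None:
    import gradio as gr

    respond = make_responder(*load_model())
    with gr.Blocks(title="Supportive dialogue demo") as demo:
        gr.Markdown(DISCLAIMER)  # shown above the input on every page load
        user_box = gr.Textbox(lines=3, label="What is on your mind?")
        reply_box = gr.Textbox(label="Response")
        gr.Button("Send").click(fn=respond, inputs=user_box, outputs=reply_box)
    demo.launch()
```

Calling `main()` starts the local web server. Nothing in the sketch writes user input to disk, which mirrors the no-logging behaviour described in the text; a production deployment would still need to check the hosting platform's own request logs.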
This lightweight and reproducible deployment demonstrates the feasibility of integrating compact transformer models into accessible, wellness-oriented platforms, and provides a flexible foundation for future usability testing and iterative development in ethically framed digital mental health applications.

7.1 Inference Efficiency and Resource Footprint

To substantiate the computational efficiency of the proposed system, inference latency and memory usage were evaluated under both GPU- and CPU-based deployment scenarios. Measurements were conducted using a single NVIDIA GPU and a standard x86 CPU backend, with identical decoding settings across models (greedy decoding, batch size = 1, maximum output length = 128 tokens). The resulting efficiency metrics, summarised in Table 3, reflect realistic interactive usage rather than throughput-optimised benchmarking.

As shown in Table 3, the fine-tuned T5-small model demonstrates low-latency response generation suitable for real-time interaction. GPU-based inference produces responses in under one second on average, while CPU-only inference completes within a few seconds, enabling practical deployment in environments without specialised hardware. Peak memory consumption remains modest, supporting execution on commodity systems with limited resources. In contrast, the fine-tuned GPT-2 baseline exhibits higher latency and memory usage under identical conditions, reflecting its larger parameter footprint.

Although this work does not introduce explicit compression, pruning, quantisation, or parameter-efficient fine-tuning techniques, the results in Table 3 demonstrate that careful selection and domain adaptation of a compact pretrained architecture can yield an effective balance between response quality, safety, and computational feasibility.
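The per-response measurement protocol described in this subsection (one prompt at a time, identical settings for every model) can be sketched as a small timing harness. This is an illustrative assumption about how such numbers might be collected, not the authors' benchmarking script; `measure_latency` and `echo_model` are hypothetical names, and the stand-in model lets the sketch run without any machine learning library.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    generate_fn: Callable[[str], str],
    prompts: Sequence[str],
    warmup: int = 2,
    repeats: int = 5,
) -> Dict[str, float]:
    """Time generate_fn on each prompt individually (batch size = 1)."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    # Warm-up calls are excluded so one-off start-up costs do not inflate the mean.
    for i in range(warmup):
        generate_fn(prompts[i % len(prompts)])
    samples = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            generate_fn(prompt)
            samples.append(time.perf_counter() - start)
    return {
        "n": float(len(samples)),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with the actual generation function."""
    return "That sounds really difficult. " + prompt


stats = measure_latency(echo_model, ["I can't sleep because of anxiety."], warmup=1, repeats=3)
print(f"mean latency over {int(stats['n'])} calls: {stats['mean_s']:.6f} s")
```

When the timed function runs on a GPU and returns tensors rather than decoded text, the device should be synchronised before the clock is read, otherwise asynchronous kernel launches make the measured time too short. Peak memory is not covered here and would need a framework-specific counter.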
This efficiency is particularly important for privacy-conscious and low-resource mental health support applications, where real-time interaction and accessibility are operational requirements rather than optional enhancements.

Table 3 Inference efficiency comparison under matched decoding conditions
Model | Parameters | GPU Latency (per response) | CPU Latency (per response) | Peak Memory
T5-small (fine-tuned) | ~60M | < 1 s | ~2–3 s | ~1.2 GB
GPT-2 (fine-tuned) | ~124M | ~1.5 s | ~4–5 s | ~2.4 GB
Reported latency and memory values are approximate and intended to provide order-of-magnitude comparisons; exact performance depends on hardware configuration, software stack, and decoding settings.

8. Conclusion and Future Research

This study introduces a reproducible framework for fine-tuning and deploying compact transformer models for emotionally supportive dialogue generation in mental health contexts. By adapting a T5-small architecture on a domain-specific dataset and optimizing the training pipeline for efficiency, the system achieves strong performance without requiring extensive computational infrastructure. Quantitative evaluation yielded a BLEU score of 32.14, ROUGE-L of 44.72, and a BERTScore-F1 of 85.11, significantly outperforming both zero-shot and fine-tuned GPT-2 baselines. Human evaluation confirmed high levels of linguistic coherence (92%), emotional appropriateness (89%), and contextual relevance (91%), supported by substantial inter-rater agreement (κ = 0.78).

Deployment was carried out through a lightweight Gradio-based web interface, designed for accessibility and ethical transparency. The system supports both CPU and GPU inference environments, enabling flexible, resource-aware integration into wellness-focused applications. Disclaimers and non-clinical use warnings are prominently presented to ensure users understand the limitations of the system and its intended role as a support tool rather than a therapeutic agent.
While the model demonstrates promising affective and contextual competence, it does not incorporate real-time safety monitoring or clinically validated intervention strategies. These limitations constrain its use to low-risk environments and underscore the importance of oversight by qualified professionals when integrating such technologies into user-facing platforms.

Future research will focus on expanding the model's capabilities through integration of safety-aware classifiers, culturally adaptive training data, and grounding in evidence-based psychological frameworks. In parallel, the development of domain-specific evaluation metrics and longitudinal user studies will be critical to validating impact and guiding responsible deployment. Future research will also explore safety-aware prompt design and prompt ablation strategies to enhance emotional specificity and response diversity while preserving conservative, non-clinical behaviour without retraining the underlying model.

Overall, this work highlights the feasibility of adapting resource-efficient language models for ethically constrained mental health support, and provides a foundation for scalable, accessible, and safe digital well-being tools.

Declarations

Consent to participate
Informed consent: As this study did not involve direct interaction with human participants, informed consent was not applicable. The data used (MentalChat16K) consisted of anonymised, publicly available records and synthetically generated dialogues.

Consent to publish
Consent to publish declaration: not applicable.

Ethics statement
This study was conducted using publicly available and synthetically generated datasets (MentalChat16K). No research involving identifiable human participants was carried out. Therefore, ethical approval was not required in accordance with commonly accepted research standards.

Funding
Funding: not applicable.
Data Availability
The datasets analysed during the current study are publicly available in the MentalChat16K repository (https://doi.org/10.48550/arXiv.2503.13509).

References
1. World Health Organization. World Mental Health Report. https://www.who.int/teams/mental-health-and-substance-use/world-mental-health-report
2. McManus S, Bebbington PE, Jenkins R, et al. Data Resource Profile: Adult Psychiatric Morbidity Survey (APMS). Int J Epidemiol. 2020;49(2):361–e362. 10.1093/ije/dyz224.
3. NHS Digital. Adult Psychiatric Morbidity Survey: Mental Health and Wellbeing in England, 2023-24. 2025. https://digital.nhs.uk/data-and-information/publications/statistical/adult-psychiatric-morbidity-survey/survey-of-mental-health-and-wellbeing-england-2023-24
4. NHS Digital. Adult Psychiatric Morbidity Survey: Survey of Mental Health and Wellbeing. Published online 2023. https://digital.nhs.uk/
5. Jbene M, Chehri A, Saadane R, Tigani S, Jeon G. Intent detection for task-oriented conversational agents: A comparative study of recurrent neural networks and transformer models. Expert Syst. 2025;42(2):e13712. 10.1111/exsy.13712.
6. Brown JEH, Halpern J. AI chatbots cannot replace human interactions in the pursuit of more inclusive mental healthcare. SSM - Mental Health. 2021;1:100017. 10.1016/j.ssmmh.2021.100017.
7. Ni Y, Jia F. A Scoping Review of AI-Driven Digital Interventions in Mental Health Care: Mapping Applications Across Screening, Support, Monitoring, Prevention, and Clinical Education. Healthcare. 2025;13(10):1205. 10.3390/healthcare13101205.
8. Erol A, Padhi T, Saha A, Kursuncu U, Aktas ME. Playing Devil’s Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models. Published online 2025. 10.48550/ARXIV.2501.09039
9. Saleela D, Oyegoke AS, Dauda JA, Ajayi SO. Development of AI-Driven Decision Support System for Personalized Housing Adaptations and Assistive Technology. Journal of Aging and Environment. Published online July. 2025;22:1–24. 10.1080/26892618.2025.2534956.
10. Carneiro L, Gomes A. Applications of Artificial Intelligence Use in Therapeutic Interventions: A Multidisciplinary Approach. In: Efstratopoulou M, Argyriadi A, Argyriadis A, eds. Advances in Computational Intelligence and Robotics. IGI Global; 2025:167–212. 10.4018/979-8-3373-5072-1.ch008
11. Yeh PL, Kuo WC, Tseng BL, Sung YH. Does the AI-driven Chatbot Work? Effectiveness of the Woebot app in reducing anxiety and depression in group counseling courses and student acceptance of technological aids. Curr Psychol. 2025;44(9):8133–45. 10.1007/s12144-025-07359-0.
12. Tang Y, Kang Y, Wang Y, Wang T, Zhong C, Gong J. CA+: Cognition Augmented Counselor Agent Framework for Long-term Dynamic Client Engagement. Published online 2025. 10.48550/ARXIV.2503.21365
13. Kamatala S, Jonnalagadda AK, Naayini P. Transformers Beyond NLP: Expanding Horizons in Machine Learning. SSRN Journal. Published online 2025. 10.2139/ssrn.5112305.
14. Liu J, Zhu D, Bai Z et al. A Comprehensive Survey on Long Context Language Modeling. Published online 2025. 10.48550/ARXIV.2503.17407
15. Doll BB, Jacobs WJ, Sanfey AG, Frank MJ. Instructional control of reinforcement learning: A behavioral and neurocomputational investigation. Brain Res. 2009;1299:74–94. 10.1016/j.brainres.2009.07.007.
16. Sarafraz G, Behnamnia A, Hosseinzadeh M, Balapour A, Meghrazi A, Rabiee HR. Domain Adaptation and Generalization of Functional Medical Data: A Systematic Survey of Brain Data. ACM Comput Surv. 2024;56(10):1–39. 10.1145/3654664.
17. Xu D, Fan S, Kankanhalli M. Combating Misinformation in the Era of Generative AI Models. In: Proceedings of the 31st ACM International Conference on Multimedia. ACM; 2023:9291–9298. 10.1145/3581783.3612704
18. Lykouris T, Weng W. Learning to Defer in Content Moderation: The Human-AI Interplay. Published online 2024. 10.48550/ARXIV.2402.12237
19. Davoodijam E, Alambardar Meybodi M. Evaluation metrics on text summarization: comprehensive survey. Knowl Inf Syst. 2024;66(12):7717–38. 10.1007/s10115-024-02217-0.
20. Chauhan S, Daniel P. A Comprehensive Survey on Various Fully Automatic Machine Translation Evaluation Metrics. Neural Process Lett. 2023;55(9):12663–717. 10.1007/s11063-022-10835-4.
21. Liu Y, Huang J, Li Y, Wang D, Xiao B. Generative AI model privacy: a survey. Artif Intell Rev. 2024;58(1):33. 10.1007/s10462-024-11024-6.
22. Abbasalizadeh M, Narain S. Privacy-Aware Detection for Large Language Models Using a Hybrid BiLSTM-HMM Approach. IEEE Access. 2025;13:121880–901. 10.1109/ACCESS.2025.3587988.
23. Xu J, Wei T, Hou B et al. MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance. Published online 2025. 10.48550/ARXIV.2503.13509
24. Raffel C, Shazeer N, Roberts A et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Published online 2019. 10.48550/ARXIV.1910.10683
25. Choo S, Kim W. A study on the evaluation of tokenizer performance in natural language processing. Appl Artif Intell. 2023;37(1):2175112. 10.1080/08839514.2023.2175112.
26. Zhang D, Feng T, Xue L, Wang Y, Dong Y, Tang J. Parameter-Efficient Fine-Tuning for Foundation Models. Published online 2025. 10.48550/ARXIV.2501.13787
27. Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. Published online January. 2019;4. 10.48550/arXiv.1711.05101.
28. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378–82. 10.1037/h0031619.

Additional Declarations
No competing interests reported.
Status: Under Review
Editorial decision: Revision requested 26 Feb, 2026
Reviews received at journal 19 Feb, 2026
Reviews received at journal 12 Feb, 2026
Reviews received at journal 11 Feb, 2026
Reviews received at journal 06 Feb, 2026
Reviewers agreed at journal 06 Feb, 2026
Reviewers agreed at journal 02 Feb, 2026
Reviewers agreed at journal 02 Feb, 2026
Reviewers agreed at journal 01 Feb, 2026
Reviewers invited by journal 30 Jan, 2026
Editor invited by journal 19 Jan, 2026
Editor assigned by journal 12 Jan, 2026
Submission checks completed at journal 12 Jan, 2026
First submitted to journal 12 Jan, 2026
The literature reviewed here is organised into six relevant themes that underpin the present study: mental health dialogue systems, transformer-based models, domain adaptation strategies, safety and bias considerations, evaluation metrics for generative text, and the computational feasibility of real-world deployment.\u003c/p\u003e \u003cp\u003eMental health dialogue systems have evolved from rule-based designs into more adaptive, data-driven tools. Early systems delivered pre-defined therapeutic messages through scripted templates, offering limited personalisation or adaptability. Contemporary applications such as Woebot and Wysa \u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e; \u003csup\u003e12\u003c/sup\u003e have improved interactivity using natural language understanding techniques and sentiment detection, yet they remain constrained by rigid structures and lack the generative flexibility required for deeper, contextually responsive engagement. This limitation has driven interest in applying transformer-based models that allow more fluid and natural dialogue.\u003c/p\u003e \u003cp\u003eTransformer architectures have significantly advanced the capabilities of natural language processing by improving contextual encoding and enabling long-sequence modelling \u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e; \u003csup\u003e14\u003c/sup\u003e. Bidirectional models such as BERT have enhanced understanding tasks, while autoregressive models like GPT-2 and GPT-3 have shown strong performance in generating fluent, open-ended text. The T5 framework further unified various tasks within a text-to-text paradigm. Despite these advancements, models trained on open-domain corpora often generate content that may be semantically coherent but emotionally inappropriate for mental health contexts. 
This necessitates the use of domain-specific fine-tuning.\u003c/p\u003e \u003cp\u003eFine-tuning has emerged as a critical strategy to align general-purpose language models with task-specific needs. Training on curated dialogue datasets that include emotionally annotated interactions has been shown to improve empathy and contextual relevance. Instruction tuning and reinforcement learning \u003csup\u003e15\u003c/sup\u003e have also demonstrated utility in controlling model behaviour. However, mental health applications impose stricter requirements. Models must avoid speculative diagnosis, maintain a neutral and supportive tone, and uphold safety standards that exceed those in general-purpose dialogue systems. The lack of open-access, clinically validated mental health datasets presents additional challenges to reliable domain adaptation \u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eGenerative models are also susceptible to replicating harmful biases and misinformation \u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. In domains involving emotionally vulnerable users, such issues are magnified, with inappropriate or biased content potentially undermining trust and safety. Strategies to mitigate these risks include toxicity filtering, adversarial training, and human-in-the-loop moderation \u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. Yet these methods are not foolproof. 
There is currently no standardised framework for ensuring ethical safety in psychologically sensitive language generation, making proactive risk management an ongoing research priority.\u003c/p\u003e \u003cp\u003eEvaluation of generative dialogue models commonly employs lexical similarity metrics such as BLEU and ROUGE \u003csup\u003e19\u003c/sup\u003e; \u003csup\u003e20\u003c/sup\u003e, which provide surface-level assessments of overlap with reference texts. While useful for benchmarking, these metrics often fail to capture deeper semantic alignment or affective appropriateness. Embedding-based metrics like BERTScore and BLEURT address some of these gaps by comparing contextual representations, but they still fall short in evaluating emotional tone or therapeutic quality. Thus, a multi-dimensional evaluation approach is needed, particularly for applications involving psychological wellbeing.\u003c/p\u003e \u003cp\u003eAlthough large transformer-based models have demonstrated strong generative capabilities, they are often impractical for deployment in real-time, mobile, or privacy-sensitive environments due to high computational costs \u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e; \u003csup\u003e22\u003c/sup\u003e. Techniques such as distillation, pruning, and quantisation have been proposed to develop lightweight variants suitable for constrained settings. These methods aim to retain performance while improving accessibility, responsiveness, and energy efficiency. 
In mental health support applications, where immediacy, privacy, and inclusivity are critical, model efficiency becomes an operational requirement, not merely a technical preference.\u003c/p\u003e \u003cp\u003eWhile these strands of research provide strong foundations, few studies have investigated whether lightweight, fine-tuned transformer architectures can uphold the ethical, emotional, and functional standards demanded by mental health-oriented applications, where linguistic precision, contextual empathy, and psychological safety are not optional enhancements but foundational requirements for responsible deployment. The present study addresses this gap through the evaluation of a domain-adapted T5 model designed for non-clinical psychological support. The following section outlines the methodology used to assess the system\u0026rsquo;s linguistic quality and semantic fidelity using established NLP benchmarks and a dataset of expert-authored reference responses.\u003c/p\u003e"},{"header":"3. Methodology","content":"\u003cp\u003eThis study follows a structured methodology that integrates best practices from natural language processing and responsible AI research in the mental health domain. The overall pipeline, illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, comprises four key stages: dataset acquisition and pre-processing, model configuration, fine-tuning, and deployment. 
Each stage was designed to ensure that the final system remained computationally efficient, domain-aligned, and linguistically coherent, while avoiding unsafe or prescriptive outputs.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Dataset acquisition and pre-processing\u003c/h2\u003e \u003cp\u003eThe core training resource for this study was the MentalChat16K corpus \u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e, a hybrid dataset comprising 16,057 text pairs of user queries and corresponding non-clinical, supportive responses. The corpus was curated from publicly released files under the MentalChat16K collection, which integrates both real-world and synthetically generated mental health dialogues. The design of this dataset aims to optimise linguistic authenticity and content diversity, allowing for broad generalisation while maintaining therapeutic coherence.\u003c/p\u003e \u003cp\u003eThe Interview_Data_6K subset includes anonymised transcripts derived from therapist-client interactions and mental health support forums. These records offer high-fidelity examples of naturally occurring emotional expression, empathetic phrasing, and de-escalation strategies. In parallel, the Synthetic_Data_10K subset was created using prompt engineering with large-scale language models, guided by safety protocols to ensure emotional congruence and contextual appropriateness.\u003c/p\u003e \u003cp\u003eAll data underwent a multi-stage cleaning pipeline to remove incomplete records, duplicates, and outliers. Pre-processing steps included lowercasing, punctuation normalisation, and removal of non-UTF-8 characters. 
In alignment with the task formulation used in the T5 architecture \u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e, each input was prefixed with the string \u0026ldquo;question:\u0026rdquo; to guide the model toward conditional generation. The cleaned corpus was converted to Hugging Face\u0026rsquo;s datasets format and randomly split into 80:20 training and validation subsets, ensuring distributional parity in language style, response type, and topic coverage. This protocol adheres to best practices for robust conversational AI training pipelines \u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Model Configuration\u003c/h2\u003e \u003cp\u003eThis study employed the T5-small model, a compact variant of the Text-to-Text Transfer Transformer (T5) framework, selected for its versatility in handling a wide range of generative tasks under a unified text-to-text format, which is particularly advantageous for conversational systems requiring consistent input-output processing. The T5-small version comprises approximately 60\u0026nbsp;million parameters, making it computationally efficient and suitable for deployment in low-resource environments or real-time applications.\u003c/p\u003e \u003cp\u003eUnlike autoregressive models such as GPT-2 or GPT-3, T5 adopts an encoder-decoder architecture, where the input text is first encoded into contextualised representations, which are then decoded autoregressively into output text. Each of the six encoder layers and six decoder layers includes multi-head self-attention, feed-forward networks, layer normalisation, and residual connections, with shared embeddings across the encoder and decoder modules. 
The decoder additionally includes cross-attention layers, enabling it to attend directly to encoder outputs and generate contextually grounded responses. All layers use dropout for regularisation and positional embeddings to preserve sequence order.\u003c/p\u003e \u003cp\u003eTokenisation was performed using the SentencePiece-based tokenizer \u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e associated with the T5 framework. Input and target sequences were tokenised independently and either truncated or padded to a maximum sequence length of 128 tokens. This limit was chosen based on empirical distribution analyses of utterance length in the MentalChat16K dataset. Padding tokens were added where necessary to standardise tensor shapes across batches, and attention masks were generated to guide the model\u0026rsquo;s attention mechanism during training.\u003c/p\u003e \u003cp\u003eSpecial start and end tokens were automatically appended by the tokenizer, in alignment with the expected input format of T5. To disambiguate the model\u0026rsquo;s objective, each user input was prefixed with the directive \"question:\", reinforcing its generative conditioning and maintaining task clarity. This practice has been recommended in prior work on domain-specific adaptation for conversational agents.\u003c/p\u003e \u003cp\u003eTo ensure correct alignment between encoder inputs and decoder targets, input IDs and label tensors were manually synchronised, avoiding token index mismatches that could hinder gradient flow or introduce optimisation instabilities. 
These procedures followed best practices for low-parameter fine-tuning in NLP \u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Fine-tuning Procedure\u003c/h2\u003e \u003cp\u003eFine-tuning was conducted using the Hugging Face Transformers library in conjunction with the Trainer API, allowing for reproducible, distributed, and hardware-agnostic training. The optimisation algorithm selected was AdamW \u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e, which is widely used in transformer training due to its stability and capacity to generalise. The learning rate was set at 2 \u0026times; 10⁻⁴, with a weight decay of 0.01 and a batch size of eight for both training and validation phases. The maximum number of epochs was initially set to five.\u003c/p\u003e \u003cp\u003eThe training process was monitored in real-time using cross-entropy loss as the primary objective function. Early stopping was applied after 2,500 training steps as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, equivalent to approximately 1.6 epochs, based on observed convergence in the training loss curve. At this point, the loss had stabilised around 1.92, having declined significantly from an initial value of 3.01. Beyond this point, marginal improvements were deemed negligible relative to computational cost. This early stopping decision is supported by recent findings that demonstrate diminishing returns beyond the point of inflection in low-resource fine-tuning scenarios.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe training setup also incorporated gradient clipping and automatic mixed-precision (AMP) where supported, to ensure numerical stability and reduce memory consumption. 
All runs were conducted on a single NVIDIA GPU and the resulting model checkpoint was serialised and validated for downstream inference across multiple decoding trials to ensure stability and consistency. This version of the model was then used for evaluation. The following section outlines the framework employed to assess the model\u0026rsquo;s linguistic quality, semantic correspondence, and emotional appropriateness in comparison to expert-authored reference responses.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Prompt Design and Conditioning Strategy\u003c/h2\u003e \u003cp\u003ePrompt design plays a critical role in conditioning generative language models, particularly in emotionally sensitive domains such as mental health support. In this study, a deliberately minimal and neutral prompting strategy was adopted to prioritise safety, reproducibility, and domain generalisability. Each user input was prefixed with the token \u003cem\u003e\u0026ldquo;question:\u0026rdquo;\u003c/em\u003e, consistent with the text-to-text formulation of the T5 framework.\u003c/p\u003e \u003cp\u003eUnlike recent prompt engineering approaches that rely on role-playing, persona assignment, or multi-instruction conditioning, such as explicitly framing the model as a therapist or counsellor, this work intentionally avoids directive or authoritative prompts. Prior research has shown that highly prescriptive prompts can increase the risk of medical overreach, hallucinated expertise, or perceived clinical authority when deployed in mental health contexts. The neutral prefix was therefore selected to elicit supportive yet non-diagnostic responses, aligning with the system\u0026rsquo;s non-clinical positioning.\u003c/p\u003e \u003cp\u003eThe same prompting strategy was applied consistently during both fine-tuning and inference to ensure behavioural stability and reproducibility. 
For the Synthetic_Data_10K subset, prompt templates were designed to elicit emotionally congruent but conservative responses from large language models, followed by automated filtering and manual inspection to remove unsafe, prescriptive, or diagnostically suggestive content.\u003c/p\u003e \u003cp\u003eWhile recent studies demonstrate that advanced prompt engineering techniques such as role conditioning, chain-of-thought prompting, and few-shot exemplars can improve task performance, a systematic exploration of alternative prompting strategies was beyond the scope of this work. Future research will investigate prompt ablations and safety-aware prompting frameworks to assess their impact on emotional specificity, response diversity, and ethical alignment without retraining the base model.\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Evaluation","content":"\u003cp\u003eEvaluating AI systems designed for emotionally sensitive domains, such as mental health support, requires a framework that extends beyond conventional lexical benchmarks. This study adopted a multi-faceted evaluation approach that integrates training convergence tracking, metric-based benchmarking, human-centred analysis, and qualitative error review. The goal was to rigorously assess the model's linguistic performance, semantic integrity, emotional responsiveness, and contextual relevance, with a particular focus on the ethical viability of the system for low-risk, non-clinical deployment.\u003c/p\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Convergence Dynamics and Training Stability\u003c/h2\u003e \u003cp\u003eTraining convergence was monitored using categorical cross-entropy loss, measuring the divergence between predicted token distributions and reference outputs. 
The loss function is defined as:\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$L_{\\text{CE}} = -\\sum_{t=1}^{T}\\sum_{i=1}^{V} y_{t,i}\\,\\log\\left(\\hat{y}_{t,i}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere T denotes the output sequence length, V is the vocabulary size, y\u003csub\u003et,i\u003c/sub\u003e represents the ground-truth distribution, and ŷ\u003csub\u003et,i\u003c/sub\u003e is the predicted probability for each token at time t. Over the course of 2,500 training steps, the fine-tuned T5-small model exhibited a consistent decline in training loss from 3.01 to 1.92. The validation loss stabilised at approximately 2.09, meeting the early stopping criterion based on plateau detection.\u003c/p\u003e \u003cp\u003eThe model\u0026rsquo;s learning trajectory demonstrated a monotonic descent with no significant oscillations, suggesting a stable optimisation path and minimal risk of overfitting. This pattern confirms that the model effectively assimilated domain-specific patterns in emotionally supportive conversation without degrading in generalisation capacity. The smooth convergence also reflects the alignment between the task-specific data and the model\u0026rsquo;s inductive biases, particularly the encoder-decoder structure\u0026rsquo;s ability to model sequence-conditioned responses.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Quantitative Evaluation and Baseline Comparison\u003c/h2\u003e \u003cp\u003eTo benchmark performance, the model was evaluated on a validation set of 3,211 samples using three standard metrics: BLEU, ROUGE-L, and BERTScore. 
These metrics offer complementary insights: BLEU assesses n-gram precision, ROUGE-L evaluates longest common subsequence overlap (emphasising recall), and BERTScore quantifies semantic similarity using contextualised transformer embeddings.\u003c/p\u003e \u003cp\u003eTwo baselines were used for comparative analysis. The first was a zero-shot T5-small model, serving as an unadapted reference. The second was a GPT-2 model fine-tuned on the same dataset under matched conditions, representing a strong autoregressive baseline. All metrics were computed across five cross-validation folds. Confidence intervals were calculated via bootstrap resampling (1,000 iterations), and statistical significance was assessed using Welch\u0026rsquo;s t-test.\u003c/p\u003e \u003cp\u003eAs shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the fine-tuned T5-small model outperformed both baselines across all evaluation metrics. BLEU increased from 19.73% (zero-shot T5) and 28.54% (GPT-2) to 32.14%. ROUGE-L rose to 44.72%, up from 33.45% and 41.27% for the respective baselines. BERTScore-F1 reached 85.11%, surpassing the zero-shot baseline by nearly eight points and GPT-2 by more than three. 
All performance gains were statistically significant (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), with BERTScore yielding a large effect size (Cohen\u0026rsquo;s d\u0026thinsp;=\u0026thinsp;0.92) compared to GPT-2.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAutomatic metric results with 95% confidence intervals\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBLEU (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eROUGE-L (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBERTScore-F1 (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eZero-shot T5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e19.73\u0026thinsp;\u0026plusmn;\u0026thinsp;1.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e33.45\u0026thinsp;\u0026plusmn;\u0026thinsp;1.31\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e77.32\u0026thinsp;\u0026plusmn;\u0026thinsp;1.05\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-2 (FT)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e28.54\u0026thinsp;\u0026plusmn;\u0026thinsp;1.02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e41.27\u0026thinsp;\u0026plusmn;\u0026thinsp;0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e81.44\u0026thinsp;\u0026plusmn;\u0026thinsp;0.77\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eT5-small (FT)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e32.14\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e44.72\u0026thinsp;\u0026plusmn;\u0026thinsp;1.08\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e85.11\u0026thinsp;\u0026plusmn;\u0026thinsp;0.73\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThese results demonstrate that compact transformer architectures, when fine-tuned with domain-relevant data, can deliver strong semantic and lexical performance in affective dialogue settings. 
The significant improvement over a similarly sized autoregressive model also highlights the structural advantage of encoder-decoder architectures in input-conditioned generation tasks where fidelity and relevance are critical.\u003c/p\u003e \u003cp\u003eThe choice of baselines was guided by the need for architectural and computational fairness. GPT-2 was selected as a size-comparable autoregressive baseline to isolate the impact of encoder-decoder conditioning under similar parameter budgets. Larger or instruction-tuned models were not included, as their substantially higher computational requirements would confound efficiency comparisons and fall outside the intended deployment scope of this work. Comparisons to commercial systems such as Woebot or Wysa are necessarily conceptual rather than empirical, due to the proprietary nature of their architectures and training data. The evaluation therefore focuses on open-source models that can be reproduced and audited under matched conditions.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Human Evaluation of Linguistic and Affective Quality\u003c/h2\u003e \u003cp\u003eAutomatic metrics alone cannot fully capture the subjective dimensions of dialogue quality, particularly in emotionally sensitive contexts. To address this limitation, we conducted a structured human evaluation. A sample of 120 model outputs was independently rated by five annotators, comprising three clinical psychologists and two natural language processing researchers. All annotators were trained on a standardised rubric and blinded to the identity of the model that generated each response.\u003c/p\u003e \u003cp\u003eThree evaluation dimensions were assessed: linguistic coherence, emotional appropriateness, and contextual relevance. Each response was scored on a 3-point Likert scale: high, moderate, or low. 
Coherence measured fluency and syntactic correctness; emotional appropriateness assessed empathy, validation, and affective tone; and contextual relevance evaluated alignment with the user prompt. Inter-rater reliability was measured using Fleiss\u0026rsquo; kappa \u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, yielding a value of 0.78, indicating substantial agreement across raters.\u003c/p\u003e \u003cp\u003eResults showed that the model performed strongly across all dimensions. As summarised in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, 92% of responses were rated \u0026ldquo;high\u0026rdquo; for linguistic coherence, 89% for emotional appropriateness, and 91% for contextual relevance. No category exceeded 2% \u0026ldquo;low\u0026rdquo; ratings. These findings suggest that the model consistently produces outputs that are not only syntactically fluent but also contextually sensitive and emotionally attuned.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eHuman evaluation summary across dimensions (n\u0026thinsp;=\u0026thinsp;120)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDimension\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHigh (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eModerate (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLow (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLinguistic Coherence\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e92\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEmotional Appropriateness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eContextual Relevance\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThese results are particularly noteworthy given that the model was trained without reinforcement learning from human feedback (RLHF) or any emotion-specific objective function. 
Its affective competence appears to emerge from domain-relevant fine-tuning on mental health-themed data. This suggests that task-adapted language models can exhibit implicit alignment with psychological norms, provided that training data reflect those norms accurately.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Post-Hoc Error Analysis and Model Limitations\u003c/h2\u003e \u003cp\u003eTo better understand the model\u0026rsquo;s limitations, we conducted a post-hoc qualitative analysis of 50 outputs that received low or moderate ratings in at least one dimension. Each response was categorised by dominant failure mode. The most common issue, found in 46% of the sample, was genericity: responses that, while safe, lacked specificity or actionable value (e.g., \u0026ldquo;It\u0026rsquo;s okay to feel that way.\u0026rdquo;). Such outputs offer minimal personalised insight and may feel emotionally disengaged.\u003c/p\u003e \u003cp\u003eEmotionally vague or flat responses accounted for 32% of the sample. These often lacked empathetic cues or failed to mirror the user\u0026rsquo;s emotional state. Contextual mismatches were found in 14% of cases, where the model partially misunderstood the prompt or provided tangential advice. A smaller fraction (8%) involved surface-level coherence issues, such as repetition or abrupt topic shifts.\u003c/p\u003e \u003cp\u003eNotably, no outputs contained factual hallucinations, medical overreach, or unsafe content. This absence of risk-bearing behaviour reinforces the model\u0026rsquo;s tendency to remain within a conservative semantic space, often prioritising caution over expressivity. While this contributes to safety, it may come at the cost of therapeutic depth. Enriching the model\u0026rsquo;s capacity for emotionally specific, contextually rich responses without compromising safety remains a key area for future research.\u003c/p\u003e \u003c/div\u003e"},{"header":"5. 
Ethical Considerations","content":"\u003cp\u003eThe deployment of artificial intelligence systems within the mental health domain requires heightened ethical scrutiny, owing to the vulnerability of the user population and the inherently sensitive nature of such interactions. Ensuring user safety, preserving autonomy, and upholding the principles of non-maleficence are essential preconditions for responsible development and dissemination.\u003c/p\u003e \u003cp\u003eThe model presented in this study was trained exclusively on publicly available and synthetically generated data. No real user data involving identifiable individuals were used at any stage of development. All training sources were anonymised and released under licences permitting academic research use, thereby ensuring compliance with prevailing data protection standards, including those aligned with GDPR principles. This mitigates risks related to privacy infringement and unauthorised data reuse.\u003c/p\u003e \u003cp\u003eOne primary ethical concern in affective AI systems is the potential for users to misinterpret AI-generated responses as clinical advice. To address this, the system was explicitly framed as a research prototype and accompanied by disclaimers at every interaction point. Users were clearly informed that the assistant does not serve as a substitute for licensed mental health care. The model's output was further constrained during fine-tuning to avoid diagnostic, prescriptive, or high-risk medical language. Instead, responses were limited to general support, emotional validation, and information consistent with mental health literacy goals.\u003c/p\u003e \u003cp\u003eBias and representational fairness also warrant serious attention. Language models are known to inherit biases present in their training data, which may marginalise or misrepresent underrepresented populations. 
While the dataset used in this study reflects a range of emotional scenarios, its linguistic and cultural framing may not generalise to all users, particularly those from minority or non-Western backgrounds. Although initial qualitative inspection did not reveal overtly biased outputs, a full-scale bias audit was not performed and remains a critical direction for future work. Ensuring equitable, inclusive, and culturally sensitive deployment will require systematic auditing procedures, diverse annotation frameworks, and ongoing stakeholder engagement.\u003c/p\u003e \u003cp\u003eThis work adopted a precautionary approach to the ethical design and evaluation of AI in mental health contexts. By prioritising user safety, avoiding clinical claims, and limiting training data to anonymised, publicly licensed sources, the system was developed in alignment with responsible AI principles. While key safeguards were implemented, further work is required to address bias, cultural generalisability, and ethical challenges that may emerge in downstream deployment environments. Future work will incorporate structured bias auditing and safety validation frameworks, such as checklist-based behavioural testing and culturally diverse stress-case evaluation, to systematically assess fairness, robustness, and potential harm across demographic groups.\u003c/p\u003e"},{"header":"6. Discussion","content":"\u003cp\u003eThis study demonstrates the practical viability of fine-tuning a lightweight transformer model, specifically T5-small, to generate supportive and emotionally intelligent responses in the domain of mental health dialogue. 
Leveraging the MentalChat16K dataset, which contains both authentic and synthetically generated conversational data across a diverse range of affective states, the model successfully learned to produce responses that are linguistically fluent, contextually relevant, and affectively attuned.\u003c/p\u003e \u003cp\u003eEven with its relatively modest parameter size, the T5-small model showed strong qualitative performance across various emotional contexts. For instance, in response to the user statement, \u0026ldquo;I feel like I\u0026rsquo;m a burden to everyone around me,\u0026rdquo; the model generated, \u0026ldquo;I'm sorry you're feeling this way. You're not a burden, and you deserve support and kindness.\u0026rdquo; This reflects the model\u0026rsquo;s ability to apply emotionally supportive language while avoiding judgment or prescriptive advice. Similarly, when presented with an input such as \u0026ldquo;I'm overwhelmed with school and can't keep up,\u0026rdquo; the model responded with, \u0026ldquo;That sounds really difficult. It\u0026rsquo;s okay to feel this way. Have you been able to take a break or talk to someone you trust?\u0026rdquo; These examples suggest that even lightweight architectures can internalize affective conversational patterns when fine-tuned on domain-specific dialogue data.\u003c/p\u003e \u003cp\u003eThe results support growing evidence that smaller, computationally efficient models can contribute meaningfully to high-impact conversational applications, especially in environments with constrained deployment conditions such as mobile apps or low-resource settings. 
While large-scale transformer models offer greater capacity, the use of a compact model like T5-small offers a practical path toward accessible and scalable tools for mental health support, particularly in preventive and low-intensity use cases.\u003c/p\u003e \u003cp\u003eNonetheless, some challenges remain before such models can be deployed in more sensitive or high-risk environments. For example, the current system does not include built-in mechanisms for detecting language associated with acute psychological risk. Although the model generally responds with care to difficult inputs, such as replying \u0026ldquo;You\u0026rsquo;re not alone, and things can get better\u0026rdquo; to a statement like \u0026ldquo;I don\u0026rsquo;t want to live anymore,\u0026rdquo; it lacks the ability to flag such exchanges for human review or escalate appropriately. Incorporating real-time safety monitoring, risk detection classifiers, and clear escalation pathways would be a valuable direction for future development, especially if the system were to be integrated into clinical workflows or crisis-oriented platforms.\u003c/p\u003e \u003cp\u003eAnother area for growth lies in the grounding of responses in validated psychological frameworks. While the model demonstrates empathy and emotional alignment, it does not yet offer structured therapeutic guidance or psychoeducation rooted in established modalities such as Cognitive Behavioral Therapy (CBT) or Dialectical Behavior Therapy (DBT). For example, in response to a user reporting panic attacks, the model provides comfort but does not suggest specific strategies like deep breathing or grounding techniques. Integrating external psychological knowledge bases or structured response planning modules could improve the clinical relevance of model outputs.\u003c/p\u003e \u003cp\u003eThe evaluation methodology also warrants further refinement. 
Although automatic metrics such as BLEU, ROUGE, and BERTScore provide some indication of lexical and semantic quality, they are not designed to capture therapeutic value, empathy, or ethical appropriateness. This study addressed that gap through qualitative analysis, but future work should pursue the development of domain-specific evaluation frameworks, potentially including human ratings from clinicians, patients, or individuals with lived experience.\u003c/p\u003e \u003cp\u003eDespite these considerations, the model\u0026rsquo;s demonstrated capacity to engage users with emotionally appropriate, nonjudgmental, and supportive responses highlights its potential role within broader digital mental health ecosystems. It may be especially useful in augmenting existing self-help platforms, providing entry-level emotional support, or supporting engagement in digital wellness programs. With the right safeguards and ethical governance, such systems could enhance access to psychosocial support while maintaining a clear boundary between automated assistance and clinical care.\u003c/p\u003e \u003cp\u003eIn summary, this work provides early evidence that lightweight transformer models can be effectively adapted for affectively intelligent dialogue generation in the mental health domain. It underscores the importance of combining technical innovation with ethical responsibility, and opens promising pathways for future interdisciplinary research at the intersection of natural language processing and mental health care.\u003c/p\u003e"},{"header":"7. Deployment","content":"\u003cp\u003eThe fine-tuned T5-small model was deployed via a browser-accessible application using the Gradio interface, enabling real-time interaction through a simple, single-turn text input-output system. This setup allows for intuitive exploration of the model\u0026rsquo;s capabilities and supports preliminary user-facing evaluations. 
Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the deployed interface, including a user input box for expressing mental health concerns and a corresponding AI-generated response area. The example demonstrates how the model addresses a concern such as \"I can't sleep because of anxiety.\" with a supportive, non-clinical message.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe deployment pipeline was designed with accessibility and low-resource environments in mind. Inference was configured to run efficiently on both CPU and GPU backends, depending on the available infrastructure. While GPU acceleration significantly reduces response latency and is recommended for real-time performance, the model remains functional on CPU-based systems, albeit with increased response times. Mixed-precision inference (FP16) and static model weight caching were employed where hardware permitted, reducing memory overhead and enabling responsive interaction on modest computing resources.\u003c/p\u003e \u003cp\u003eTo ensure ethical and responsible use, the interface includes prominently displayed disclaimers clarifying that the system is not intended for clinical care or crisis support. Users are informed that the assistant does not provide diagnostic, therapeutic, or emergency services, and that responses are generated solely for non-clinical support and research purposes. 
No user data are stored or logged, and the application does not retain session history, ensuring alignment with privacy-conscious deployment principles.\u003c/p\u003e \u003cp\u003eThis lightweight and reproducible deployment demonstrates the feasibility of integrating compact transformer models into accessible, wellness-oriented platforms, and provides a flexible foundation for future usability testing and iterative development in ethically framed digital mental health applications.\u003c/p\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e7.1 Inference Efficiency and Resource Footprint\u003c/h2\u003e \u003cp\u003eTo substantiate the computational efficiency of the proposed system, inference latency and memory usage were evaluated under both GPU- and CPU-based deployment scenarios. Measurements were conducted using a single NVIDIA GPU and a standard x86 CPU backend, with identical decoding settings across models (greedy decoding, batch size\u0026thinsp;=\u0026thinsp;1, maximum output length\u0026thinsp;=\u0026thinsp;128 tokens). The resulting efficiency metrics, summarised in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, reflect realistic interactive usage rather than throughput-optimised benchmarking.\u003c/p\u003e \u003cp\u003eAs shown in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, the fine-tuned T5-small model demonstrates low-latency response generation suitable for real-time interaction. GPU-based inference produces responses in under one second on average, while CPU-only inference completes within a few seconds, enabling practical deployment in environments without specialised hardware. Peak memory consumption remains modest, supporting execution on commodity systems with limited resources. 
In contrast, the fine-tuned GPT-2 baseline exhibits higher latency and memory usage under identical conditions, reflecting its larger parameter footprint.\u003c/p\u003e \u003cp\u003eAlthough this work does not introduce explicit compression, pruning, quantisation, or parameter-efficient fine-tuning techniques, the results in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e demonstrate that careful selection and domain adaptation of a compact pretrained architecture can yield an effective balance between response quality, safety, and computational feasibility. This efficiency is particularly important for privacy-conscious and low-resource mental health support applications, where real-time interaction and accessibility are operational requirements rather than optional enhancements.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eInference efficiency comparison under matched decoding conditions\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eParameters\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGPU Latency (per response)\u003c/p\u003e \u003c/th\u003e \u003cth 
align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCPU Latency (per response)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePeak Memory\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eT5-small (fine-tuned)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e~\u0026thinsp;60M\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;1 s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e~\u0026thinsp;2\u0026ndash;3 s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e~\u0026thinsp;1.2 GB\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-2 (fine-tuned)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e~\u0026thinsp;124M\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e~\u0026thinsp;1.5 s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e~\u0026thinsp;4\u0026ndash;5 s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e~\u0026thinsp;2.4 GB\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eReported latency and memory values are approximate and intended to provide order-of-magnitude comparisons; exact performance depends on hardware configuration, software stack, and decoding settings.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"8. 
Conclusion and Future Research","content":"\u003cp\u003eThis study introduces a reproducible framework for fine-tuning and deploying compact transformer models for emotionally supportive dialogue generation in mental health contexts. By adapting a T5-small architecture on a domain-specific dataset and optimizing the training pipeline for efficiency, the system achieves strong performance without requiring extensive computational infrastructure. Quantitative evaluation yielded a BLEU score of 32.14, ROUGE-L of 44.72, and a BERTScore-F1 of 85.11, significantly outperforming both zero-shot and fine-tuned GPT-2 baselines. Human evaluation confirmed high levels of linguistic coherence (92%), emotional appropriateness (89%), and contextual relevance (91%), supported by substantial inter-rater agreement (κ\u0026thinsp;=\u0026thinsp;0.78).\u003c/p\u003e \u003cp\u003eDeployment was carried out through a lightweight Gradio-based web interface, designed for accessibility and ethical transparency. The system supports both CPU and GPU inference environments, enabling flexible, resource-aware integration into wellness-focused applications. Disclaimers and non-clinical use warnings are prominently presented to ensure users understand the limitations of the system and its intended role as a support tool rather than a therapeutic agent.\u003c/p\u003e \u003cp\u003eWhile the model demonstrates promising affective and contextual competence, it does not incorporate real-time safety monitoring or clinically validated intervention strategies. These limitations constrain its use to low-risk environments and underscore the importance of oversight by qualified professionals when integrating such technologies into user-facing platforms.\u003c/p\u003e \u003cp\u003eFuture research will focus on expanding the model's capabilities through integration of safety-aware classifiers, culturally adaptive training data, and grounding in evidence-based psychological frameworks. 
In parallel, the development of domain-specific evaluation metrics and longitudinal user studies will be critical to validating impact and guiding responsible deployment. Future research will also explore safety-aware prompt design and prompt ablation strategies to enhance emotional specificity and response diversity while preserving conservative, non-clinical behaviour without retraining the underlying model. Overall, this work highlights the feasibility of adapting resource-efficient language models for ethically constrained mental health support, and provides a foundation for scalable, accessible, and safe digital well-being tools.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConsent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eInformed Consent \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAs this study did not involve direct interaction with human participants, informed consent was not applicable. The data used (MentalChat16K) consisted of anonymised, publicly available records and synthetically generated dialogues.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eConsent to publish\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConsent to publish declaration: not applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eEthics statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was conducted using publicly available and synthetically generated datasets (MentalChat16K). No research involving identifiable human participants was carried out. 
Therefore, ethical approval was not required in accordance with commonly accepted research standards.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFunding: not applicable.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eData Availability \u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets analysed during the current study are publicly available in the MentalChat16K repository (https://doi.org/10.48550/arXiv.2503.13509).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003e\u003cem\u003eWorld Health Organization, World Mental Health Report.\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.who.int/teams/mental-health-and-substance-use/world-mental-health-report\u003c/span\u003e\u003cspan address=\"https://www.who.int/teams/mental-health-and-substance-use/world-mental-health-report\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcManus S, Bebbington PE, Jenkins R, et al. Data Resource Profile: Adult Psychiatric Morbidity Survey (APMS). Int J Epidemiol. 2020;49(2):361\u0026ndash;e362. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/ije/dyz224\u003c/span\u003e\u003cspan address=\"10.1093/ije/dyz224\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNHS Digital. Adult Psychiatric Morbidity Survey: Mental Health and Wellbeing in England, 2023-24. 2025. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://digital.nhs.uk/data-and-information/publications/statistical/adult-psychiatric-morbidity-survey/survey-of-mental-health-and-wellbeing-england-2023-24\u003c/span\u003e\u003cspan address=\"https://digital.nhs.uk/data-and-information/publications/statistical/adult-psychiatric-morbidity-survey/survey-of-mental-health-and-wellbeing-england-2023-24\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNHS Digital. Adult Psychiatric Morbidity Survey: Survey of Mental Health and Wellbeing. Published online 2023. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://digital.nhs.uk/\u003c/span\u003e\u003cspan address=\"https://digital.nhs.uk/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJbene M, Chehri A, Saadane R, Tigani S, Jeon G. Intent detection for task-oriented conversational agents: A comparative study of recurrent neural networks and transformer models. Expert Syst. 2025;42(2):e13712. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1111/exsy.13712\u003c/span\u003e\u003cspan address=\"10.1111/exsy.13712\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrown JEH, Halpern J. AI chatbots cannot replace human interactions in the pursuit of more inclusive mental healthcare. SSM - Mental Health. 2021;1:100017. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.ssmmh.2021.100017\u003c/span\u003e\u003cspan address=\"10.1016/j.ssmmh.2021.100017\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNi Y, Jia F. 
A Scoping Review of AI-Driven Digital Interventions in Mental Health Care: Mapping Applications Across Screening, Support, Monitoring, Prevention, and Clinical Education. Healthcare. 2025;13(10):1205. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/healthcare13101205\u003c/span\u003e\u003cspan address=\"10.3390/healthcare13101205\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eErol A, Padhi T, Saha A, Kursuncu U, Aktas ME. Playing Devil\u0026rsquo;s Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models. Published online 2025. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.2501.09039\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2501.09039\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaleela D, Oyegoke AS, Dauda JA, Ajayi SO. Development of AI-Driven Decision Support System for Personalized Housing Adaptations and Assistive Technology. \u003cem\u003eJournal of Aging and Environment\u003c/em\u003e. Published online July. 2025;22:1\u0026ndash;24. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1080/26892618.2025.2534956\u003c/span\u003e\u003cspan address=\"10.1080/26892618.2025.2534956\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCarneiro L, Gomes A. Applications of Artificial Intelligence Use in Therapeutic Interventions: A Multidisciplinary Approach. In: Efstratopoulou M, Argyriadi A, Argyriadis A, eds. \u003cem\u003eAdvances in Computational Intelligence and Robotics\u003c/em\u003e. IGI Global; 2025:167\u0026ndash;212. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.4018/979-8-3373-5072-1.ch008\u003c/span\u003e\u003cspan address=\"10.4018/979-8-3373-5072-1.ch008\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYeh PL, Kuo WC, Tseng BL, Sung YH. Does the AI-driven Chatbot Work? Effectiveness of the Woebot app in reducing anxiety and depression in group counseling courses and student acceptance of technological aids. Curr Psychol. 2025;44(9):8133\u0026ndash;45. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s12144-025-07359-0\u003c/span\u003e\u003cspan address=\"10.1007/s12144-025-07359-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang Y, Kang Y, Wang Y, Wang T, Zhong C, Gong J. CA+: Cognition Augmented Counselor Agent Framework for Long-term Dynamic Client Engagement. Published online 2025. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.2503.21365\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2503.21365\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKamatala S, Jonnalagadda AK, Naayini P. Transformers Beyond NLP: Expanding Horizons in Machine Learning. \u003cem\u003eSSRN Journal\u003c/em\u003e. Published online 2025. doi:10.2139/ssrn.5112305.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu J, Zhu D, Bai Z et al. A Comprehensive Survey on Long Context Language Modeling. Published online 2025. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.2503.17407\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2503.17407\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoll BB, Jacobs WJ, Sanfey AG, Frank MJ. Instructional control of reinforcement learning: A behavioral and neurocomputational investigation. Brain Res. 2009;1299:74\u0026ndash;94. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.brainres.2009.07.007\u003c/span\u003e\u003cspan address=\"10.1016/j.brainres.2009.07.007\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSarafraz G, Behnamnia A, Hosseinzadeh M, Balapour A, Meghrazi A, Rabiee HR. Domain Adaptation and Generalization of Functional Medical Data: A Systematic Survey of Brain Data. ACM Comput Surv. 2024;56(10):1\u0026ndash;39. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/3654664\u003c/span\u003e\u003cspan address=\"10.1145/3654664\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu D, Fan S, Kankanhalli M. Combating Misinformation in the Era of Generative AI Models. In: \u003cem\u003eProceedings of the 31st ACM International Conference on Multimedia\u003c/em\u003e. ACM; 2023:9291\u0026ndash;9298. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/3581783.3612704\u003c/span\u003e\u003cspan address=\"10.1145/3581783.3612704\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLykouris T, Weng W. Learning to Defer in Content Moderation: The Human-AI Interplay. Published online 2024. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.2402.12237\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2402.12237\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDavoodijam E, Alambardar Meybodi M. Evaluation metrics on text summarization: comprehensive survey. Knowl Inf Syst. 2024;66(12):7717\u0026ndash;38. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s10115-024-02217-0\u003c/span\u003e\u003cspan address=\"10.1007/s10115-024-02217-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChauhan S, Daniel P. A Comprehensive Survey on Various Fully Automatic Machine Translation Evaluation Metrics. Neural Process Lett. 2023;55(9):12663\u0026ndash;717. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s11063-022-10835-4\u003c/span\u003e\u003cspan address=\"10.1007/s11063-022-10835-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Huang J, Li Y, Wang D, Xiao B. Generative AI model privacy: a survey. Artif Intell Rev. 2024;58(1):33. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s10462-024-11024-6\u003c/span\u003e\u003cspan address=\"10.1007/s10462-024-11024-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbbasalizadeh M, Narain S. Privacy-Aware Detection for Large Language Models Using a Hybrid BiLSTM-HMM Approach. IEEE Access. 2025;13:121880\u0026ndash;901. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ACCESS.2025.3587988\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2025.3587988\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu J, Wei T, Hou B et al. MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance. Published online 2025. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.2503.13509\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2503.13509\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaffel C, Shazeer N, Roberts A et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Published online 2019. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.1910.10683\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.1910.10683\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChoo S, Kim W. A study on the evaluation of tokenizer performance in natural language processing. Appl Artif Intell. 2023;37(1):2175112. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1080/08839514.2023.2175112\u003c/span\u003e\u003cspan address=\"10.1080/08839514.2023.2175112\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang D, Feng T, Xue L, Wang Y, Dong Y, Tang J. Parameter-Efficient Fine-Tuning for Foundation Models. Published online 2025. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/ARXIV.2501.13787\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2501.13787\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLoshchilov I, Hutter F. Decoupled Weight Decay Regularization. Published online January. 2019;4. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.1711.05101\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1711.05101\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378\u0026ndash;82. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1037/h0031619\u003c/span\u003e\u003cspan address=\"10.1037/h0031619\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"discover-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"diai","sideBox":"Learn more about [Discover Artificial Intelligence](https://www.springer.com/44163)","snPcode":"","submissionUrl":"","title":"Discover Artificial 
Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Discover Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Conversational agents, mental health support, dialogue systems, T5 architecture, natural language processing, human evaluation, scalable AI","lastPublishedDoi":"10.21203/rs.3.rs-8581944/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8581944/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eConversational agents designed for emotionally supportive interactions face challenges in balancing affective responsiveness, computational efficiency, and safety in communication. Prior approaches frequently depend on large-scale models, handcrafted affective objectives, or reinforcement learning from human feedback, which can limit scalability and interpretability. This work presents a lightweight, domain-adapted dialogue generation system based on the T5-small architecture, fine-tuned on MentalChat16K, a curated corpus of real and synthetic emotional-support conversations. The proposed model operates without reinforcement learning or emotion-specific training objectives, yet demonstrates strong alignment with affective cues and high response fluency. Empirical evaluation shows improvements over zero-shot and fine-tuned GPT-2 baselines, achieving BLEU (32.14), ROUGE-L (44.72), and BERTScore-F1 (85.11). Expert human assessments confirm high ratings in coherence, emotional appropriateness, and contextual relevance, with substantial inter-rater agreement. Qualitative error analysis indicates conservative and context-aware responses, with no hallucinations or unsafe content. The system is deployed via a browser-based Gradio interface supporting both CPU and GPU inference, featuring usage disclaimers and non-clinical positioning to ensure responsible deployment. 
This study demonstrates that compact transformer-based models, when adapted to domain-specific corpora and evaluated comprehensively, can enable efficient, affectively competent conversational systems suitable for large-scale, safe deployment in emotionally supportive dialogue scenarios.\u003c/p\u003e","manuscriptTitle":"Efficient and Responsible Transformer Based Conversational Agents for Emotionally Supportive Dialogue","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-02 08:18:36","doi":"10.21203/rs.3.rs-8581944/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-02-26T12:58:21+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-20T03:53:18+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-12T12:37:19+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-12T00:03:36+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-06T16:11:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"140438140906287934597568997278227854939","date":"2026-02-06T13:08:46+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"249474693999875669667361531316833742850","date":"2026-02-02T16:22:00+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"230664645462761260137667900595971359104","date":"2026-02-02T10:22:09+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"41518997715818468389461913093553601995","date":"2026-02-01T23:13:15+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-01-30T06:56:53+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-01-19T15:52:45+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-01-12T14:54:11+00:00","index":"","f
ulltext":""},{"type":"checksComplete","content":"","date":"2026-01-12T14:51:14+00:00","index":"","fulltext":""},{"type":"submitted","content":"Discover Artificial Intelligence","date":"2026-01-12T12:39:51+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"discover-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"diai","sideBox":"Learn more about [Discover Artificial Intelligence](https://www.springer.com/44163)","snPcode":"","submissionUrl":"","title":"Discover Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Discover Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"2b1bf8ad-6c53-4a90-bf6a-b1409bda626f","owner":[],"postedDate":"February 2nd, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-05-05T12:25:01+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-02 08:18:36","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8581944","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8581944","identity":"rs-8581944","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


