Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening

doi:10.21203/rs.3.rs-6976450/v1


2025 · doi:10.21203/rs.3.rs-6976450/v1

preprint OA: closed



Research Square

Zhijun Guo, Alvina Lai, Julia Ive, Alexandru Petcu Petcu, Yutong Wang, and 3 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-6976450/v1

This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical).
Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0–10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate that voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.

Biological sciences/Psychology; Health sciences/Health care

Introduction

Depression is a major global health issue characterised by persistent low mood, loss of interest or pleasure in daily activities, and impaired cognitive and emotional functioning 1 . It often results in sleep disturbances, fatigue, social withdrawal, and reduced occupational or academic productivity, imposing significant emotional and economic burdens on individuals and society 2 . The World Health Organization (WHO) estimates that depression affects approximately 3.8% of the global population 1 , yet only about half receive minimally adequate counselling or antidepressant treatment 3 . Delayed identification of depression can exacerbate symptoms, increasing risks for chronic disability and suicide, with over 700,000 individuals dying by suicide annually due to depression 1 . This underscores the critical importance of timely screening and intervention.

Traditional approaches such as psychological counselling and psychiatric assessments typically require trained professionals, extensive time commitments, and substantial financial resources 4 , 5 , posing notable barriers in resource-limited settings and economically disadvantaged populations 6 , 7 .
Additionally, societal stigma associated with mental illness frequently discourages affected individuals from actively seeking care, further impeding timely identification and treatment 4 . The Patient Health Questionnaire-9 (PHQ-9) is one of the most widely used and validated instruments for screening and grading depressive symptoms, with pooled sensitivity and specificity of approximately 88% at the standard cut-off score of 10 8 . It has demonstrated strong validity across diverse populations; however, it is highly contingent on the mode of administration 9 , 10 . Clinician‑guided or semi‑structured delivery detects suicidal ideation and psychiatric comorbidity more reliably than self‑administered completion at home or online, where comprehension, engagement, and health‑literacy levels can vary 11 . Traditional face‑to‑face or paper formats may also feel emotionally taxing, impersonal, and time‑consuming, discouraging candid disclosure and full adherence 12 . In addition, their static, non‑interactive design cannot adjust to users’ fluctuating emotional or cognitive states 13 . Together, these limitations underscore the need for alternative delivery approaches that preserve diagnostic rigour while enhancing usability, engagement, and cultural adaptability. Conversational agents powered by LLMs have emerged as a promising means of addressing limitations in traditional mental health screening 14 . Trained on extensive corpora, LLMs can generate contextually appropriate and syntactically coherent responses, support real-time clarification of user input 15 , 16 , adapt to individual linguistic patterns, and maintain coherence over extended interactions 17 , 18 . These capabilities are particularly valuable in mental health contexts, where communication is often ambiguous, incomplete, or emotionally nuanced 19 , 20 . The integration of LLMs into clinical workflows, however, raises important concerns. 
These include the risk of inaccurate or unsafe outputs, opaque reasoning processes, and lack of real-time oversight in high-risk situations such as suicidal disclosures 18 , 21 . Additional ethical challenges include data privacy, informed consent, and the interpretability of model-generated recommendations 18 . These limitations highlight the need for rigorous validation, transparent design, and appropriate safeguards prior to clinical deployment. Several prior chatbot-based depression screening systems, such as DEPRA 17 , IGOR 22 , Perla 14 , Marcus 23 , and EmoScan 24 have demonstrated initial feasibility using structured frameworks and standardised assessments (e.g., PHQ-9, SIGH-D, IDS-C). DEPRA employs structured conversational flows guided by the SIGH-D and IDS-C scales, enabling natural language responses but relying heavily on predefined conversational intents, which constrain nuanced interaction 17 . IGOR similarly emphasises predictable and structured dialogue paths, explicitly guiding users through the PHQ-9 to minimise conversational ambiguity and potential risks; however, it does not provide real-time interpretative feedback 22 . Perla integrates the PHQ-9 within a structured framework, supporting natural language interaction, yet remains restricted by predefined intents and entities, limiting conversational flexibility 14 . Marcus uses BERT-based classifiers but faces challenges in effectively addressing ambiguous user inputs and providing transparent scoring explanations 23 . EmoScan aims to improve linguistic generalisability through synthetic clinical dialogues, but it does not directly incorporate standardised diagnostic tools such as the PHQ-9 24 . Taken together, these systems made progress yet reveal persistent limitations in their capacity to support flexible dialogue, foster emotional engagement, and deliver transparent explanations, which are key attributes necessary for building user trust and encouraging sustained participation. 
To address these constraints, we developed HopeBot, a voice-interactive chatbot designed to deliver structured PHQ-9 depression screening within a flexible, empathic conversational environment. The system integrates an LLM (GPT‑4o) with retrieval-augmented generation (RAG). This setup enables adaptive interpretation of user input, generation of item-specific clarifications grounded in clinical sources, and enhanced transparency of the interaction 25 . While PHQ-9 remains the core diagnostic framework, HopeBot supports open-ended dialogue before and after formal administration, adapting to users' conversational cues and engagement styles 23 .

We conducted a mixed-methods investigation involving 132 participants from diverse educational and cultural backgrounds. Quantitative analyses examined demographic distributions, internal consistency of PHQ-9 items, and score concordance between self-reported and HopeBot-assisted assessments. Qualitative feedback, obtained through a structured 25-item questionnaire and follow-up interviews, explored perceptions of trust, clarity, comfort, and perceived empathy. These findings provide empirical insight into the feasibility and acceptability of LLM-driven systems as potential adjuncts to traditional depression screening pathways.

Methods

Ethical Approval

This study was reviewed and approved by the University College London (UCL) Research Ethics Committee following submission of a high-risk application (ID: 26133.001). An amendment and extension to the original protocol was subsequently approved, with ethics coverage extended until 29 January 2026. All procedures were conducted in accordance with institutional ethical standards and the principles outlined in the Declaration of Helsinki 27 . Prior to participation, informed consent was obtained from all individuals. The study was also prospectively registered on ClinicalTrials.gov under reference number NCT06801925.
Chatbot System Design

HopeBot was developed as a real-time, voice-interactive assistant for depression screening through naturalistic dialogue. The system integrates GPT-4o with a RAG architecture to support open-domain dialogue while grounding responses in clinically relevant content 28 , including Cognitive Behavioral Therapy (CBT) transcripts, therapists’ guidelines, and helpline directories. The complete system workflow is illustrated in Fig. 1.

The user interface was developed using Streamlit 29 to enable synchronous multimodal input via keyboard or microphone (Fig. 2). Voice input was processed through an automatic speech recognition module, and system responses were synthesised into audio. All components of transcription, generation, and rendering were managed within an asynchronous event loop to preserve natural turn-taking and maintain interactional fluidity. This architecture was adopted to facilitate a seamless user experience while aligning with ethical and clinical communication standards.

The system supported both English and Mandarin through GPT-4o’s native multilingual capabilities. Responses were generated directly in the input language without translation. Mandarin outputs were produced from Chinese prompts, and audio synthesis was handled by a general-purpose text-to-speech engine.

To ground the chatbot’s responses in validated psychological knowledge, we implemented a multi-source RAG layer using LangChain and Chroma 31 . Four primary data sources were assembled:

(i) A curated corpus of 34 anonymised CBT session transcripts compiled from publicly available training materials, including YouTube-based simulations, therapist role-plays, and anonymised transcripts from online repositories.
(ii) The full text of A Therapist’s Guide to Brief CBT, included to ensure coverage of structured, evidence-based strategies 32 .
(iii) Two public corpora integrated to support emotional relevance: ESConv, an English dialogue dataset annotated for user emotions and support strategies 33 ; and PsyQA_example, a Chinese mental health QA corpus covering topics such as depression and anxiety 34 .
(iv) Bilingual helpline directories from the United Kingdom (UK) and China, containing validated contact information and service descriptions.

The CBT vector store integrated publicly accessible materials selected for their structured, clinically grounded nature 35 , 36 , including annotated scripts (e.g., from learn.problemgambling.ca), case dialogues, and educational videos by licensed clinicians. Subtitles from video content were extracted, speaker-segmented, and cleaned. All resources were used solely for research in accordance with their stated terms and screened for alignment with core CBT principles such as Socratic questioning, cognitive restructuring, and behavioural activation 35 .

All documents were pre-processed using a recursive character-level chunking strategy with 512-token segments and a 20% overlap. Text embeddings were generated using the text-embedding-3-small model 37 . At each conversational turn, semantic retrieval was performed in parallel across the three vector stores. The top-ranked passages were concatenated and incorporated into the GPT-4o prompt to generate evidence-informed and contextually appropriate responses. This architecture enabled the chatbot to alternate seamlessly between open-ended therapeutic dialogue and structured screening procedures, while maintaining psychological validity and factual coherence.

The chatbot operated under a structured three-phase protocol: (1) rapport building through open conversation, (2) PHQ-9 administration, and (3) personalised feedback. A mandatory transition to PHQ-9 was enforced within 20 dialogue turns to maintain screening focus.
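The three-phase protocol and its turn limit can be pictured as a small state machine. The sketch below is illustrative only and is not HopeBot's code: the paper reports that this logic lived in the system prompt and was executed by the language model. The class and method names are hypothetical, and the sketch assumes the switch to the PHQ-9 fires on the 20th pre-assessment user turn at the latest.

```python
from enum import Enum


class Phase(Enum):
    RAPPORT = 1    # open conversation
    PHQ9 = 2       # structured screening
    FEEDBACK = 3   # personalised feedback, no turn limit


MAX_RAPPORT_TURNS = 20  # turn limit reported in the paper
PHQ9_ITEMS = 9


class PhaseController:
    """Hypothetical tracker for the current conversational phase."""

    def __init__(self):
        self.phase = Phase.RAPPORT
        self.rapport_turns = 0
        self.items_answered = 0

    def on_user_turn(self, item_scored=False):
        """Update state after one user turn; return the phase for the next reply.

        item_scored is True only when the turn yielded a scorable PHQ-9
        answer; clarification turns leave the item counter unchanged.
        """
        if self.phase is Phase.RAPPORT:
            self.rapport_turns += 1
            if self.rapport_turns >= MAX_RAPPORT_TURNS:
                self.phase = Phase.PHQ9  # mandatory transition
        elif self.phase is Phase.PHQ9:
            if item_scored:
                self.items_answered += 1
            if self.items_answered == PHQ9_ITEMS:
                self.phase = Phase.FEEDBACK
        return self.phase
```

A prompt-based implementation has no explicit counter of this kind, which is one reason a hard limit can feel abrupt to users: the limit depends on turn count rather than on conversational readiness.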
This constraint applied only before assessment; users could continue engaging with the system without dialogue limits following PHQ-9 completion. PHQ-9 items were administered sequentially, and user responses were categorised into standard A–D scoring brackets (0 to 3 points). When responses were ambiguous, the model asked clarifying questions within the conversation rather than forcing a premature classification. Final output included item-level interpretations, a total score, severity classification based on validated PHQ-9 thresholds, and tailored resource recommendations. All classification and clarification logic was embedded within the system prompt and dynamically executed by the language model.

On average, GPT-4o generated each response in 1.47 ± 0.30 seconds, corresponding to brief single-turn responses of 49.2 ± 7.6 tokens, based on 100 representative interactions collected during internal testing. Speech synthesis using OpenAI’s TTS‑1 model required an additional 2.36 ± 0.49 seconds, resulting in a total latency of ~3.83 seconds per user–bot turn.

The prototype was reviewed by four domain experts, including a practising NHS clinical psychiatrist in the UK, two doctoral researchers at UCL, and a licensed mental health counsellor in China. Reviewers noted that the system maintained acceptable response latency and did not disrupt conversational flow. Their feedback also addressed scoring validity, linguistic tone, empathy, and the handling of ambiguous responses, informing iterative refinements before participant deployment.

In addition to its technical functions, the system incorporated safeguards to address ethical, emotional, and data privacy concerns during human–artificial intelligence (AI) interactions. Please refer to Supplementary Material 1 for details.
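The scoring arithmetic behind the final output is simple enough to state in code. The following is a minimal sketch, not the authors' implementation (which relied on the system prompt): it assumes options A to D map in order to 0 to 3 points, and it uses the severity bands listed in Table 1. The function name `score_phq9` is hypothetical.

```python
# Assumed mapping: A-D correspond in order to 0-3 points per item.
OPTION_POINTS = {"A": 0, "B": 1, "C": 2, "D": 3}

# (upper bound of band, label), using the bands reported in Table 1.
SEVERITY_BANDS = [
    (4, "Minimal/None"),
    (9, "Mild"),
    (14, "Moderate"),
    (19, "Moderately Severe"),
    (27, "Severe"),
]


def score_phq9(answers):
    """Return (item_scores, total, severity) for nine A-D answers."""
    if len(answers) != 9:
        raise ValueError("PHQ-9 requires exactly nine item responses")
    item_scores = [OPTION_POINTS[a] for a in answers]
    total = sum(item_scores)
    # First band whose upper bound is not exceeded by the total.
    severity = next(label for upper, label in SEVERITY_BANDS if total <= upper)
    return item_scores, total, severity
```

For example, `score_phq9(list("ABCDABCDA"))` yields a total of 12, which falls in the Moderate band.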
Evaluation: Participant Recruitment and Procedure

To evaluate the performance of HopeBot as a mental health screening tool, we conducted a trial involving a diverse participant sample. This manuscript reports the final analysis of the collected data. Participant recruitment was carried out concurrently in the UK and China, using both online and offline strategies, to ensure demographic and experiential diversity. Recruitment targeted adults aged 18 to 70 years. Advertisements were distributed via social media platforms (Facebook, X, and Xiaohongshu) and printed posters at university buildings and community venues. Interested individuals were instructed to contact the research team directly, upon which they were provided with a participation information pack, including a Participant Information Sheet and a consent form.

After providing informed consent, participants were asked to complete a self-administered PHQ-9 online, serving as a baseline measure. Participants selected either English or Mandarin Chinese according to their language preference. The Chinese version of PHQ‑9 used in this study was based on the validated mainland translation widely adopted in clinical and research settings. They were then invited to interact with HopeBot using either a desktop or mobile device, with the option of submitting inputs via keyboard or microphone. Each interaction lasted approximately 25 minutes.

Following the chatbot session, participants were required to complete a 25-item post-interaction survey (see Supplementary Material 2) covering demographic information, PHQ-9 results, and experiential feedback. The final questionnaire included 5 demographic items, 2 PHQ-9 result entries (self-reported and HopeBot-assisted), and 18 experiential-feedback questions, including Likert-style ratings and open-ended items assessing comfort, empathy, voice clarity, and perceived usefulness. Participants were encouraged to elaborate on their responses by providing reasons or examples.
On average, completing the survey took about 35 minutes. Data were collected between 1 March and 3 April 2025. A total of 191 individuals were initially enrolled. Submissions were excluded if they (i) completed less than 80% of the questionnaire (n = 32), (ii) submitted incoherent or AI-generated responses (n = 12), or (iii) provided non-substantive answers to open-ended questions, such as single-word replies, vague affirmations (e.g., “good” or “helpful”), or content copied from external sources (n = 15). After quality screening, 132 responses were retained for analysis.

Data Analysis Method

Descriptive statistics were generated for all structured survey responses using Python 3.11. To evaluate consistency between self-administered and HopeBot-assisted PHQ-9 scores, we employed a within-subject design. Absolute and signed score differences were computed, and measures of central tendency (mean, median) and dispersion (interquartile range, standard deviation) were reported 38 . A paired t-test and a Wilcoxon signed-rank test were conducted to compare PHQ-9 scores between formats 39 . Spearman's rank correlation 40 and ICC(3,1) 41 were used to assess correlation and agreement between formats, respectively. To explore associations between demographic factors and user ratings across four key outcomes (Q17–Q20), independent samples t-tests and one-way ANOVA were applied, depending on the variable structure 42 . All significance tests were two-sided with an α threshold of 0.05.

Multilevel demographic variables were dichotomised a priori to maintain expected cell counts ≥ 5 (e.g., age ≤ 34 vs ≥ 35 years; ethnicity White vs non-White; education degree vs non-degree). Each demographic factor was cross-tabulated (2 × 2) against three binary endpoints: (i) perceived trustworthiness of PHQ-9 scores, (ii) preferred screening modality, and (iii) intention to recommend or reuse HopeBot.
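The expected-count criterion and the continuity-corrected χ² statistic for a 2 × 2 cross-tabulation can both be computed from the table's margins. The sketch below uses only the standard library and made-up counts; it is not the study's analysis code, and the function names are hypothetical. It relies on the fact that, for one degree of freedom, the χ² upper-tail probability equals erfc(√(x/2)).

```python
import math


def expected_counts(table):
    """Expected cell counts under independence for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def meets_expected_threshold(table, minimum=5.0):
    """True when every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(table) for e in row)


def chi2_yates(table):
    """Yates-corrected chi-square statistic and p-value (df = 1)."""
    expected = expected_counts(table)
    stat = sum(
        (max(abs(o - e) - 0.5, 0.0)) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival, df = 1
    return stat, p_value


# Illustrative counts only (not study data).
example = [[20, 10], [30, 40]]
print(expected_counts(example), meets_expected_threshold(example))
print(chi2_yates(example))
```

When a table fails the threshold check, an exact test is the usual alternative, since the χ² approximation becomes unreliable with sparse cells.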
Pearson’s χ² test with Yates’ correction was used when appropriate 43 ; otherwise, Fisher’s exact test was applied 44 . Statistical significance was set at α = 0.05, with Holm–Bonferroni adjustment for multiple comparisons.

Open-ended responses were thematically analysed using Braun and Clarke’s six-phase framework 30 (see Fig. 1). Coding was conducted inductively by the first author to allow themes to emerge from the data. To ensure analytic rigour, a second qualitative researcher (KL) independently reviewed the codes. Inter-coder agreement was 86%, indicating good consistency. Discrepancies in code assignment or theme mapping were resolved through discussion until consensus was reached. A full codebook outlining code definitions, inclusion criteria, and exemplar quotes is provided in Supplementary Material 3. Word frequency statistics were computed using Python to support theme validation and lexical salience analysis; the distribution of word frequencies is presented in Supplementary Material 4.

Results

Participant characteristics

Of the 132 participants included in the final analysis, 68 (51.5%) were recruited in the UK. 75% were under 45 years of age, 54.5% identified as female, and 56.1% as Asian or Asian British, while 38.6% identified as White. Most participants held an undergraduate or postgraduate degree (88.7%) and were either in full-time employment (59.1%) or full-time education (22.7%). Familiarity with LLMs was high overall, with 85 participants (64.4%) describing themselves as regular users, and only 2 (1.5%) reporting no prior experience. In total, 56 participants (42.4%) had previously interacted with chatbot technologies; most of them (48/56, 85.7%) reported using general-purpose LLMs (e.g., ChatGPT, Doubao) for emotional disclosure or mental health–related interactions, rather than specialised mental health chatbots. Prior experience with conventional mental health support was reported by 26 participants (19.7%) (see Table 1).

Table 1. Sociodemographic and background characteristics of the survey respondents (N = 132).

Characteristic | Category | n | %
Country of Recruitment | UK | 68 | 51.5
 | China | 64 | 48.5
Age group (years) | 18 – 24 | 27 | 20.5
 | 25 – 34 | 40 | 30.3
 | 35 – 44 | 32 | 24.2
 | 45 – 54 | 19 | 14.4
 | 55 – 64 | 12 | 9.1
 | 65 – 70 | 2 | 1.5
Gender | Female | 72 | 54.5
 | Male | 60 | 45.5
Ethnicity | Asian or Asian British | 74 | 56.1
 | White | 51 | 38.6
 | Black / Black British / Caribbean | 4 | 3.0
 | Mixed / Multiple groups | 2 | 1.5
 | Prefer not to say | 1 | 0.8
Highest education | Undergraduate degree | 81 | 61.4
 | Post-graduate degree (Master’s/PhD) | 36 | 27.3
 | Further education (e.g., A-levels/NVQ) | 13 | 9.9
 | No formal qualification / Prefer not to say | 2 | 1.5
Employment status | Full-time employment | 78 | 59.1
 | Full-time education/training | 30 | 22.7
 | Part-time employment | 12 | 9.1
 | Looking after home | 5 | 3.8
 | Other / Retired | 6 | 4.5
 | Prefer not to say | 1 | 0.8
Familiarity with LLMs* | Regular user | 85 | 64.4
 | Heard of / tried once | 29 | 22.0
 | Occasional user | 8 | 6.1
 | Technical expert | 8 | 6.1
 | No experience | 2 | 1.5
Mental health chatbot experience | Yes | 56 | 42.4
 | No | 76 | 57.6
Previous mental health support experience | Yes | 26 | 19.7
 | No | 106 | 80.3
PHQ-9 Severity Result – Self-report | Minimal/None (0–4) | 51 | 38.6
 | Mild (5–9) | 42 | 31.8
 | Moderate (10–14) | 25 | 18.9
 | Moderately Severe (15–19) | 9 | 6.8
 | Severe (20–27) | 5 | 3.8
PHQ-9 Severity Result – HopeBot | Minimal/None (0–4) | 48 | 36.4
 | Mild (5–9) | 47 | 35.6
 | Moderate (10–14) | 24 | 18.2
 | Moderately Severe (15–19) | 9 | 6.8
 | Severe (20–27) | 4 | 3.0

*LLM = large language model.

Chatbot System Design

While administering the PHQ-9, HopeBot actively sought clarification when user input was vague or non-categorical. For example, responses such as 'maybe sometimes?' triggered follow-up prompts offering standardised response options. This mechanism improved scoring accuracy and reduced the risk of misclassification. However, its effectiveness depended on user engagement and could be limited by cognitive load, language barriers, or low responsiveness, highlighting a trade-off between flexibility and robustness in automated screening.
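The clarification behaviour can be illustrated with a deliberately simple stand-in. HopeBot delegated this judgement to GPT-4o through its system prompt; the sketch below replaces the model with plain phrase matching against the four standard PHQ-9 response options, purely to show the control flow (score when the answer is unambiguous, otherwise ask again). The function name and the wording of the follow-up prompt are hypothetical.

```python
# Standard PHQ-9 response options and their point values.
PHQ9_OPTIONS = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}


def interpret_reply(reply):
    """Return (score, None) for a clear answer, or (None, follow_up) otherwise.

    Phrase matching is a crude placeholder for the language model's
    interpretation; it only demonstrates the score-or-clarify decision.
    """
    text = reply.lower()
    matched = [pts for phrase, pts in PHQ9_OPTIONS.items() if phrase in text]
    if len(matched) == 1:
        return matched[0], None
    options = ", ".join(phrase.capitalize() for phrase in PHQ9_OPTIONS)
    follow_up = (
        "Thanks for sharing. Over the last two weeks, "
        "which of these fits best: " + options + "?"
    )
    return None, follow_up
```

With this stand-in, a reply of 'maybe sometimes?' matches no option and produces the follow-up prompt, while 'Nearly every day, honestly' is scored as 3 without further questioning.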
Following completion, the system generated a structured summary comprising item-level scores, overall severity classification, and general resource recommendations. Representative outputs illustrating responses to crisis language, ambiguous input, and summary generation are shown in Fig. 2(b,c,d). Feedback was designed to be emotionally sensitive and clinically interpretable. While participants generally found the summaries clear and supportive, the recommendations remained generic and did not incorporate prior psychiatric history or comorbidities, reflecting broader limitations in personalisation within scalable AI-driven screening tools.

PHQ-9 Score Concordance

A within-subject comparison was conducted to evaluate alignment between self-administered and HopeBot-assisted PHQ-9 assessments. As shown in Fig. 3, scores were identical across both administrations in 59 participants (44.7%). The median absolute difference between scores was 1 point (IQR = 2.00; mean = 1.33), indicating strong overall consistency. The signed difference distribution had a median of 0.00 and a mean of 0.05, suggesting no systematic tendency for HopeBot to over- or underestimate participants’ symptom severity. A paired Wilcoxon signed-rank test confirmed the absence of systematic bias (W = 1304.0, p = .649). Consistency between formats was high: Spearman’s rank correlation coefficient was ρ = 0.92 (p < .001), and the ICC(3,1) was 0.91 (95% CI: 0.88–0.93), indicating excellent agreement in both absolute score magnitude and relative rank order. Despite small score differences, 37 participants (28.0%) were assigned to a different PHQ-9 severity category in the HopeBot-assisted version due to score shifts across categorical cutoffs.

For the subsample of participants who were asked which PHQ-9 result they trusted more, 75 provided qualitative justifications; of these, 55 (73.3%) had discrepant scores across formats, while 20 (26.7%) gave feedback despite reporting identical scores.
The majority (n = 53, 70.7%) expressed greater confidence in the chatbot-assisted result, whereas 14 (18.7%) preferred their self-assessment, and 8 (10.7%) considered both formats equally valid. Participants who preferred HopeBot's result often cited its clearer structure and interpretive scaffolding. The most common rationale (33 mentions, 42.9%) described the chatbot as providing ‘detailed guidance’ or ‘examples that clarified my emotions’. Others highlighted the emotional support HopeBot offered (15 mentions, 19.5%) or its ability to facilitate deeper self-reflection (8 mentions, 10.4%), contrasting with the quicker, more instinctive nature of the self-test.

Conversely, some participants expressed greater trust in their self-administered PHQ-9 scores. The most frequently coded rationale (8 mentions, 10.4%) described the self-assessment as more intuitive and spontaneous, with several responses noting that the chatbot’s guided prompts occasionally encouraged overthinking. Privacy-related discomfort with disclosing sensitive information to an AI system was also reported (5 mentions, 6.5%). Others pointed to technical limitations (3 mentions, 3.9%), including delays in input recognition or submission issues. One mention (1.3%) described reduced concentration due to the slower pacing of the chatbot interaction.

While a chi-square test showed a significant association between self-reported PHQ-9 severity and trust in HopeBot (χ² = 11.65, df = 4, p = 0.020), this was not supported by logistic regression assuming a linear trend (OR = 1.32, 95% CI 0.79–2.20, p = 0.29), suggesting the relationship may be non-monotonic or driven by specific subgroups.

Feedback and User Experience

Participant feedback, coded by mention frequency, highlighted both strengths and limitations of HopeBot. Personalised advice (50 mentions, 17.9%) was the most frequently mentioned strength, followed by emotional support (31 mentions, 11.1%) and prompt response timing (30 mentions, 10.7%).
A few participants also highlighted affirming communication (5 mentions, 1.8%). Criticisms focused on shallow or generic replies (33 mentions, 11.8%) and voice-related issues, including delayed output (10 mentions, 3.6%) and mechanical delivery (8 mentions, 2.9%).

Building on these impressions, participants also evaluated HopeBot’s performance during the PHQ-9 screening phase. The transition from open dialogue to the PHQ-9 was generally well received: 79.5% of all participants described it as natural, and 97.7% found the instructions and questions easy to understand. However, 33.3% of participants (n = 44) requested clarification on item interpretation or scoring; among them, 93.2% (n = 41/44) found the chatbot’s explanations helpful. While 77.3% of participants characterised the overall interaction as natural, some noted pacing concerns: 17 responses (7.0%) described the transition as abrupt, and 15 (6.1%) mentioned it felt rushed. These findings suggest that fixed dialogue limits, such as the 20-turn threshold before initiating the PHQ-9, may not always align with users’ conversational flow or emotional readiness.

Quantitative ratings reinforced these observations (Fig. 4). On a 0–10 scale, participants rated HopeBot’s handling of sensitive topics at a mean of 7.60 (SD = 1.53), supported by 63 mentions (31.0%) citing its empathic tone and 50 mentions (24.6%) referencing practical guidance. However, concerns were also raised regarding shallow responses (36 mentions, 17.7%), robotic delivery (16 mentions, 7.9%), and repetitive scripted messages (7 mentions, 3.4%). For example, in response to intense emotional disclosures, the chatbot often reiterated that it was not a licensed psychologist and advised users to seek professional help.

HopeBot’s capacity to facilitate emotional expression without judgment received a higher mean rating of 8.44 (SD = 1.53). This was frequently attributed to perceived confidentiality and a non-intrusive communication style.
Anonymity was referenced in 72 mentions (38.7%), while 24 mentions (12.9%) highlighted its neutral and non-moralising language.

Perceived usefulness of the chatbot’s advice was moderately high, with a mean score of 7.36 (SD = 2.06). Many participants reported that the recommendations were clear and actionable (72 mentions, 35.0%). In contrast, 43 mentions (20.9%) described the content as overly generic or lacking in depth. One participant noted that while the guidance was accurate, its similarity to publicly available information reduced its perceived value.

HopeBot’s voice output was generally well received, with a mean clarity rating of 7.73 (SD = 1.49). Positive feedback most frequently cited clear pronunciation (117 mentions, 33.0%) and an empathetic, human-like tone (45 mentions, 13.0%). Criticisms centred on slow or inaccurate speech recognition (32 mentions, 9.3%) and limited personalisation (25 mentions, 7.2%). Additionally, 45.5% of participants (n = 60) preferred reading the on-screen transcript over listening to the full audio, citing greater convenience and discretion.

Across all demographic comparisons (Table 2), no statistically significant associations were found between age, gender, ethnicity, education level, or PHQ-9 severity (both self-reported and HopeBot-assisted) and any of the four HopeBot ratings. Employment status produced a significant omnibus effect for the perceived helpfulness of HopeBot’s recommendations (Q19: F = 3.20, p = .006), whereas its impact on the remaining dimensions was nonsignificant (Q17, Q18, Q20: p > .25). Follow-up contrasts indicated that the difference reflected variability among employment sub-groups rather than a uniform shift across the full sample.
Participants who had prior experience with mental-health treatment gave slightly lower Q19 scores than those without such experience (t = –2.65, p = .012); their ratings of handling sensitive topics (Q17), comfort expressing feelings (Q18), and voice clarity (Q20) did not differ (p ≥ .10). Previous use of mental-health chatbots was unrelated to any rating (all p ≥ .15). Taken together, perceptions of HopeBot were largely stable across demographic groups, with the sole notable finding being reduced perceived helpfulness of recommendations among participants in certain employment categories and among those who had already engaged with mental-health services. Preferences for interaction modality varied. Just over half of the participants (51.5%) preferred text-based communication, citing convenience, reduced transcription errors, and greater suitability for private contexts. In comparison, 40.9% favoured voice-based interaction, highlighting its interactivity and perceived naturalness. A smaller subset (7.6%) reported no clear preference. Table. 2 . Association between demographic characteristics and HopeBot user ratings (Q17–Q20). Notes. Q17 = handling of sensitive depression topics; Q18 = comfort expressing feelings without judgment; Q19 = helpfulness of recommendations; Q20 = clarity and tone of voice output. All values are rounded to three significant figures. Bold indicates p < 0.05. 
| Demographic variable | Test | Q17 statistic (F or t) | Q17 p-value | Q18 statistic (F or t) | Q18 p-value | Q19 statistic (F or t) | Q19 p-value | Q20 statistic (F or t) | Q20 p-value |
|---|---|---|---|---|---|---|---|---|---|
| Age group (6 levels) | ANOVA | 0.559 | 0.731 | 1.37 | 0.241 | 1.43 | 0.219 | 1.67 | 0.147 |
| Gender (2 levels) | t test | 0.696 | 0.488 | 0.186 | 0.853 | 1.72 | 0.087 | 1.22 | 0.227 |
| Ethnicity (5 levels) | ANOVA | 0.889 | 0.473 | 0.398 | 0.810 | 0.821 | 0.514 | 0.709 | 0.587 |
| Highest education (4 levels) | ANOVA | 0.174 | 0.951 | 0.831 | 0.508 | 1.18 | 0.323 | 0.607 | 0.658 |
| Employment status (6 levels) | ANOVA | 0.661 | 0.681 | 1.32 | 0.253 | 3.20 | **0.006** | 1.18 | 0.320 |
| Mental health support experience (binary) | t test | -0.637 | 0.528 | -1.53 | 0.137 | -2.65 | **0.012** | -1.14 | 0.261 |
| Chatbot experience (binary) | t test | -1.44 | 0.153 | 0.380 | 0.705 | -0.859 | 0.392 | 0.134 | 0.894 |
| PHQ-9 Severity Result – HopeBot (5 levels) | ANOVA | 1.23 | 0.300 | 0.462 | 0.763 | 1.57 | 0.188 | 1.72 | 0.150 |
| PHQ-9 Severity Result – Self-test (5 levels) | ANOVA | 1.40 | 0.238 | 0.284 | 0.888 | 0.823 | 0.513 | 0.394 | 0.813 |

Perceived Acceptability and Adoption Intentions

Participant preferences for PHQ-9 administration formats varied. A majority (n = 92, 69.7%) favoured the chatbot-assisted version over self-completion, citing greater engagement (39 mentions, 20.2%), emotionally supported and interactive communication (36 mentions, 18.7%), and real-time interpretive scaffolding (32 mentions, 16.6%). In contrast, participants who preferred self-administration emphasised its efficiency (20 mentions, 10.4%) and its perceived suitability for situations where users were not experiencing immediate emotional distress (13 mentions, 6.7%). A small subset (n = 6, 4.6%) expressed no clear preference. Preference was associated with employment status overall (χ² = 21.69, df = 12, p = .041), but no single employment subgroup (all p > 0.5) showed a statistically significant odds ratio relative to others, suggesting weak or diffuse effects.
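The association between format preference and employment status was assessed with a chi-square test. The sketch below shows, under stated assumptions, how the statistic and its degrees of freedom are obtained from a contingency table of observed counts. The counts are invented for illustration (the cell counts are not reported here), and `chi_square_independence` is a hypothetical helper name rather than the authors' code.

```python
def chi_square_independence(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: rows = employment groups, columns = preferred format
# (chatbot, self-test, no preference). Not study data.
observed = [
    [20, 6, 2],
    [30, 8, 1],
    [42, 6, 3],
]
chi2, df = chi_square_independence(observed)
print(f"chi2 = {chi2:.2f}, df = {df}")
```

A significant overall statistic with no significant subgroup contrast, as reported above, is what one would expect when small deviations are spread across many cells rather than concentrated in one.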
Although only 19.7% (n = 26) of respondents reported prior engagement with professional mental health services, all participants were invited to reflect on HopeBot’s performance relative to mental health counselling. Consistent with earlier themes, participants positively appraised the chatbot’s structured questioning and empathic tone (each 38 mentions, 12.3%). However, limitations were frequently noted, including perceptions of insufficient human-likeness (58 mentions, 18.7%), emotional shallowness or detachment (42 mentions, 13.5%), and overly generic or impersonal responses lacking individual tailoring (19 mentions, 6.1%).

Despite some concerns, the majority of participants (n = 115, 87.1%) expressed willingness to use or recommend HopeBot in the future. Many highlighted its broader potential in mental health screening, particularly for early detection via algorithmic pattern recognition (56 mentions, 22.0%) and supportive, emotionally responsive communication (32 mentions, 12.6%). Other frequently cited advantages included immediate accessibility (15 mentions, 5.9%), rapid response time (14 mentions, 5.5%), and anonymous interaction (8 mentions, 3.1%). Several participants (n = 12, 4.7%) stressed that such tools should augment, not replace, professional care. Concerns centred on the need for clinical validation (4 mentions, 1.6%), potential diagnostic unreliability (3 mentions, 1.2%), and data privacy risks (11 mentions, 4.4%). A small number expressed cautious optimism, emphasising the importance of ethical governance and integration into trusted health systems (4 mentions, 1.6%). Willingness to recommend was significantly lower among participants with prior mental health service experience (Fisher’s exact OR = 0.22, CI 0.05–0.92, p = .0497), suggesting that those with firsthand experience may apply more critical standards in evaluating AI-based tools.
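To make the Fisher's exact comparison above concrete, the following is a minimal sketch of a two-sided exact test on a 2×2 table, together with the sample odds ratio. The 2×2 counts behind the reported OR = 0.22 are not given in this section, so the counts below are invented for illustration, and `fisher_exact_2x2` is a hypothetical helper. It is also an assumption that the sample odds ratio (ad/bc) is the relevant estimator; some packages report a conditional maximum-likelihood estimate instead, which can differ slightly.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample_odds_ratio, p_value). The p-value sums the hypergeometric
    probabilities of all tables (with the same margins) that are no more
    likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Hypothetical counts, not study data:
# rows = prior service experience (yes, no); columns = would recommend (yes, no).
odds_ratio, p_value = fisher_exact_2x2(8, 4, 30, 3)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")
```

An odds ratio below 1 in this layout means the first-row group (here, participants with prior service experience) had lower odds of recommending the tool.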
Discussion

The present study demonstrates that a GPT-4o-powered, voice-interactive chatbot (HopeBot) can feasibly administer the PHQ-9. HopeBot-assisted and self-administered scores showed high concordance (ICC = 0.91; median absolute difference = 1 point), without systematic bias in symptom severity. Participants positively described the chatbot as timely and accessible, supporting the potential of automated mental health screening beyond clinician-led settings, though broader validation remains necessary.

Although clinician-administered PHQ-9 interviews detect suicidality and comorbid conditions with higher sensitivity [11], self-administered formats remain standard in digital screening, showing acceptable psychometric performance (sensitivity ≈0.80; specificity ≈0.85) [8, 45]. In this study, the self-test served as a pragmatic reference, aligning with real-world usage where individuals commonly complete online questionnaires independently. HopeBot achieved comparable agreement with this benchmark while offering additional benefits such as real-time clarification, empathic support, and increased user engagement.

Beyond score concordance, HopeBot elicited considerable user trust. Among the 75 participants who directly compared the two formats, 70.7% (n = 53/75) expressed greater confidence in the chatbot-assisted scores, attributing this preference to features such as real-time clarification (33 mentions, 42.9%) and an empathic tone (15 mentions, 19.5%). These interactional advantages parallel those of semi-structured clinical interviews, while preserving the scalability, standardisation, and accessibility of automated delivery. However, perceptions of recommendation helpfulness (Q19), a key determinant of user trust, varied across subgroups.
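The agreement figures cited at the start of the Discussion (ICC = 0.91; median absolute difference = 1 point) can be illustrated with a short sketch. This section does not say which ICC form was used, so the code assumes the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), from Shrout and Fleiss [41]. The paired scores are hypothetical and `icc_2_1` is an illustrative helper name.

```python
from statistics import median


def icc_2_1(scores):
    """ICC(2,1) for a list of rows, one row per participant, k scores per row."""
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA decomposition without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical (self-administered, chatbot-assisted) PHQ-9 totals; not study data.
pairs = [(4, 5), (12, 11), (7, 7), (18, 16), (2, 3), (9, 10)]
icc = icc_2_1([list(p) for p in pairs])
median_abs_diff = median(abs(a - b) for a, b in pairs)
print(f"ICC(2,1) = {icc:.2f}, median |difference| = {median_abs_diff}")
```

Because this form penalises systematic differences between the two formats as well as random disagreement, a high value is consistent with the absence of systematic bias noted above.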
Full-time students and individuals managing household responsibilities gave lower ratings than full-time employees (F = 3.20, p = .006), and those with prior mental health service experience rated recommendations less helpful than first-time users (t = –2.65, p = .012). These differences likely reflect heightened expectations shaped by users’ life context and therapeutic background. Students and homemakers, who often manage complex emotional demands with limited external support, may have anticipated greater empathy and personalisation [46, 47]. Similarly, individuals with prior counselling experience may have evaluated responses against professional standards [4], consistent with expectancy-disconfirmation theory [48]. These findings suggest that perceptions of chatbot utility are strongly influenced by users’ prior experiences and situational expectations.

As summarised in Table 3, HopeBot introduces several substantive advancements over earlier PHQ-9 chatbots that rely primarily on Dialogflow-based intent matching (e.g., Perla [14], Marcus [23], DEPRA [17]). By integrating RAG with the GPT-4o architecture, HopeBot supports fully open-ended interaction while meeting the technical constraints of real-time screening. GPT-4o was selected based on three key considerations: (1) independent benchmarks reported the lowest latency among publicly available LLMs (≈0.45 s for text, ≈0.32 s for audio) at that time, outperforming Claude 3 and Gemini 1.5 [50]; (2) its unified multimodal framework eliminates the need for separate ASR–TTS pipelines, which remain necessary for open-source and contemporary commercial alternatives [51]; and (3) its extended context window (100k tokens) and multilingual tokeniser ensure compatibility with the demands of interactive PHQ-9 delivery while preserving clinical safety constraints [52].
Although emerging models such as Claude 3, Gemini 1.5, and Llama-3 warrant future investigation, their current limitations in latency, speech integration, and alignment tooling rendered them suboptimal for the present study. This architecture enables dynamic turn-taking, clarification of ambiguous responses, and seamless support for languages beyond English and Mandarin, capabilities not reported in prior systems. Transparency is further enhanced through item-level scoring and source-linked rationales, features absent in comparators such as IGOR [22] and Marcus [23]. Among participants who engaged with the clarification module, 93.2% indicated that these explanations improved their response accuracy.

User feedback underscores these functional gains. In contrast to Marcus, where only 18.1% of users preferred the chatbot over conventional self-report [23], 69.7% (n = 92) of participants in this study favoured HopeBot, and 87.1% (n = 115) expressed willingness to reuse and recommend the system. Collectively, HopeBot’s integration of low-latency generation, multilingual adaptability, explainable outputs, and improved user engagement positions it as a more transparent and clinically versatile alternative to earlier rule-based tools.

Table 3.
Comparison of automated depression screening tools across key functional dimensions

| Tool | Core architecture | Language flexibility | Screening instrument | Explainability | User study (N) | Empathy/tone | Deployment potential | Key limitation in prior work |
|---|---|---|---|---|---|---|---|---|
| Perla (2020) [14] | Google Dialogflow with ML-based intent classification and Firebase backend | Natural language input with ~200 phrases per item and synonym matching | PHQ-9 (Spanish) | Provides total score, risk status, and resource links at the end | 276 participants; 108 completed both Perla and web-based PHQ-9 | Supportive tone with female persona and encouraging prompts | Web and major messaging platforms (e.g., Messenger, Google Assistant, Telegram) | Limited validation, English-only tools, and low engagement in prior form-based tools |
| Marcus (2023) [23] | Dialogflow intents + BERT model (Node.js / GCP; Kommunicate UI) | Free-text input classified to PHQ-9 | PHQ-9 (English) | Outputs total PHQ-9 score only | 81 U.S. college students (130 enrolled) | Neutral; static male avatar; no empathy modelling | iOS app and web chat prototype | Earlier PHQ-9 chatbots lacked validation with U.S. college students and used only fixed-choice input |
| IGOR (2021) [22] | Dialogflow intents with Node.js + Firebase backend | Button/option input (PHQ-9 scores 0–3) | PHQ-9 | Sends total score to clinician; not shown to user | 10 university staff (usability test) | Neutral; no empathic responses | Prototype within MS self-management app | Results hidden from the user; rule-based flow fails on off-topic input |
| DEPRA (2023) [17] | Dialogflow chatbot with 27 SIGH-D/IDS-C intents | Open-text input with intent matching | SIGH-D + IDS-C | Final score and severity level only; no item-level feedback | 50 Australian adults | Neutral tone; no empathy modelling | Facebook Messenger chatbot prototype | High cognitive load; long completion time |
| EmoScan (2024) [24] | Mistral-7B fine-tuned on synthetic clinical interviews (PsyInterview) | Free-text, multi-turn inputs; fine-tuned LLM | DSM-5-based emotional disorder classification (coarse & fine-grained) | LLM-generated explanations based on DSM-5 criteria | 1,157 synthetic cases; 50 expert-evaluated; GPT-4-based performance evaluation | Simulated empathy assessed by GPT-4 and clinical experts | Research prototype; not deployed clinically | Heavy reliance on synthetic data; limited real-world generalisability |
| Moodpath App (2021) [49] | Smartphone app, 3× daily AA (45 ICD-10 items + mood) | Tap yes/no → 4-level severity; 5-point mood scale | DSM-5-based emotional disorder classification (coarse & fine-grained) | 14-day summary with score, severity band & mood charts | 113 general-population users | Neutral; no empathic dialogue | Live on iOS & Android | Prior tools used retrospective questionnaires with little validation |
| HopeBot (2025) | GPT-4o with RAG | Supports open-ended user input in various languages | PHQ-9 administered via free-text/voice dialogue, with dynamic clarification and fuzzy score interpretation | Provides item-level rationales and context-relevant evidence drawn from curated knowledge bases | 132 participants (Chinese + UK residents) | LLM-generated responses were generally perceived as empathic; mean comfort rating 8.51/10, though variation in emotional tone and delivery style was reported | Available via web interface; compatible with both desktop and mobile devices (iOS, Android); supports both Mandarin and English voice/text modalities | Absence of non-verbal cues, occasional mechanical tone in voice output, and lack of clinical validation for diagnostic reliability |

Despite its technical strengths, HopeBot did not fully replicate the relational depth of professional counselling. A substantial number of participants described the interaction as emotionally flat or impersonal (58 mentions, 18.7%), citing insufficient affective nuance (42 mentions, 13.5%) and reliance on generic responses (19 mentions, 6.1%). These limitations indicate that, even with prompt tuning and retrieval augmentation, simulated empathy remains perceptibly artificial. The findings highlight a persistent gap between the linguistic fluency of LLMs and the emotional authenticity expected in therapeutic dialogue.

More broadly, these limitations reflect structural constraints inherent in current-generation conversational AI. While LLMs can generate fluent and contextually appropriate text, they lack access to non-verbal signals (including tone, facial expression, and posture) that clinicians routinely rely upon to identify distress, hesitancy, or latent risk [18]. Cross-linguistic speech synthesis poses further challenges. Although GPT-4o natively supports Mandarin text, the deployment of a text-to-speech engine optimised for English prosody introduced prosodic inconsistencies that reduced perceived naturalness and constrained affective expressiveness [53]. Culturally adaptive voice synthesis models may be required to ensure emotional fidelity and communicative clarity across diverse linguistic contexts.

Beyond technical limitations, ethical and epistemic challenges must also be addressed.
Repeated exposure to standardised, syntactically polished language may subtly influence how users articulate their experiences [54], potentially narrowing expressive nuance. Emerging evidence further suggests that clinicians may revise their judgments when presented with opaque AI-generated recommendations, even in the absence of a clear clinical rationale [55], raising concerns about automation bias and the erosion of clinical autonomy. Accordingly, systems such as HopeBot should be positioned as adjunctive tools that support, rather than replace, professional expertise [18]. Responsible deployment will require transparent algorithmic logic, explicit clinical oversight, and safeguards that ensure interpretability, accountability, and informed consent. Future research should extend beyond screening accuracy to investigate how such technologies influence therapeutic relationships, user trust, and long-term mental health outcomes.

This study has several limitations. Although chatbot-assisted scores aligned closely with self-reports, this does not establish diagnostic validity, given the lower sensitivity of self-assessments compared to clinician-led evaluations. The sequential completion of both PHQ-9 formats within a single session may have introduced recall bias, with concordance potentially influenced by short-term memory or transient mood states. The sample was skewed toward younger, digitally literate users, limiting generalisability to older or digitally excluded populations. In addition, the controlled testing environment may not reflect naturalistic conditions. All findings are specific to GPT-4o and may not extend to other LLM-based systems.

Future research should evaluate clinical utility, safety, and equity across settings. Multisite trials and randomised comparisons with standard self-assessment tools could clarify HopeBot’s impact on referral accuracy, care access, and clinician workload.
Development of governance frameworks (such as escalation protocols, audit trails, and transparent disclosures) will be essential to meet regulatory standards. Particular attention is needed for high-risk encounters requiring non-verbal cues, and for underserved groups with limited digital access or linguistic mismatch. Cross-cultural validation will also be necessary to determine the applicability of LLM-assisted screening across healthcare systems, including those in the UK and China.

Declarations

Data availability

Restrictions apply to the availability of the full dataset generated and analysed during the current study in order to protect participant privacy; accordingly, these data are not publicly available. However, the custom training data used to develop the RAG component of HopeBot is available at: https://github.com/candiceguo0528/HopeBot-Candice.

Code availability

The code for the analysis is available through a GitHub code repository (https://github.com/candiceguo0528/HopeBot-Candice).

Acknowledgements

The authors gratefully acknowledge the invaluable contributions of the following colleagues to the evaluation of HopeBot: Dr Alexandru Petcu, MD (NHS consultant psychiatrist) for clinical insights and safety guidance; Kai Yao and Zuyu Wang (Post-Graduate Teaching Assistants, UCL Division of Psychiatry) for assistance with study design, participant recruitment and data interpretation; and Wei’an Li (licensed mental-health counsellor, China) for expert feedback on Mandarin content and cultural adaptation. Their support greatly strengthened the rigour and relevance of this work.
Author information

Authors and Affiliations

Institute of Health Informatics, University College London, London, United Kingdom
Zhijun Guo; Alvina Lai; Julia Ive; Yutong Wang; Luyuan Qi; Johan H Thygesen; Kezhi Li

Lancashire and South Cumbria NHS Foundation Trust, Psychiatry Department, Lancashire, UK; University of Medicine and Pharmacy "Victor Babeș", Timișoara, Romania
Alexandru Petcu

Contributions

Conceptualisation: Z.G., K.L.; System development and deployment: Z.G.; Streamlit implementation: Z.G., L.Q.; Participant recruitment and facilitation: Z.G., Y.W.; Qualitative analysis: Z.G.; Secondary validation: K.L.; Writing – original draft: Z.G.; Writing – review and editing: Z.G., K.L., A.L., J.I., J.T.; Clinical evaluation and feedback: A.P.

Corresponding authors

Correspondence to Kezhi Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

References

1. World Health Organisation (WHO). Depressive disorder (depression). https://www.who.int/news-room/fact-sheets/detail/depression (2023).
2. NHS. Symptoms - Depression in adults. https://www.nhs.uk/mental-health/conditions/depression-in-adults/symptoms (2021).
3. Puyat, J. H., Kazanjian, A., Goldner, E. M. & Wong, H. How Often Do Individuals with Major Depression Receive Minimally Adequate Treatment? A Population-Based, Data Linkage Study. Can. J. Psychiatry Rev. Can. Psychiatr. 61, 394–404 (2016).
4. Guo, Z., Lai, A., Deng, Z. & Li, K. Evaluating the Feasibility and Acceptability of a GPT-Based Chatbot for Depression Screening: A Mixed-Methods Study. In Artificial Intelligence in Healthcare (eds. Xie, X., Styles, I., Powathil, G. & Ceccarelli, M.) 249–263 (Springer Nature Switzerland, Cham, 2024).
5. Cook, S. C., Schwartz, A. C. & Kaslow, N. J. Evidence-Based Psychotherapy: Advantages and Challenges. Neurotherapeutics 14, 537–545 (2017).
6. GOV.UK. Health matters: reducing health inequalities in mental illness. https://www.gov.uk/government/publications/health-matters-reducing-health-inequalities-in-mental-illness/health-matters-reducing-health-inequalities-in-mental-illness (2018).
7. Faugno, E. et al. Experiences with diagnostic delay among underserved racial and ethnic patients: a systematic review of the qualitative literature. BMJ Qual. Saf. 34, 190–200 (2025).
8. American Psychological Association. Patient Health Questionnaire (PHQ-9 & PHQ-2). https://www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/patient-health (2011).
9. Lee, P. W., Schulberg, H. C., Raue, P. J. & Kroenke, K. Concordance between the PHQ-9 and the HSCL-20 in depressed primary care patients. J. Affect. Disord. 99, 139–145 (2007).
10. Robinson, J. et al. Why are there discrepancies between depressed patients’ Global Rating of Change and scores on the Patient Health Questionnaire depression module? A qualitative study of primary care in England. BMJ Open 7, e014519 (2017).
11. Eack, S. M., Greeno, C. G. & Lee, B.-J. Limitations of the Patient Health Questionnaire in Identifying Anxiety and Depression: Many Cases Are Undetected. Res. Soc. Work Pract. 16, 625–631 (2006).
12. Morris, R. R., Schueller, S. M. & Picard, R. W. Efficacy of a Web-Based, Crowdsourced Peer-To-Peer Cognitive Reappraisal Platform for Depression: Randomized Controlled Trial. J. Med. Internet Res. 17, e4167 (2015).
13. Mohr, D. C., Burns, M. N., Schueller, S. M., Clarke, G. & Klinkman, M. Behavioral Intervention Technologies: Evidence review and recommendations for future research in mental health. Gen. Hosp. Psychiatry 35, 332–338 (2013).
14. Arrabales, R. Perla: A Conversational Agent for Depression Screening in Digital Ecosystems. Design, Implementation and Validation. Preprint at https://doi.org/10.48550/arXiv.2008.12875 (2021).
15. Maples, B., Cerit, M., Vishwanath, A. & Pea, R. Loneliness and suicide mitigation for students using GPT3-enabled chatbots. Npj Ment. Health Res. 3, 1–6 (2024).
16. Mahbub, M. et al. Decoding substance use disorder severity from clinical notes using a large language model. Npj Ment. Health Res. 4, 1–10 (2025).
17. Kaywan, P., Ahmed, K., Ibaida, A., Miao, Y. & Gu, B. Early detection of depression using a conversational AI bot: A non-clinical trial. PLOS ONE 18, e0279743 (2023).
18. Guo, Z. et al. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment. Health 11, e57400 (2024).
19. Bubeck, S. et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).
20. Fitzpatrick, K. K., Darcy, A. & Vierhile, M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Mental Health 4, e7785 (2017).
21. Montgomery, B. Mother says AI chatbot led her son to kill himself in lawsuit against its maker. The Guardian https://www.theguardian.com/technology/2024/oct/23/character-ai-chatbot-sewell-setzer-death (2024).
22. Giunti, G., Isomursu, M., Gabarron, E. & Solad, Y. Designing Depression Screening Chatbots. In Nurses and Midwives in the Digital Age 259–263 (IOS Press, 2021). https://doi.org/10.3233/SHTI210719.
23. Toulme, P., Nanaw, J. & Apostolellis, P. Marcus: A Chatbot for Depression Screening Based on the PHQ-9 Assessment. In Proceedings of the 16th International Conference on Advances in Computer-Human Interactions (ACHI 2023) 97–105 (IARIA, Venice, 2023).
24. Liu, J. M. et al. Enhanced Large Language Models for Effective Screening of Depression and Anxiety. Preprint at https://doi.org/10.48550/arXiv.2501.08769 (2025).
25. HatchWorks AI. Harnessing RAG in Healthcare: Use-Cases, Impact, & Solutions. https://hatchworks.com/blog/gen-ai/rag-for-healthcare (2024).
26. Thase, M. E., Khazanov, G. & Wright, J. H. Cognitive and behavioral therapies. In Tasman’s Psychiatry (eds Tasman, A. et al.) 1–38 (Springer International Publishing, Cham, 2020). https://doi.org/10.1007/978-3-030-42825-9_35-1.
27. WMA - The World Medical Association. WMA Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Participants. https://www.wma.net/policies-post/wma-declaration-of-helsinki (2025).
28. Gupta, S., Ranjan, R. & Singh, S. N. A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. Preprint at https://doi.org/10.48550/arXiv.2410.12837 (2024).
29. Streamlit • A faster way to build and share data apps. https://streamlit.io (2021).
30. Braun, V. & Clarke, V. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 77–101 (2006).
31. LangChain. Chroma: LangChain integration for vector storage. https://python.langchain.com/docs/integrations/vectorstores/chroma (2025).
32. Cully, J. A. & Teten, A. L. A Therapist’s Guide to Brief Cognitive Behavioral Therapy (U.S. Department of Veterans Affairs, 2008).
33. THU-COAI. Emotional-Support-Conversation. https://github.com/thu-coai/Emotional-Support-Conversation (2025).
34. THU-COAI. PsyQA: PsyQA_example.json. GitHub https://github.com/thu-coai/PsyQA/blob/main/PsyQA_example.json (2025).
35. Fenn, K. & Byrne, M. The key principles of cognitive behavioural therapy. InnovAiT 6, 579–585 (2013).
36. Nakao, M., Shirotsuki, K. & Sugaya, N. Cognitive–behavioral therapy for management of mental health and stress-related disorders: Recent advances in techniques and technologies. Biopsychosoc Med 15, 16 (2021).
37. OpenAI. Vector embeddings - OpenAI API. https://platform.openai.com (2025).
38. Bland, J. M. & Altman, D. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet 327, 307–310 (1986).
39. DATAtab. Wilcoxon Test Tutorial: t-Test, Chi-Square, ANOVA, Regression, Correlation. https://datatab.net/tutorial/wilcoxon-test (2025).
40. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
41. Shrout, P. E. & Fleiss, J. L. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86, 420–428 (1979).
42. Fisher, R. A. Statistical methods for research workers. In Breakthroughs in Statistics: Methodology and Distribution (eds Kotz, S. & Johnson, N. L.) 66–70 (Springer, New York, 1992). https://doi.org/10.1007/978-1-4612-4380-9_6.
43. Laerd Statistics. Chi-Square Test for Association using SPSS Statistics. https://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-using-spss-statistics.php (2018).
44. Technology Networks. The Fisher’s Exact Test. Technology Networks. http://www.technologynetworks.com/tn/articles/the-fishers-exact-test-385738 (2024).
45. Miller, P. et al. The performance and accuracy of depression screening tools capable of self-administration in primary care: A systematic review and meta-analysis. Eur. J. Psychiatry 35, 1–18 (2021).
46. Lattie, E. G. et al. Digital Mental Health Interventions for Depression, Anxiety, and Enhancement of Psychological Well-Being Among College Students: Systematic Review. J. Med. Internet Res. 21, e12869 (2019).
47. Kaplan, V. Mental Health States of Housewives: an Evaluation in Terms of Self-perception and Codependency. Int. J. Ment. Health Addict. 21, 666–683 (2023).
48. Oliver, R. L. A Cognitive Model of the Antecedents and Consequences of Satisfaction Decisions. J. Mark. Res. 17, 460–469 (1980).
49. Burchert, S., Kerber, A., Zimmermann, J. & Knaevelsrud, C. Screening accuracy of a 14-day smartphone ambulatory assessment of depression symptoms and mood dynamics in a general population sample: Comparison with the PHQ-9 depression screening. PLoS One 16, e0244955 (2021).
50. Vellum. Claude 3.5 Sonnet vs GPT-4o. https://www.vellum.ai/blog/claude-3-5-sonnet-vs-gpt4o (2025).
51. OpenAI. Introducing next-generation audio models in the API. https://openai.com/index/introducing-our-next-generation-audio-models (2025).
52. OpenAI. ChatGPT — Release Notes. https://help.openai.com/en/articles/6825453-chatgpt-release-notes?utm_source=chatgpt.com (2025).
53. OpenAI. Text to speech - OpenAI API. https://platform.openai.com (2025).
54. Esmaeilzadeh, P., Mirzaei, T. & Dharanikota, S. Patients’ Perceptions Toward Human–Artificial Intelligence Interaction in Health Care: Experimental Study. J. Med. Internet Res. 23, e25856 (2021).
55. Ryan, K., Yang, H.-J., Kim, B. & Kim, J. P. Assessing the impact of AI on physician decision-making for mental health treatment in primary care. npj Ment. Health Res. 4, 1–8 (2025).

Additional Declarations

No competing interests reported.

Supplementary Files

SupplementaryMaterial1.docx
SupplementaryMaterial2.docx
SupplementaryMaterial3.docx
SupplementaryMaterial4.docx
histograms and boxplots for four HopeBot evaluation items (Q17–Q20).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ea, Stacked bar plot illustrating the frequency distribution of user ratings (0–10) across four evaluation items. Each bar represents the total number of responses at each score point, subdivided by item. Higher scores indicate more positive user evaluations. Numeric labels within each segment indicate absolute counts and corresponding percentages. To maintain legibility and avoid overcrowding, only segments with sufficient height (e.g., ≥2 units) display numeric labels. Low-frequency responses (e.g., ratings 0–4) are fully retained in the underlying analysis but may not be annotated if their bar height falls below the display threshold.\u003c/p\u003e\n\u003cp\u003eb, Boxplots summarising the score distributions for each evaluation item, with accompanying descriptive statistics including minimum, maximum, interquartile range (IQR), median, mean, standard deviation, and 95% confidence interval (CI). These values are displayed to the right of each box for easy comparison.\u003c/p\u003e\n\u003cp\u003eNotes. 
Q17 = handling of sensitive depression topics; Q18 = comfort expressing feelings without judgment; Q19 = helpfulness of recommendations; Q20 = clarity and tone of voice output.\u003c/p\u003e","description":"","filename":"Figure4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6976450/v1/89e6f504f0f4afa5109ee940.jpg"},{"id":94474564,"identity":"0f542def-414c-4f3e-8db8-588ff7ffc9b1","added_by":"auto","created_at":"2025-10-27 15:49:19","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2119978,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6976450/v1/a18cae62-45ee-41d2-afbb-dd266ecfeb3f.pdf"},{"id":88523118,"identity":"59d301d9-08b1-4857-b976-172ee3636528","added_by":"auto","created_at":"2025-08-07 10:02:09","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":17485,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial1.docx","url":"https://assets-eu.researchsquare.com/files/rs-6976450/v1/801010c3a9067dbe0d71c179.docx"},{"id":88523117,"identity":"0fb32d39-9a3b-4a4d-bc4c-c881e58f276d","added_by":"auto","created_at":"2025-08-07 10:02:09","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":21275,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial2.docx","url":"https://assets-eu.researchsquare.com/files/rs-6976450/v1/d31a75c2f209959b99486200.docx"},{"id":88524322,"identity":"df10664f-0574-4c8c-8942-a6e0aa5a5621","added_by":"auto","created_at":"2025-08-07 
10:10:09","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":21498,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial3.docx","url":"https://assets-eu.researchsquare.com/files/rs-6976450/v1/6fb3273587e0b7c5ccfcbca0.docx"},{"id":88523135,"identity":"1cdaebea-b5b6-45f2-9536-56c2b2cc16fa","added_by":"auto","created_at":"2025-08-07 10:02:10","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":4047650,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial4.docx","url":"https://assets-eu.researchsquare.com/files/rs-6976450/v1/0ddde3ee43302b1dbe07b911.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening","fulltext":[{"header":"Introduction","content":"\u003cp\u003eDepression is a major global health issue characterised by persistent low mood, loss of interest or pleasure in daily activities, and impaired cognitive and emotional functioning\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. It often results in sleep disturbances, fatigue, social withdrawal, and reduced occupational or academic productivity, imposing significant emotional and economic burdens on individuals and society\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. The World Health Organisation (WHO) estimates that depression affects approximately 3.8% of the global population\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e, yet only about half receive minimally adequate counselling or antidepressant treatment\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. 
Delayed identification of depression can exacerbate symptoms, increasing risks for chronic disability and suicide, with over 700,000 individuals dying by suicide annually due to depression\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. This underscores the critical importance of timely screening and intervention. Traditional approaches such as psychological counselling and psychiatric assessments typically require trained professionals, extensive time commitments, and substantial financial resources\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e, posing notable barriers in resource-limited settings and economically disadvantaged populations\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. Additionally, societal stigma associated with mental illness frequently discourages affected individuals from actively seeking care, further impeding timely identification and treatment\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eThe Patient Health Questionnaire-9 (PHQ-9) is one of the most widely used and validated instruments for screening and grading depressive symptoms, with pooled sensitivity and specificity of approximately 88% at the standard cut-off score of 10\u003csup\u003e8\u003c/sup\u003e. It has demonstrated strong validity across diverse populations; however, it is highly contingent on the mode of administration\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. 
Clinician‑guided or semi‑structured delivery detects suicidal ideation and psychiatric comorbidity more reliably than self‑administered completion at home or online, where comprehension, engagement, and health‑literacy levels can vary\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. Traditional face‑to‑face or paper formats may also feel emotionally taxing, impersonal, and time‑consuming, discouraging candid disclosure and full adherence\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. In addition, their static, non‑interactive design cannot adjust to users\u0026rsquo; fluctuating emotional or cognitive states\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. Together, these limitations underscore the need for alternative delivery approaches that preserve diagnostic rigour while enhancing usability, engagement, and cultural adaptability.\u003c/p\u003e\u003cp\u003eConversational agents powered by LLMs have emerged as a promising means of addressing limitations in traditional mental health screening\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. Trained on extensive corpora, LLMs can generate contextually appropriate and syntactically coherent responses, support real-time clarification of user input\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, adapt to individual linguistic patterns, and maintain coherence over extended interactions\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. 
These capabilities are particularly valuable in mental health contexts, where communication is often ambiguous, incomplete, or emotionally nuanced\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eThe integration of LLMs into clinical workflows, however, raises important concerns. These include the risk of inaccurate or unsafe outputs, opaque reasoning processes, and lack of real-time oversight in high-risk situations such as suicidal disclosures\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. Additional ethical challenges include data privacy, informed consent, and the interpretability of model-generated recommendations\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. These limitations highlight the need for rigorous validation, transparent design, and appropriate safeguards prior to clinical deployment.\u003c/p\u003e\u003cp\u003eSeveral prior chatbot-based depression screening systems, such as DEPRA\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e, IGOR\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, Perla\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e, Marcus\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e, and EmoScan\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e have demonstrated initial feasibility using structured frameworks and standardised assessments (e.g., PHQ-9, SIGH-D, IDS-C). 
DEPRA employs structured conversational flows guided by the SIGH-D and IDS-C scales, enabling natural language responses but relying heavily on predefined conversational intents, which constrain nuanced interaction\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. IGOR similarly emphasises predictable and structured dialogue paths, explicitly guiding users through the PHQ-9 to minimise conversational ambiguity and potential risks; however, it does not provide real-time interpretative feedback\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Perla integrates the PHQ-9 within a structured framework, supporting natural language interaction, yet remains restricted by predefined intents and entities, limiting conversational flexibility\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. Marcus uses BERT-based classifiers but faces challenges in effectively addressing ambiguous user inputs and providing transparent scoring explanations\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. EmoScan aims to improve linguistic generalisability through synthetic clinical dialogues, but it does not directly incorporate standardised diagnostic tools such as the PHQ-9\u003csup\u003e24\u003c/sup\u003e. Taken together, these systems made progress yet reveal persistent limitations in their capacity to support flexible dialogue, foster emotional engagement, and deliver transparent explanations, which are key attributes necessary for building user trust and encouraging sustained participation.\u003c/p\u003e\u003cp\u003eTo address these constraints, we developed HopeBot, a voice-interactive chatbot designed to deliver structured PHQ-9 depression screening within a flexible, empathic conversational environment. 
The system integrates an LLM (GPT‑4o) with retrieval-augmented generation (RAG). This setup enables adaptive interpretation of user input, generation of item-specific clarifications grounded in clinical sources, and enhanced transparency of the interaction\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. While PHQ-9 remains the core diagnostic framework, HopeBot supports open-ended dialogue before and after formal administration, adapting to users' conversational cues and engagement styles\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eWe conducted a mixed-methods investigation involving 132 participants from diverse educational and cultural backgrounds. Quantitative analyses examined demographic distributions, internal consistency of PHQ-9 items, and score concordance between self-reported and HopeBot-assisted assessments. Qualitative feedback, obtained through a structured 25-item questionnaire and follow-up interviews, explored perceptions of trust, clarity, comfort, and perceived empathy. These findings provide empirical insight into the feasibility and acceptability of LLM-driven systems as potential adjuncts to traditional depression screening pathways.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e\u003cstrong\u003eEthical Approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was reviewed and approved by the University College London (UCL) Research Ethics Committee following submission of a high-risk application (ID: 26133.001). An amendment and extension to the original protocol was subsequently approved, with ethics coverage extended until 29 January 2026. All procedures were conducted in accordance with institutional ethical standards and the principles outlined in the Declaration of Helsinki\u003csup\u003e27\u003c/sup\u003e. 
Prior to participation, informed consent was obtained from all individuals. The study was also prospectively registered on ClinicalTrials.gov under reference number NCT06801925. \u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eChatbot System Design\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHopeBot was developed as a real-time, voice-interactive assistant for depression screening through naturalistic dialogue. The system integrates GPT-4o with an RAG architecture to support open-domain dialogue while grounding responses in clinically relevant content\u003csup\u003e28\u003c/sup\u003e, including Cognitive Behavioral Therapy (CBT) transcripts, therapists\u0026rsquo; guidelines, and helpline directories. The complete system workflow is illustrated in Fig. 1. The user interface was developed using Streamlit\u003csup\u003e29\u003c/sup\u003e to enable synchronous multimodal input via keyboard or microphone (Fig. 2). Voice input was processed through an automatic speech recognition module, and system responses were synthesised into audio. All components of transcription, generation, and rendering were managed within an asynchronous event loop to preserve natural turn-taking and maintain interactional fluidity. This architecture was adopted to facilitate a seamless user experience while aligning with ethical and clinical communication standards.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe system supported both English and Mandarin through GPT-4o\u0026rsquo;s native multilingual capabilities. Responses were generated directly in the input language without translation. Mandarin outputs were produced from Chinese prompts, and audio synthesis was handled by a general-purpose text-to-speech engine.\u003c/p\u003e\n\u003cp\u003eTo ground the chatbot\u0026rsquo;s responses in validated psychological knowledge, we implemented a multi-source RAG layer using LangChain and Chroma\u003csup\u003e31\u003c/sup\u003e. 
Four primary data sources were assembled: (i) a curated corpus of 34 anonymised CBT session transcripts compiled from publicly available training materials, including YouTube-based simulations, therapist role-plays, and anonymised transcripts from online repositories; (ii) the full text of A Therapist\u0026rsquo;s Guide to Brief CBT, included to ensure coverage of structured, evidence-based strategies\u003csup\u003e32\u003c/sup\u003e; (iii) two public corpora integrated to support emotional relevance, namely ESConv, an English dialogue dataset annotated for user emotions and support strategies\u003csup\u003e33\u003c/sup\u003e, and PsyQA_example, a Chinese mental health QA corpus covering topics such as depression and anxiety\u003csup\u003e34\u003c/sup\u003e; and (iv) bilingual helpline directories from the United Kingdom (UK) and China, containing validated contact information and service descriptions. The CBT vector store integrated publicly accessible materials selected for their structured, clinically grounded nature\u003csup\u003e35,36\u003c/sup\u003e, including annotated scripts (e.g., from learn.problemgambling.ca), case dialogues, and educational videos by licensed clinicians. Subtitles from video content were extracted, speaker-segmented, and cleaned. All resources were used solely for research in accordance with their stated terms and screened for alignment with core CBT principles such as Socratic questioning, cognitive restructuring, and behavioural activation\u003csup\u003e35\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAll documents were pre-processed using a recursive character-level chunking strategy with 512-token segments and a 20% overlap. Text embeddings were generated using the text-embedding-3-small model\u003csup\u003e37\u003c/sup\u003e. At each conversational turn, semantic retrieval was performed in parallel across the three vector stores. 
The top-ranked passages were concatenated and incorporated into the GPT-4o prompt to generate evidence-informed and contextually appropriate responses. This architecture enabled the chatbot to alternate seamlessly between open-ended therapeutic dialogue and structured screening procedures, while maintaining psychological validity and factual coherence.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe chatbot operated under a structured three-phase protocol: (1) rapport building through open conversation, (2) PHQ-9 administration, and (3) personalised feedback. A mandatory transition to PHQ-9 was enforced within 20 dialogue turns to maintain screening focus. This constraint applied only before assessment; users could continue engaging with the system without dialogue limits following PHQ-9 completion. PHQ-9 items were administered sequentially, and user responses were categorised into standard A\u0026ndash;D scoring brackets (0 to 3 points). When responses were ambiguous, the model generated clarification queries to users in conversations rather than imposing premature classification. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFinal output included item-level interpretations, a total score, severity classification based on validated PHQ-9 thresholds, and tailored resource recommendations. All classification and clarification logic was embedded within the system prompt and dynamically executed by the language model. On average, GPT-4o generated each response in 1.47 \u0026plusmn; 0.30 seconds, corresponding to brief single-turn responses of 49.2 \u0026plusmn; 7.6 tokens, based on 100 representative interactions collected during internal testing. 
Speech synthesis using OpenAI\u0026rsquo;s TTS‑1 model required an additional 2.36 \u0026plusmn; 0.49 seconds, resulting in a total latency of ~3.83 seconds per user\u0026ndash;bot turn.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe prototype was reviewed by four domain experts: a practising NHS clinical psychiatrist in the UK, two doctoral researchers at UCL, and a licensed mental health counsellor in China. Reviewers noted that the system maintained acceptable response latency and did not disrupt conversational flow. Their feedback also addressed scoring validity, linguistic tone, empathy, and the handling of ambiguous responses, informing iterative refinements before participant deployment.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn addition to its technical functions, the system incorporated safeguards to address ethical, emotional, and data privacy concerns during human\u0026ndash;artificial intelligence (AI) interactions. Please refer to Supplementary Material 1 for details.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEvaluation: Participant Recruitment and Procedure\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo evaluate the performance of HopeBot as a mental health screening tool, we conducted a trial involving a diverse participant sample. The trial has been completed, and this manuscript reports the final analysis of the collected data. Participants were recruited concurrently in the UK and China, using both online and offline strategies, to ensure demographic and experiential diversity. Recruitment targeted adults aged 18 to 70 years. Advertisements were distributed via social media platforms (Facebook, X, and Xiaohongshu) and printed posters at university buildings and community venues. 
Interested individuals were instructed to contact the research team directly, after which they were provided with a participation information pack, including a Participant Information Sheet and a consent form.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAfter providing informed consent, participants were asked to complete a self-administered PHQ-9 online, serving as a baseline measure. Participants selected either English or Mandarin Chinese according to their language preference. The Chinese version of PHQ‑9 used in this study was based on the validated mainland translation widely adopted in clinical and research settings. Participants were then invited to interact with HopeBot using either a desktop or mobile device, with the option of submitting inputs via keyboard or microphone. Each interaction lasted approximately 25 minutes. Following the chatbot session, participants were required to complete a 25-item post-interaction survey (see Supplementary Material 2) covering demographic information, PHQ-9 results, and experiential feedback. The final questionnaire included 5 demographic items, 2 PHQ-9 result entries (self-reported and HopeBot-assisted), and 18 experiential feedback items, combining Likert-style ratings of comfort, empathy, voice clarity, and perceived usefulness with open-ended questions. Participants were encouraged to elaborate on their responses by providing reasons or examples. On average, completing the survey took about 35 minutes. Data were collected between 1 March and 3 April 2025.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA total of 191 individuals were initially enrolled. Submissions were excluded if participants (i) completed less than 80% of the questionnaire (n = 32), (ii) submitted incoherent or AI-generated responses (n = 12), or (iii) provided non-substantive answers to open-ended questions, such as single-word replies, vague affirmations (e.g., \u0026ldquo;good\u0026rdquo; or \u0026ldquo;helpful\u0026rdquo;), or content copied from external sources (n = 15). 
After quality screening, 132 responses were retained for analysis.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Analysis Method\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDescriptive statistics were generated for all structured survey responses using Python 3.11. To evaluate consistency between self-administered and HopeBot-assisted PHQ-9 scores, we employed a within-subject design. Absolute and signed score differences were computed, and measures of central tendency (mean, median) and dispersion (interquartile range, standard deviation) were reported\u003csup\u003e38\u003c/sup\u003e. A paired t-test and a Wilcoxon signed-rank test were conducted to compare PHQ-9 scores between formats\u003csup\u003e39\u003c/sup\u003e. Spearman\u0026apos;s rank correlation\u003csup\u003e40\u003c/sup\u003e and the intraclass correlation coefficient, ICC(3,1), were used to assess correlation and agreement between formats\u003csup\u003e41\u003c/sup\u003e, respectively. To explore associations between demographic factors and user ratings across four key outcomes (Q17\u0026ndash;Q20), independent samples t-tests and one-way ANOVA were applied, depending on the variable structure\u003csup\u003e42\u003c/sup\u003e. All significance tests were two-sided with an \u0026alpha; threshold of 0.05.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eMultilevel demographic variables were dichotomised a priori to maintain expected cell counts \u0026ge; 5 (e.g., age \u0026le; 34 vs \u0026ge; 35 years; ethnicity White vs non-White; education degree vs non-degree). Each demographic factor was cross-tabulated (2 \u0026times; 2) against three binary endpoints: (i) perceived trustworthiness of PHQ-9 scores, (ii) preferred screening modality, and (iii) intention to recommend or reuse HopeBot. Pearson\u0026rsquo;s \u0026chi;\u0026sup2; test with Yates\u0026rsquo; correction was used when appropriate\u003csup\u003e43\u003c/sup\u003e; otherwise, Fisher\u0026rsquo;s exact test was applied\u003csup\u003e44\u003c/sup\u003e. 
Statistical significance was set at \u0026alpha; = 0.05, with Holm\u0026ndash;Bonferroni adjustment for multiple comparisons.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eOpen-ended responses were thematically analysed using Braun and Clarke\u0026rsquo;s six-phase framework\u003csup\u003e30\u003c/sup\u003e (see Fig. 1). Coding was conducted inductively by the first author to allow themes to emerge from the data. To ensure analytic rigour, a second qualitative researcher (KL) independently reviewed the codes. Inter-coder agreement was 86%, indicating good consistency. Discrepancies in code assignment or theme mapping were resolved through discussion until consensus was reached. A full codebook outlining code definitions, inclusion criteria, and exemplar quotes is provided in Supplementary Material 3. Word frequency statistics were computed using Python to support theme validation and lexical salience analysis; the distribution of word frequencies is presented in Supplementary Material 4.\u0026nbsp;\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eParticipant characteristics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOf the 132 participants included in the final analysis, 68 (51.5%) were recruited in the UK. Overall, 75% were under 45 years of age, 54.5% identified as female, and 56.1% as Asian or Asian British, while 38.6% identified as White. Most participants held an undergraduate or postgraduate degree (88.7%) and were either in full-time employment (59.1%) or full-time education (22.7%). Familiarity with LLMs was high overall, with 85 participants (64.4%) describing themselves as regular users, and only 2 (1.5%) reporting no prior experience. In total, 56 participants (42.4%) had previously interacted with chatbot technologies; most of these (48/56, 85.7%) reported using general-purpose LLMs (e.g., ChatGPT, Doubao) for emotional disclosure or mental health\u0026ndash;related interactions, rather than specialised mental health chatbots. 
Prior experience with conventional mental health support was reported by 26 participants (19.7%) (see Table 1).\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1. Sociodemographic and background characteristics of the survey respondents (N = 132).\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 43px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCharacteristic\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCategory\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e\u003cstrong\u003en\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e%\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" style=\"width: 43px;\"\u003e\n \u003cp\u003eCountry of Recruitment\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eUK\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e51.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eChina\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e48.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"6\" style=\"width: 43px;\"\u003e\n \u003cp\u003eAge group (years)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e18 
\u0026ndash; 24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e20.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e25 \u0026ndash; 34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e30.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e35 \u0026ndash; 44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e24.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e45 \u0026ndash; 54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e14.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e55 \u0026ndash; 64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e9.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003e65 \u0026ndash; 70\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e1.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" style=\"width: 43px;\"\u003e\n \u003cp\u003eGender\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 
42px;\"\u003e\n \u003cp\u003eFemale\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e54.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eMale\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e45.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"5\" style=\"width: 43px;\"\u003e\n \u003cp\u003eEthnicity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eAsian or Asian British\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e56.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eWhite\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e38.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eBlack / Black British / Caribbean\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e3.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eMixed / Multiple groups\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e1.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003ePrefer not to say\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e0.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" style=\"width: 43px;\"\u003e\n \u003cp\u003eHighest education\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eUndergraduate degree\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e61.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003ePost-graduate degree (Master\u0026rsquo;s/PhD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e27.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eFurther education (e.g., A-levels/NVQ)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e9.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eNo formal qualification / Prefer not to say\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e1.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"6\" style=\"width: 43px;\"\u003e\n \u003cp\u003eEmployment status\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eFull-time 
employment\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e59.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eFull-time education/training\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e22.7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003ePart-time employment\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e9.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eLooking after home\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e3.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eOther / Retired\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e4.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003ePrefer not to say\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e0.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"5\" style=\"width: 43px;\"\u003e\n \u003cp\u003eFamiliarity with LLMs*\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
style=\"width: 42px;\"\u003e\n \u003cp\u003eRegular user\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e64.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eHeard of / tried once\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e22.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eOccasional user\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e6.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eTechnical expert\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e6.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eNo experience\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e1.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" style=\"width: 43px;\"\u003e\n \u003cp\u003eMental health chatbot experience\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eYes\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e42.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eNo\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e57.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" style=\"width: 43px;\"\u003e\n \u003cp\u003ePrevious mental health support experience\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eYes\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e19.7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eNo\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e106\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e80.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"5\" style=\"width: 43px;\"\u003e\n \u003cp\u003ePHQ-9 Severity Result \u0026ndash; Self-report\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eMinimal/None (0\u0026ndash;4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e38.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eMild (5\u0026ndash;9)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e31.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eModerate (10\u0026ndash;14)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e18.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eModerately Severe (15\u0026ndash;19)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e6.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eSevere (20\u0026ndash;27)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e3.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"5\" style=\"width: 43px;\"\u003e\n \u003cp\u003ePHQ-9 Severity Result \u0026ndash; HopeBot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eMinimal/None (0\u0026ndash;4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e36.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eMild (5\u0026ndash;9)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e35.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eModerate (10\u0026ndash;14)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e18.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eModerately Severe (15\u0026ndash;19)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e6.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 42px;\"\u003e\n \u003cp\u003eSevere (20\u0026ndash;27)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 6px;\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 7px;\"\u003e\n \u003cp\u003e3.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e*LLM = large language model.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eChatbot System Design\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWhile administering the PHQ-9, HopeBot actively sought clarification when user input was vague or non-categorical. For example, responses such as \u0026apos;maybe sometimes?\u0026apos; triggered follow-up prompts offering standardised response options. This mechanism improved scoring accuracy and reduced the risk of misclassification. However, its effectiveness depended on user engagement and could be limited by cognitive load, language barriers, or low responsiveness, highlighting a trade-off between flexibility and robustness in automated screening.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFollowing completion, the system generated a structured summary comprising item-level scores, overall severity classification, and general resource recommendations. Representative outputs illustrating responses to crisis language, ambiguous input, and summary generation are shown in Fig.2(b,c,d). Feedback was designed to be emotionally sensitive and clinically interpretable. 
While participants generally found the summaries clear and supportive, the recommendations remained generic and did not incorporate prior psychiatric history or comorbidities, reflecting broader limitations in personalisation within scalable AI-driven screening tools.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePHQ-9 Score Concordance\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA within-subject comparison was conducted to evaluate alignment between self-administered and HopeBot-assisted PHQ-9 assessments. As shown in Fig.3, scores were identical across both administrations for 59 participants (44.7%). The median absolute difference between scores was 1 point (IQR = 2.00; mean = 1.33), indicating strong overall consistency. The signed difference distribution had a median of 0.00 and a mean of 0.05, suggesting no systematic tendency for HopeBot to over- or underestimate participants\u0026rsquo; symptom severity. A paired Wilcoxon signed-rank test found no evidence of systematic bias (W = 1304.0, p = .649). Consistency between formats was high: Spearman\u0026rsquo;s rank correlation coefficient was \u0026rho; = 0.92 (p \u0026lt; .001), and the ICC(3,1) was 0.91 (95% CI: 0.88\u0026ndash;0.93), indicating excellent agreement in both absolute score magnitude and relative rank order. Despite small score differences, 37 participants (28.0%) were assigned to a different PHQ-9 severity category in the HopeBot-assisted version due to score shifts across categorical cutoffs.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eOf the participants who were asked which PHQ-9 result they trusted more, 75 provided qualitative justifications; of these, 55 (73.3%) had discrepant scores across formats, while 20 (26.7%) gave feedback despite having identical scores. The majority (n = 53, 70.7%) expressed greater confidence in the chatbot-assisted result, whereas 14 (18.7%) preferred their self-assessment, and 8 (10.7%) considered both formats equally valid. 
Participants who preferred HopeBot\u0026apos;s result often cited its clearer structure and interpretive scaffolding. The most common rationale (33 mentions, 42.9%) described the chatbot as providing \u0026lsquo;detailed guidance\u0026rsquo; or \u0026lsquo;examples that clarified my emotions\u0026rsquo;. Others highlighted the emotional support HopeBot offered (15 mentions, 19.5%) or its ability to facilitate deeper self-reflection (8 mentions, 10.4%), contrasting with the quicker, more instinctive nature of the self-test. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eConversely, some participants expressed greater trust in their self-administered PHQ-9 scores. The most frequently coded rationale (8 mentions, 10.4%) described the self-assessment as more intuitive and spontaneous, with several responses noting that the chatbot\u0026rsquo;s guided prompts occasionally encouraged overthinking. Privacy-related discomfort with disclosing sensitive information to an AI system was also reported (5 mentions, 6.5%). Others pointed to technical limitations (3 mentions, 3.9%), including delays in input recognition or submission issues. One mention (1.3%) described reduced concentration due to the slower pacing of the chatbot interaction. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhile a chi-square test showed a significant association between self-reported PHQ-9 severity and trust in HopeBot (\u0026chi;\u0026sup2; = 11.65, df = 4, p = 0.020), this was not supported by logistic regression assuming a linear trend (OR = 1.32, 95% CI 0.79\u0026ndash;2.20, p = 0.29), suggesting the relationship may be non-monotonic or driven by specific subgroups.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFeedback and User Experience\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eParticipant feedback, coded by mention frequency, highlighted both strengths and limitations of HopeBot. 
Personalised advice (50 mentions, 17.9%) was the most frequently mentioned strength, followed by emotional support (31 mentions, 11.1%) and prompt response timing (30 mentions, 10.7%). A few participants also highlighted affirming communication (5 mentions, 1.8%). Criticisms focused on shallow or generic replies (33 mentions, 11.8%) and voice-related issues, including delayed output (10 mentions, 3.6%) and mechanical delivery (8 mentions, 2.9%).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eBuilding on these impressions, participants also evaluated HopeBot\u0026rsquo;s performance during the PHQ-9 screening phase. The transition from open dialogue to the PHQ-9 was generally well received: 79.5% of all participants described it as natural, and 97.7% found the instructions and questions easy to understand. However, 33.3% of participants (n = 44) requested clarification on item interpretation or scoring; among them, 93.2% (41/44) found the chatbot\u0026rsquo;s explanations helpful. While 77.3% of participants characterised the overall interaction as natural, some noted pacing concerns: 17 responses (7.0%) described the transition as abrupt, and 15 (6.1%) said it felt rushed. These findings suggest that fixed dialogue limits, such as the 20-turn threshold before initiating the PHQ-9, may not always align with users\u0026rsquo; conversational flow or emotional readiness.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eQuantitative ratings reinforced these observations (Fig.4). On a 10-point scale, participants rated HopeBot\u0026rsquo;s handling of sensitive topics at a mean of 7.60 (SD = 1.53), supported by 63 mentions (31.0%) citing its empathic tone and 50 mentions (24.6%) referencing practical guidance. However, concerns were also raised regarding shallow responses (36 mentions, 17.7%), robotic delivery (16 mentions, 7.9%), and repetitive scripted messages (7 mentions, 3.4%). 
For example, in response to intense emotional disclosures, the chatbot often reiterated that it was not a licensed psychologist and advised users to seek professional help.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eHopeBot\u0026rsquo;s capacity to facilitate emotional expression without judgment received a higher mean rating of 8.44 (SD = 1.53). This was frequently attributed to perceived confidentiality and a non-intrusive communication style. Anonymity was referenced in 72 mentions (38.7%), while 24 mentions (12.9%) highlighted its neutral and non-moralising language. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePerceived usefulness of the chatbot\u0026rsquo;s advice was moderately high, with a mean score of 7.36 (SD = 2.06). Many participants reported that the recommendations were clear and actionable (72 mentions, 35.0%). In contrast, 43 mentions (20.9%) described the content as overly generic or lacking in depth. One participant noted that while the guidance was accurate, its similarity to publicly available information reduced its perceived value.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eHopeBot\u0026rsquo;s voice output was generally well received, with a mean clarity rating of 7.73 (SD = 1.49). Positive feedback most frequently cited clear pronunciation (117 mentions, 33.0%) and an empathetic, human-like tone (45 mentions, 13.0%). Criticisms centred on slow or inaccurate speech recognition (32 mentions, 9.3%) and limited personalisation (25 mentions, 7.2%). Additionally, 45.5% of participants (n = 60) preferred reading the on-screen transcript over listening to the full audio, citing greater convenience and discretion.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAcross all demographic comparisons (Table.2), no statistically significant associations were found between age, gender, ethnicity, education level, or PHQ-9 severity (both self-reported and HopeBot-assisted) and any of the four HopeBot ratings. 
Employment status produced a significant omnibus effect for the perceived helpfulness of HopeBot\u0026rsquo;s recommendations (Q19: F = 3.20, p = .006), whereas its impact on the remaining dimensions was nonsignificant (Q17, Q18, Q20: p \u0026gt; .25). Follow-up contrasts indicated that the difference reflected variability among employment sub-groups rather than a uniform shift across the full sample.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eParticipants who had prior experience with mental-health treatment gave slightly lower Q19 scores than those without such experience (t = \u0026ndash;2.65, p = .012); their ratings of handling sensitive topics (Q17), comfort expressing feelings (Q18), and voice clarity (Q20) did not differ (p \u0026ge; .10). Previous use of mental-health chatbots was unrelated to any rating (all p \u0026ge; .15). Taken together, perceptions of HopeBot were largely stable across demographic groups, with the sole notable finding being reduced perceived helpfulness of recommendations among participants in certain employment categories and among those who had already engaged with mental-health services.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePreferences for interaction modality varied. Just over half of the participants (51.5%) preferred text-based communication, citing convenience, reduced transcription errors, and greater suitability for private contexts. In comparison, 40.9% favoured voice-based interaction, highlighting its interactivity and perceived naturalness. A smaller subset (7.6%) reported no clear preference. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable.\u003c/strong\u003e\u003cstrong\u003e2\u003c/strong\u003e\u003cstrong\u003e. Association between demographic characteristics and HopeBot user ratings (Q17\u0026ndash;Q20).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNotes. 
Q17 = handling of sensitive depression topics; Q18 = comfort expressing feelings without judgment; Q19 = helpfulness of recommendations; Q20 = clarity and tone of voice output. All values are rounded to three significant figures. Bold indicates p \u0026lt; 0.05.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"614\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDemographic variable\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTest\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" style=\"width: 95px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eQ17\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eQ18\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eQ19\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" style=\"width: 104px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eQ20\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 142px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 66px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eF or t statistic\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eF or t statistic\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eF or t statistic\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eF or t statistic\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eAge group (6 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.559\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.731\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e1.37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.241\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e1.43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.219\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e1.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.147\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eGender (2 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003et test\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
style=\"width: 47px;\"\u003e\n \u003cp\u003e0.696\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.488\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.186\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.853\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e1.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.087\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e1.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.227\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eEthnicity (5 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.889\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.473\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.398\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.810\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.821\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.514\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e0.709\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.587\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eHighest education (4 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n 
\u003cp\u003e0.174\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.951\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.831\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.508\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e1.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.323\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e0.607\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.658\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eEmployment status (6 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.661\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.681\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e1.32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.253\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e3.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.006\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e1.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.320\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eMental health support experience (binary)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003et test\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
style=\"width: 47px;\"\u003e\n \u003cp\u003e-0.637\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.528\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e-1.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.137\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e-2.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.012\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e-1.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.261\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003eChatbot experience (binary)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003et test\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e-1.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.153\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.380\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.705\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e-0.859\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.392\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e0.134\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.894\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003ePHQ-9 Severity Result \u0026ndash; HopeBot (5 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n 
\u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e1.23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.300\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.462\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.763\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e1.57\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.188\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e1.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.150\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp\u003ePHQ-9 Severity Result \u0026ndash; Self-test (5 levels)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e1.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.238\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.284\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.888\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003e0.823\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003e0.513\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 54px;\"\u003e\n \u003cp\u003e0.394\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 50px;\"\u003e\n \u003cp\u003e0.813\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003ePerceived Acceptability and Adoption 
Intentions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eParticipant preferences for PHQ-9 administration formats varied. A majority (n = 92, 69.7%) favoured the chatbot-assisted version over self-completion, citing greater engagement (39 mentions, 20.2%), emotionally supportive and interactive communication (36 mentions, 18.7%), and real-time interpretive scaffolding (32 mentions, 16.6%). In contrast, participants who preferred self-administration emphasised its efficiency (20 mentions, 10.4%) and its perceived suitability for situations where users were not experiencing immediate emotional distress (13 mentions, 6.7%). A small subset (n = 6, 4.6%) expressed no clear preference. Preference was associated with employment status overall (\u0026chi;\u0026sup2; = 21.69, df = 12, p = .041), but no single employment subgroup (all p \u0026gt; 0.5) showed a statistically significant odds ratio relative to others, suggesting weak or diffuse effects.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAlthough only 19.7% (n = 26) of respondents reported prior engagement with professional mental health services, all participants were invited to reflect on HopeBot\u0026rsquo;s performance relative to mental health counselling. Consistent with earlier themes, participants positively appraised the chatbot\u0026rsquo;s structured questioning and empathic tone (each 38 mentions, 12.3%). However, limitations were frequently noted, including perceptions of insufficient human-likeness (58 mentions, 18.7%), emotional shallowness or detachment (42 mentions, 13.5%), and overly generic or impersonal responses lacking individual tailoring (19 mentions, 6.1%).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eDespite some concerns, the majority of participants (n = 115, 87.1%) expressed willingness to use or recommend HopeBot in the future. 
Many highlighted its broader potential in mental health screening, particularly for early detection via algorithmic pattern recognition (56 mentions, 22.0%) and supportive, emotionally responsive communication (32 mentions, 12.6%). Other frequently cited advantages included immediate accessibility (15 mentions, 5.9%), rapid response time (14 mentions, 5.5%), and anonymous interaction (8 mentions, 3.1%). Several participants (n = 12, 4.7%) stressed that such tools should augment\u0026mdash;not replace\u0026mdash;professional care. Concerns centred on the need for clinical validation (4 mentions, 1.6%), potential diagnostic unreliability (3 mentions, 1.2%), and data privacy risks (11 mentions, 4.4%). A small number expressed cautious optimism, emphasising the importance of ethical governance and integration into trusted health systems (4 mentions, 1.6%). Willingness to recommend was significantly lower among participants with prior mental health service experience (Fisher\u0026rsquo;s exact OR = 0.22, CI 0.05-0.92, p = .0497), suggesting that those with firsthand experience may apply more critical standards in evaluating AI-based tools.\u0026nbsp;\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe present study demonstrates that a GPT-4o-powered, voice-interactive chatbot (HopeBot) can feasibly administer the PHQ-9. HopeBot-assisted and self-administered scores showed high concordance (ICC = 0.91; median absolute difference = 1 point), without systematic bias in symptom severity. 
Participants described the chatbot as timely and accessible, supporting the potential of automated mental health screening beyond clinician-led settings, though broader validation remains necessary.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAlthough clinician-administered PHQ-9 interviews detect suicidality and comorbid conditions with higher sensitivity\u003csup\u003e11\u003c/sup\u003e, self-administered formats remain standard in digital screening, showing acceptable psychometric performance (sensitivity \u0026asymp;0.80; specificity \u0026asymp;0.85)\u003csup\u003e8\u003c/sup\u003e\u003csup\u003e,\u003c/sup\u003e\u003csup\u003e45\u003c/sup\u003e. In this study, the self-test served as a pragmatic reference, aligning with real-world usage where individuals commonly complete online questionnaires independently. HopeBot achieved comparable agreement with this benchmark while offering additional benefits such as real-time clarification, empathic support, and increased user engagement. \u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBeyond score concordance, HopeBot elicited considerable user trust. Among the 75 participants who directly compared the two formats, 70.7% (n = 53/75) expressed greater confidence in the chatbot-assisted scores, attributing this preference to features such as real-time clarification (33 mentions, 42.9%) and an empathic tone (15 mentions, 19.5%). These interactional advantages parallel those of semi-structured clinical interviews, while preserving the scalability, standardisation, and accessibility of automated delivery.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eHowever, perceptions of recommendation helpfulness (Q19), a key determinant of user trust, varied across subgroups. 
Full-time students and individuals managing household responsibilities gave lower ratings than full-time employees (F = 3.20, p = .006), and those with prior mental health service experience rated recommendations as less helpful than first-time users (t = \u0026ndash;2.65, p = .012). These differences likely reflect heightened expectations shaped by users\u0026rsquo; life context and therapeutic background. Students and homemakers, who often manage complex emotional demands with limited external support, may have anticipated greater empathy and personalisation\u003csup\u003e46\u003c/sup\u003e\u003csup\u003e,\u003c/sup\u003e\u003csup\u003e47\u003c/sup\u003e. Similarly, individuals with prior counselling experience may have evaluated responses against professional standards\u003csup\u003e4\u003c/sup\u003e, consistent with expectancy-disconfirmation theory\u003csup\u003e48\u003c/sup\u003e. These findings suggest that perceptions of chatbot utility are strongly influenced by users\u0026rsquo; prior experiences and situational expectations.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAs summarised in Table 3, HopeBot introduces several substantive advancements over earlier PHQ-9 chatbots that rely primarily on Dialogflow-based intent matching (e.g., Perla\u003csup\u003e14\u003c/sup\u003e, Marcus\u003csup\u003e23\u003c/sup\u003e, DEPRA\u003csup\u003e17\u003c/sup\u003e). By integrating RAG with the GPT-4o architecture, HopeBot supports fully open-ended interaction while meeting the technical constraints of real-time screening. 
GPT-4o was selected based on three key considerations: (1) independent benchmarks reported the lowest latency among publicly available LLMs (\u0026asymp;0.45 s for text, \u0026asymp;0.32 s for audio) at that time, outperforming Claude 3 and Gemini 1.5\u003csup\u003e50\u003c/sup\u003e; (2) its unified multimodal framework eliminates the need for separate ASR\u0026ndash;TTS pipelines, which remain necessary for open-source and contemporary commercial alternatives\u003csup\u003e51\u003c/sup\u003e; and (3) its extended context window (100k tokens) and multilingual tokeniser ensure compatibility with the demands of interactive PHQ-9 delivery while preserving clinical safety constraints\u003csup\u003e52\u003c/sup\u003e. Although emerging models such as Claude 3, Gemini 1.5, and Llama-3 warrant future investigation, their current limitations in latency, speech integration, and alignment tooling rendered them suboptimal for the present study.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis architecture enables dynamic turn-taking, clarification of ambiguous responses, and seamless support for languages beyond English and Mandarin, capabilities not reported in prior systems. Transparency is further enhanced through item-level scoring and source-linked rationales, features absent in comparators such as IGOR\u003csup\u003e22\u003c/sup\u003e and Marcus\u003csup\u003e23\u003c/sup\u003e. Among participants who engaged with the clarification module, 93.2% indicated that these explanations improved their response accuracy.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eUser feedback underscores these functional gains. In contrast to Marcus, where only 18.1% of users preferred the chatbot over conventional self-report\u003csup\u003e23\u003c/sup\u003e, 69.7% (n = 92) of participants in this study favoured HopeBot, and 87.1% (n = 115) expressed willingness to use or recommend the system. 
Collectively, HopeBot\u0026rsquo;s integration of low-latency generation, multilingual adaptability, explainable outputs, and improved user engagement positions it as a more transparent and clinically versatile alternative to earlier rule-based tools.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3. Comparison of automated depression screening tools across key functional dimensions\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"985\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eDimension\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCore architecture\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eLanguage flexibility\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eScreening instrument\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eExplainability\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eUser study (N)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eEmpathy/tone\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eDeployment potential\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eKey limitation in prior work\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePerla 
(2020)\u003c/strong\u003e\u003cstrong\u003e\u003csup\u003e14\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eGoogle Dialogflow with ML-based intent classification and Firebase backend\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eNatural language input with ~200 phrases per item and synonym matching\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePHQ-9 (Spanish)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eProvides total score, risk status, and resource links at the end\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e276 participants; 108 completed both Perla and web-based PHQ-9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eSupportive tone with female persona and encouraging prompts\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eWeb and major messaging platforms (e.g., Messenger, Google Assistant, Telegram)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLimited validation, English-only tools, and low engagement in prior form-based tools\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eMarcus (2023)\u003c/strong\u003e\u003cstrong\u003e\u003csup\u003e23\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eDialogflow intents + BERT model (Node.js / GCP; Kommunicate UI)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eFree-text input classified to PHQ-9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePHQ-9 (English)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eOutputs total PHQ-9 score only\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e81 U.S. 
college students (130 enrolled)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNeutral; static male avatar; no empathy modelling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eiOS app and web chat prototype\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eEarlier PHQ-9 chatbots lacked validation with U.S. college students and used only fixed-choice input\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eIGOR (2021)\u003c/strong\u003e\u003cstrong\u003e\u003csup\u003e22\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eDialogflow intents with Node.js + Firebase backend\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eButton/option input (PHQ-9 scores 0\u0026ndash;3)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003ePHQ-9\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSends total score to clinician; not shown to user\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e10 university staff (usability test)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNeutral; no empathic responses\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePrototype within MS self-management app\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eResults hidden from the user; rule-based flow fails on off-topic input\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eDEPRA (2023)\u003c/strong\u003e\u003cstrong\u003e\u003csup\u003e17\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eDialogflow chatbot with 27 SIGH-D/IDS-C intents\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eOpen-text input 
with intent matching\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eSIGH-D + IDS-C\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eFinal score and severity level only; no item-level feedback\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e50 Australian adults\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNeutral tone; no empathy modelling\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eFacebook Messenger chatbot prototype\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh cognitive load; long completion time\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eEmoScan (2024)\u003c/strong\u003e\u003cstrong\u003e\u003csup\u003e24\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eMistral-7B fine-tuned on synthetic clinical interviews (PsyInterview)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eFree-text, multi-turn inputs; fine-tuned LLM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eDSM-5-based emotional disorder classification (coarse \u0026amp; fine-grained)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eLLM-generated explanations based on DSM-5 criteria\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1,157 synthetic cases; 50 expert-evaluated; GPT-4-based performance evaluation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSimulated empathy assessed by GPT-4 and clinical experts\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eResearch prototype; not deployed clinically\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHeavy reliance on synthetic data; limited real-world generalisability\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMoodpath App (2021)\u003c/strong\u003e\u003cstrong\u003e\u003csup\u003e49\u003c/sup\u003e\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eSmartphone app, 3\u0026times; daily AA (45 ICD-10 items + mood)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eTap yes/no \u0026rarr; 4-level severity; 5-point mood scale\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eDSM-5-based emotional disorder classification (coarse \u0026amp; fine-grained)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e14-day summary with score, severity band \u0026amp; mood charts\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e113 general-population users\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNeutral; no empathic dialogue\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLive on iOS \u0026amp; Android\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePrior tools used retrospective questionnaires with little validation\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eHopeBot (2025)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003eGPT-4o with RAG\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSupports open-ended user input in various languages\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePHQ-9 administered via free-text/voice dialogue, with dynamic clarification and fuzzy score interpretation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eProvides item-level rationales and context-relevant evidence drawn from curated knowledge bases\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e132 participants (Chinese + UK 
residents)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLLM-generated responses were generally perceived as empathic; mean comfort rating 8.51/10, though variation in emotional tone and delivery style was reported\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAvailable via web interface; compatible with both desktop and mobile devices (iOS, Android); supports both Mandarin and English voice/text modalities\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAbsence of non-verbal cues, occasional mechanical tone in voice output, and lack of clinical validation for diagnostic reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eDespite its technical strengths, HopeBot did not fully replicate the relational depth of professional counselling. A substantial number of participants described the interaction as emotionally flat or impersonal (58 mentions, 18.7%), citing insufficient affective nuance (42 mentions, 13.5%) and reliance on generic responses (19 mentions, 6.1%). These limitations indicate that, even with prompt tuning and retrieval augmentation, simulated empathy remains perceptibly artificial. The findings highlight a persistent gap between the linguistic fluency of LLMs and the emotional authenticity expected in therapeutic dialogue.\u003c/p\u003e\n\u003cp\u003eMore broadly, these limitations reflect structural constraints inherent in current-generation conversational AI. While LLMs can generate fluent and contextually appropriate text, they lack access to non-verbal signals\u0026mdash;including tone, facial expression, and posture\u0026mdash;which clinicians routinely rely upon to identify distress, hesitancy, or latent risk\u003csup\u003e18\u003c/sup\u003e. Cross-linguistic speech synthesis poses further challenges. 
Although GPT-4o natively supports Mandarin text, the deployment of a text-to-speech engine optimised for English prosody introduced prosodic inconsistencies that reduced perceived naturalness and constrained affective expressiveness\u003csup\u003e53\u003c/sup\u003e. Culturally adaptive voice synthesis models may be required to ensure emotional fidelity and communicative clarity across diverse linguistic contexts.\u003c/p\u003e\n\u003cp\u003eBeyond technical limitations, ethical and epistemic challenges must also be addressed. Repeated exposure to standardised, syntactically polished language may subtly influence how users articulate their experiences\u003csup\u003e54\u003c/sup\u003e, potentially narrowing expressive nuance. Emerging evidence further suggests that clinicians may revise their judgments when presented with opaque AI-generated recommendations, even in the absence of a clear clinical rationale\u003csup\u003e55\u003c/sup\u003e, raising concerns about automation bias and the erosion of clinical autonomy. Accordingly, systems such as HopeBot should be positioned as adjunctive tools that support\u0026mdash;rather than replace\u0026mdash;professional expertise\u003csup\u003e18\u003c/sup\u003e. Responsible deployment will require transparent algorithmic logic, explicit clinical oversight, and safeguards that ensure interpretability, accountability, and informed consent. Future research should extend beyond screening accuracy to investigate how such technologies influence therapeutic relationships, user trust, and long-term mental health outcomes.\u003c/p\u003e\n\u003cp\u003eThis study has several limitations. Although chatbot-assisted scores aligned closely with self-reports, this does not establish diagnostic validity, given the lower sensitivity of self-assessments compared to clinician-led evaluations. 
The sequential completion of both PHQ-9 formats within a single session may have introduced recall bias, with concordance potentially influenced by short-term memory or transient mood states. The sample was skewed toward younger, digitally literate users, limiting generalisability to older or digitally excluded populations. In addition, the controlled testing environment may not reflect naturalistic conditions. All findings are specific to GPT-4o and may not extend to other LLM-based systems.\u003c/p\u003e\n\u003cp\u003eFuture research should evaluate clinical utility, safety, and equity across settings. Multisite trials and randomised comparisons with standard self-assessment tools could clarify HopeBot\u0026rsquo;s impact on referral accuracy, care access, and clinician workload. Development of governance frameworks\u0026mdash;such as escalation protocols, audit trails, and transparent disclosures\u0026mdash;will be essential to meet regulatory standards. Particular attention is needed for high-risk encounters requiring non-verbal cues, and for underserved groups with limited digital access or linguistic mismatch. Cross-cultural validation will also be necessary to determine the applicability of LLM-assisted screening across healthcare systems, including those in the UK and China.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eRestrictions apply to the availability of the full dataset generated and analysed during the current study in order to protect participant privacy; accordingly, these data are not publicly available. 
However, the custom training data used to develop the RAG component of HopeBot is available at: https://github.com/candiceguo0528/HopeBot-Candice.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code for the analysis is available through a GitHub code repository (https://github.com/candiceguo0528/HopeBot-Candice).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors gratefully acknowledge the invaluable contributions of the following colleagues to the evaluation of HopeBot: Dr Alexandru Petcu, MD (NHS consultant psychiatrist) for clinical insights and safety guidance; Kai Yao and Zuyu Wang (Post-Graduate Teaching Assistants, UCL Division of Psychiatry) for assistance with study design, participant recruitment and data interpretation; and Wei\u0026rsquo;an Li (licensed mental-health counsellor, China) for expert feedback on Mandarin content and cultural adaptation. 
Their support greatly strengthened the rigour and relevance of this work.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor information\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors and Affiliations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInstitute of Health Informatics, University College London, London, United Kingdom\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eZhijun Guo; Alvina Lai; Julia Ive; Yutong Wang; Luyuan Qi; Johan H Thygesen; Kezhi Li\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLancashire and South Cumbria NHS Foundation Trust, Psychiatry Department, Lancashire, UK\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUniversity of Medicine and Pharmacy \u0026quot;Victor Babeș\u0026quot;, Timișoara, Romania\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAlexandru Petcu\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eContributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConceptualisation: Z.G., K.L.; System development and deployment: Z.G.; Streamlit implementation: Z.G., L.Q.; Participant recruitment and facilitation: Z.G., Y.W.; Qualitative analysis: Z.G.; Secondary validation: K.L.; Writing\u0026ndash;original draft: Z.G.; Writing\u0026ndash;review and editing: Z.G., K.L., A.L., J.I., J.T.; Clinical evaluation and feedback: A.P.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCorresponding author\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCorrespondence to\u0026nbsp;Kezhi Li.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics declarations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eWorld Health Organisation (WHO). 
Depressive disorder (depression). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.who.int/news-room/fact-sheets/detail/depression\u003c/span\u003e\u003cspan address=\"https://www.who.int/news-room/fact-sheets/detail/depression\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNHS. Symptoms - Depression in adults. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.nhs.uk/mental-health/conditions/depression-in-adults/symptoms\u003c/span\u003e\u003cspan address=\"https://www.nhs.uk/mental-health/conditions/depression-in-adults/symptoms\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePuyat, J. H., Kazanjian, A., Goldner, E. M. \u0026amp; Wong, H. How Often Do Individuals with Major Depression Receive Minimally Adequate Treatment? A Population-Based, Data Linkage Study. \u003cem\u003eCan. J. Psychiatry Rev. Can. Psychiatr.\u003c/em\u003e 61, 394\u0026ndash;404 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGuo, Z., Lai, A., Deng, Z. \u0026amp; Li, K. Evaluating the Feasibility and Acceptability of a GPT-Based Chatbot for Depression Screening: A Mixed-Methods Study. in \u003cem\u003eArtificial Intelligence in Healthcare\u003c/em\u003e (eds. Xie, X., Styles, I., Powathil, G. \u0026amp; Ceccarelli, M.) 249\u0026ndash;263 (Springer Nature Switzerland, Cham, 2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCook, S. C., Schwartz, A. C. \u0026amp; Kaslow, N. J. Evidence-Based Psychotherapy: Advantages and Challenges. \u003cem\u003eNeurotherapeutics\u003c/em\u003e 14, 537\u0026ndash;545 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGOV.UK. Health matters: reducing health inequalities in mental illness. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.gov.uk/government/publications/health-matters-reducing-health-inequalities-in-mental-illness/health-matters-reducing-health-inequalities-in-mental-illness\u003c/span\u003e\u003cspan address=\"https://www.gov.uk/government/publications/health-matters-reducing-health-inequalities-in-mental-illness/health-matters-reducing-health-inequalities-in-mental-illness\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFaugno, E. \u003cem\u003eet al.\u003c/em\u003e Experiences with diagnostic delay among underserved racial and ethnic patients: a systematic review of the qualitative literature. \u003cem\u003eBMJ Qual. Saf.\u003c/em\u003e 34, 190\u0026ndash;200 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAmerican Psychological Association. Patient Health Questionnaire (PHQ-9 \u0026amp; PHQ-2). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/patient-health\u003c/span\u003e\u003cspan address=\"https://www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/patient-health\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2011).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLee, P. W., Schulberg, H. C., Raue, P. J. \u0026amp; Kroenke, K. Concordance between the PHQ-9 and the HSCL-20 in depressed primary care patients. \u003cem\u003eJ. Affect. Disord.\u003c/em\u003e 99, 139\u0026ndash;145 (2007).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRobinson, J. \u003cem\u003eet al.\u003c/em\u003e Why are there discrepancies between depressed patients\u0026rsquo; Global Rating of Change and scores on the Patient Health Questionnaire depression module? 
A qualitative study of primary care in England. \u003cem\u003eBMJ Open\u003c/em\u003e 7, e014519 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEack, S. M., Greeno, C. G. \u0026amp; Lee, B.-J. Limitations of the Patient Health Questionnaire in Identifying Anxiety and Depression: Many Cases Are Undetected. \u003cem\u003eRes. Soc. Work Pract.\u003c/em\u003e 16, 625\u0026ndash;631 (2006).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMorris, R. R., Schueller, S. M. \u0026amp; Picard, R. W. Efficacy of a Web-Based, Crowdsourced Peer-To-Peer Cognitive Reappraisal Platform for Depression: Randomized Controlled Trial. \u003cem\u003eJ. Med. Internet Res.\u003c/em\u003e 17, e4167 (2015).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMohr, D. C., Burns, M. N., Schueller, S. M., Clarke, G. \u0026amp; Klinkman, M. Behavioral Intervention Technologies: Evidence review and recommendations for future research in mental health. \u003cem\u003eGen. Hosp. Psychiatry\u003c/em\u003e 35, 332\u0026ndash;338 (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eArrabales, R. Perla: A Conversational Agent for Depression Screening in Digital Ecosystems. Design, Implementation and Validation. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2008.12875\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2008.12875\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMaples, B., Cerit, M., Vishwanath, A. \u0026amp; Pea, R. Loneliness and suicide mitigation for students using GPT3-enabled chatbots. \u003cem\u003eNpj Ment. Health Res.\u003c/em\u003e 3, 1\u0026ndash;6 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMahbub, M. 
\u003cem\u003eet al.\u003c/em\u003e Decoding substance use disorder severity from clinical notes using a large language model. \u003cem\u003eNpj Ment. Health Res.\u003c/em\u003e 4, 1\u0026ndash;10 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKaywan, P., Ahmed, K., Ibaida, A., Miao, Y. \u0026amp; Gu, B. Early detection of depression using a conversational AI bot: A non-clinical trial. \u003cem\u003ePLOS ONE\u003c/em\u003e 18, e0279743 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGuo, Z. \u003cem\u003eet al.\u003c/em\u003e Large Language Models for Mental Health Applications: Systematic Review. \u003cem\u003eJMIR Ment. Health\u003c/em\u003e 11, e57400 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBubeck, S. \u003cem\u003eet al.\u003c/em\u003e Sparks of Artificial General Intelligence: Early experiments with GPT-4. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2303.12712\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2303.12712\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFitzpatrick, K. K., Darcy, A. \u0026amp; Vierhile, M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. \u003cem\u003eJMIR Ment. Health\u003c/em\u003e 4, e7785 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMontgomery, B. Mother says AI chatbot led her son to kill himself in lawsuit against its maker. 
\u003cem\u003eThe Guardian\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.theguardian.com/technology/2024/oct/23/character-ai-chatbot-sewell-setzer-death\u003c/span\u003e\u003cspan address=\"https://www.theguardian.com/technology/2024/oct/23/character-ai-chatbot-sewell-setzer-death\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGiunti, G., Isomursu, M., Gabarron, E. \u0026amp; Solad, Y. Designing Depression Screening Chatbots. in \u003cem\u003eNurses and Midwives in the Digital Age\u003c/em\u003e 259\u0026ndash;263 (IOS Press, 2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3233/SHTI210719\u003c/span\u003e\u003cspan address=\"10.3233/SHTI210719\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eToulme, P., Nanaw, J. \u0026amp; Apostolellis, P. Marcus: A Chatbot for Depression Screening Based on the PHQ-9 Assessment. in \u003cem\u003eProceedings of the 16th International Conference on Advances in Computer-Human Interactions (ACHI 2023)\u003c/em\u003e 97\u0026ndash;105 (IARIA, Venice, 2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu, J. M. \u003cem\u003eet al.\u003c/em\u003e Enhanced Large Language Models for Effective Screening of Depression and Anxiety. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2501.08769\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2501.08769\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHatchWorks AI. Harnessing RAG in Healthcare: Use-Cases, Impact, \u0026amp; Solutions. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hatchworks.com/blog/gen-ai/rag-for-healthcare\u003c/span\u003e\u003cspan address=\"https://hatchworks.com/blog/gen-ai/rag-for-healthcare\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eThase, M. E., Khazanov, G. \u0026amp; Wright, J. H. Cognitive and behavioral therapies. In \u003cem\u003eTasman\u0026rsquo;s Psychiatry\u003c/em\u003e (eds Tasman, A. \u003cem\u003eet al.\u003c/em\u003e) 1\u0026ndash;38 (Springer International Publishing, Cham, 2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/978-3-030-42825-9_35-1\u003c/span\u003e\u003cspan address=\"10.1007/978-3-030-42825-9_35-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWMA - The World Medical Association-WMA Declaration of Helsinki \u0026ndash; Ethical Principles for Medical Research Involving Human Participants. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.wma.net/policies-post/wma-declaration-of-helsinki\u003c/span\u003e\u003cspan address=\"https://www.wma.net/policies-post/wma-declaration-of-helsinki\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGupta, S., Ranjan, R. \u0026amp; Singh, S. N. A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2410.12837\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2410.12837\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
(2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eStreamlit \u0026bull; A faster way to build and share data apps. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://streamlit.io\u003c/span\u003e\u003cspan address=\"https://streamlit.io\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBraun, V. \u0026amp; Clarke, V. Using thematic analysis in psychology. \u003cem\u003eQualitative Research in Psychology\u003c/em\u003e 3, 77\u0026ndash;101 (2006).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLangChain. Chroma: LangChain integration for vector storage. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://python.langchain.com/docs/integrations/vectorstores/chroma\u003c/span\u003e\u003cspan address=\"https://python.langchain.com/docs/integrations/vectorstores/chroma\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCully, J. A. \u0026amp; Teten, A. L. \u003cem\u003eA Therapist\u0026rsquo;s Guide to Brief Cognitive Behavioral Therapy\u003c/em\u003e (U.S. Department of Veterans Affairs, 2008).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTHU-COAI. Emotional-Support-Conversation. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/thu-coai/Emotional-Support-Conversation\u003c/span\u003e\u003cspan address=\"https://github.com/thu-coai/Emotional-Support-Conversation\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTHU-COAI. PsyQA: PsyQA_example.json. 
GitHub \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/thu-coai/PsyQA/blob/main/PsyQA_example.json\u003c/span\u003e\u003cspan address=\"https://github.com/thu-coai/PsyQA/blob/main/PsyQA_example.json\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFenn, K. \u0026amp; Byrne, M. The key principles of cognitive behavioural therapy. \u003cem\u003eInnovAiT\u003c/em\u003e 6, 579\u0026ndash;585 (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNakao, M., Shirotsuki, K. \u0026amp; Sugaya, N. Cognitive\u0026ndash;behavioral therapy for management of mental health and stress-related disorders: Recent advances in techniques and technologies. \u003cem\u003eBiopsychosoc Med\u003c/em\u003e 15, 16 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOpenAI. Vector embeddings - OpenAI API. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://platform.openai.com\u003c/span\u003e\u003cspan address=\"https://platform.openai.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBland, J. M. \u0026amp; Altman, D. STATISTICAL METHODS FOR ASSESSING AGREEMENT BETWEEN TWO METHODS OF CLINICAL MEASUREMENT. \u003cem\u003eThe Lancet\u003c/em\u003e 327, 307\u0026ndash;310 (1986).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDATAtab. Wilcoxon Test Tutorial: t-Test, Chi-Square, ANOVA, Regression, Correlation. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://datatab.net/tutorial/wilcoxon-test\u003c/span\u003e\u003cspan address=\"https://datatab.net/tutorial/wilcoxon-test\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSpearman, C. 
The proof and measurement of association between two things. \u003cem\u003eAm. J. Psychol.\u003c/em\u003e 15, 72\u0026ndash;101 (1904).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShrout, P. E. \u0026amp; Fleiss, J. L. Intraclass correlations: Uses in assessing rater reliability. \u003cem\u003ePsychological Bulletin\u003c/em\u003e 86, 420\u0026ndash;428 (1979).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFisher, R. A. Statistical methods for research workers. In \u003cem\u003eBreakthroughs in Statistics: Methodology and Distribution\u003c/em\u003e (eds Kotz, S. \u0026amp; Johnson, N. L.) 66\u0026ndash;70 (Springer, New York, 1992). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/978-1-4612-4380-9_6\u003c/span\u003e\u003cspan address=\"10.1007/978-1-4612-4380-9_6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLaerd Statistics. Chi-Square Test for Association using SPSS Statistics. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-using-spss-statistics.php\u003c/span\u003e\u003cspan address=\"https://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-using-spss-statistics.php\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTechnology Networks. The Fisher\u0026rsquo;s Exact Test. Technology Networks. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.technologynetworks.com/tn/articles/the-fishers-exact-test-385738\u003c/span\u003e\u003cspan address=\"http://www.technologynetworks.com/tn/articles/the-fishers-exact-test-385738\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
(2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMiller, P. \u003cem\u003eet al.\u003c/em\u003e The performance and accuracy of depression screening tools capable of self-administration in primary care: A systematic review and meta-analysis. \u003cem\u003eEur. J. Psychiatry\u003c/em\u003e 35, 1\u0026ndash;18 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLattie, E. G. \u003cem\u003eet al.\u003c/em\u003e Digital Mental Health Interventions for Depression, Anxiety, and Enhancement of Psychological Well-Being Among College Students: Systematic Review. \u003cem\u003eJ. Med. Internet Res.\u003c/em\u003e 21, e12869 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKaplan, V. Mental Health States of Housewives: an Evaluation in Terms of Self-perception and Codependency. \u003cem\u003eInt. J. Ment. Health Addict.\u003c/em\u003e 21, 666\u0026ndash;683 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOliver, R. L. A Cognitive Model of the Antecedents and Consequences of Satisfaction Decisions. \u003cem\u003eJ. Mark. Res.\u003c/em\u003e 17, 460\u0026ndash;469 (1980).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBurchert, S., Kerber, A., Zimmermann, J. \u0026amp; Knaevelsrud, C. Screening accuracy of a 14-day smartphone ambulatory assessment of depression symptoms and mood dynamics in a general population sample: Comparison with the PHQ-9 depression screening. \u003cem\u003ePLOS ONE\u003c/em\u003e 16, e0244955 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVellum. Claude 3.5 Sonnet vs GPT-4o. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.vellum.ai/blog/claude-3-5-sonnet-vs-gpt4o\u003c/span\u003e\u003cspan address=\"https://www.vellum.ai/blog/claude-3-5-sonnet-vs-gpt4o\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOpenAI. Introducing next-generation audio models in the API. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://openai.com/index/introducing-our-next-generation-audio-models\u003c/span\u003e\u003cspan address=\"https://openai.com/index/introducing-our-next-generation-audio-models\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOpenAI. ChatGPT \u0026mdash; Release Notes. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://help.openai.com/en/articles/6825453-chatgpt-release-notes?utm_source=chatgpt.com\u003c/span\u003e\u003cspan address=\"https://help.openai.com/en/articles/6825453-chatgpt-release-notes?utm_source=chatgpt.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOpenAI. Text to speech - OpenAI API. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://platform.openai.com\u003c/span\u003e\u003cspan address=\"https://platform.openai.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEsmaeilzadeh, P., Mirzaei, T. \u0026amp; Dharanikota, S. Patients\u0026rsquo; Perceptions Toward Human\u0026ndash;Artificial Intelligence Interaction in Health Care: Experimental Study. \u003cem\u003eJ. Med. Internet Res.\u003c/em\u003e 23, e25856 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRyan, K., Yang, H.-J., Kim, B. \u0026amp; Kim, J. P. Assessing the impact of AI on physician decision-making for mental health treatment in primary care. \u003cem\u003enpj Ment. 
Health Res.\u003c/em\u003e 4, 1\u0026ndash;8 (2025).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6976450/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6976450/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eStatic tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC\u0026thinsp;=\u0026thinsp;0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. 
Mean ratings (0\u0026ndash;10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.\u003c/p\u003e","manuscriptTitle":"Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-07 10:02:04","doi":"10.21203/rs.3.rs-6976450/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"88364721-3b5c-4456-ba6b-04902f530339","owner":[],"postedDate":"August 7th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":52702843,"name":"Biological sciences/Psychology"},{"id":52702844,"name":"Health sciences/Health care"}],"tags":[],"updatedAt":"2025-10-27T14:34:51+00:00","versionOfRecord":[],"versionCreatedAt":"2025-08-07 
10:02:04","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6976450","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6976450","identity":"rs-6976450","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
