Automated Suicide Risk Factor Monitoring in Crisis Text Line Users: Comparative Study of AI and Human Ratings Using Large Language Models

Julia Thomas, Zohar Elyoseph, Lars Kuchinke, Gunther Meinlschmidt

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6210376/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 10 Nov, 2025 in Scientific Reports.
Version 1 posted 13

Abstract
Background: The potential of Large Language Models (LLMs) for psychological diagnostics requires systematic evaluation.
Objective: To investigate conditions for reliable and valid psychological assessments, focusing on suicide risk evaluation in clinical data by comparing LLM-generated ratings with human expert ratings across configurations.
Methods: We analyzed 100 youth crisis conversation transcripts rated by four experts using the Nurses´ Global Assessment of Suicide Risk (NGASR). Using Mixtral-8x7B-Instruct, we generated ratings across three temperature settings and prompting styles (zero-shot, few-shot, chain-of-thought). Across configurations we compared a) inter-rating reliability for AI-generated NGASR risk and sum scores, b) LLM-to-human observer agreement regarding sum score, risk category, and individual items, using Krippendorff´s α, and c) classification metrics of risk categories and individual items against human ratings.
Results: LLM configuration strongly influenced assessment reliability. Zero-shot prompting at temperature 0 yielded perfect inter-rating reliability (α=1.00, 95% CI: [1-1] for high and very high risk), while few-shot prompting showed best human-AI agreement for very high risk (α=0.78, 95% CI: [0.67-0.89]) and strongest classification performance (balanced accuracy 0.54-0.71). Lower temperatures consistently improved reliability and accuracy. However, critical clinical items showed poor validity.
Discussion: Our findings establish optimal conditions (zero temperature, task-specific prompting) for LLM-based psychological assessment. However, inconsistent clinical item performance and only moderate human-AI agreement limit LLMs to initial screening rather than detailed assessment, requiring careful parameter control and validation.

Subject areas: Biological sciences/Psychology; Health sciences/Health care
Keywords: natural language processing, retrieval augmentation, machine learning, psychometry, benchmarking

Introduction
Large Language Models (LLMs) are neural networks that predict text sequences using conditional word probabilities 1 . Through self-supervised learning, they process language by optimizing billions of parameters for text prediction 2 . These “foundational models” execute various tasks based on textual instructions or “prompts” 3 .
LLMs demonstrate emergent capabilities in processing and reasoning by identifying complex word associations and developing implicit knowledge through vast training 4–6 . LLMs show promise in clinical psychology due to their language processing capabilities 7 . They support medical information retrieval 8 , treatment decisions 9 , clinical summarization 10 , and patient education 11 . Their extensive training enables access to medical and psychological knowledge 11 . AI-powered applications offer 24/7 availability, streamlined processing 12 , and reduced administrative burden 3 . LLMs´ contextual embeddings capture language nuances 13,14 and individual usage patterns 15 , while demonstrating cultural sensitivity in assessments 16 . Although not explicitly designed for psychological assessments, LLMs can be adapted through techniques like structured evaluation-based scoring 17 . However, ensuring high-quality benchmarks for clinical decision-making remains challenging due to “hallucination” - generating plausible but incorrect outputs 18 . This phenomenon, stemming from LLMs´ probabilistic nature and lack of intrinsic truth understanding 19 , poses a significant obstacle to achieving clinical-grade accuracy and reliability in LLM-based assessments 20 . This holds especially true in high risk domains of psychology and medicine. One of these domains is suicide prevention. Suicide remains a leading global mortality cause 21 , presenting an urgent need for cost-effective tools for prevention and monitoring 22–24 . Crisis text lines have demonstrated potential in suicide prevention by offering accessible support, making them an ideal testing ground for AI applications. The integration of LLM systems into mental health care holds particular promise for suicide prevention, where timely interventions can save lives. Recent studies demonstrate LLMs´ potential in psychiatric risk assessment. 
GPT-4 matched mental health professionals´ assessment capabilities 16,25 , showed enhanced risk factor detection 26 , and analyzed suicide-related media content effectively 27 . GPT-4 also achieved 0.6 precision in suicide plan prediction versus clinicians´ 0.7, with higher sensitivity (0.62 vs 0.53; 28 ). LLM analysis of crisis hotlines achieved a 76% F1 score, outperforming manual assessments and traditional deep learning 29 . However, validation studies with clinical data remain limited.

While these studies demonstrate promising potential, critical gaps remain in understanding how to achieve reliable and valid LLM-based clinical assessments. First, existing research has not systematically examined how different LLM configurations affect assessment reliability and validity. Second, while various prompting strategies exist, their comparative effectiveness for clinical assessment remains untested. Third, the impact of temperature settings on clinical judgment reliability is unexplored, particularly for high-stakes decisions. Finally, no studies have conducted item-level analyses to identify which clinical assessment components are most suitable for LLM evaluation.

To address these gaps, we investigated retrieval-augmented generation (RAG) LLM agents for structured psychological suicide assessments by measuring agreement between human and LLM ratings. We 1) examined the impact of various prompting styles (zero-shot, few-shot, and chain-of-thought) on reliability and validity, 2) compared observer agreement and classification performance against human expert raters across different operational settings, and 3) conducted a granular analysis of item-specific metrics to assess which individual items were most amenable to automated assessment.

Methods
2.1 Study Design
The study design is presented in Figure 1. The study analyzed chat transcripts from the German crisis text line krisenchat.
Four expert raters independently scored 16 items of the NGASR scale (Cutcliffe & Barker, 2004) to assess suicide risk. An LLM agent generated similar ratings using varied temperature values and prompting styles. We compared human and AI evaluations through interrater reliability, observer agreement, and classification metrics across operational settings.

2.2 Data Preparation
This study analyzed chat transcripts from krisenchat, a German preclinical crisis intervention service for individuals up to 25 years old 30 . The data comprised counseling sessions conducted between 2021-11-30 and 2022-04-30, with transcripts representing complete counseling histories.

Sample Selection and Stratification
From an initial pool of 439 labeled cases, we selected 100 cases using stratified random sampling to ensure balanced representation across NGASR-assessed risk levels. The sample was equally distributed with 25 cases in each risk category: low (≤4), moderate (5-8), high (9-11), and very high (≥12). Study participation was restricted to female participants with a minimum age of 14 years who were seeking help for themselves, excluding cases of help-seeking for others as well as male and diverse gender cases.

Data Processing
To maintain internal validity, transcripts were preserved without modifications. Risk levels were determined using majority-voted NGASR sum scores from four independent clinical experts. For binary classifications (presence/absence of risk factors), a threshold of greater than 50% agreement among human raters or LLM ratings was used to establish positive items; a 50:50 split resulted in a negative item.

2.3 Measures
The NGASR scale, developed by Cutcliffe and Barker 31 and translated into German by Kozel et al. 32,33 , is a structured 16-item questionnaire assessing evidence-based suicide risk factors, not individual suicide probability.
The scale encompasses a comprehensive range of risk factors: hopelessness, recent stress events, hallucinations/delusions, depression, social withdrawal, suicidal intention, suicide plans, family psychiatric/suicide history, recent losses, psychotic disorder, widowhood, previous attempts, poor socioeconomic conditions, substance abuse, terminal illness, and multiple hospitalizations. Five items - hopelessness, depression, suicidal plans, recent losses, and previous attempts - carry triple weight in scoring due to their elevated predictive value. Total scores indicate risk levels categorized as low (4 or below), moderate (5-8), high (9-11), and very high (12 and above). The German validation study demonstrated strong psychometric properties, with median item-wise observer agreements of 0.64 in Cohen´s Kappa (K) and 0.85 in Gwet´s AC1 (AC1), while sum score agreements reached 0.90 and 0.91 in absolute agreement of Intra-Class-Correlation (ICC) and consistency, respectively.

Rating Procedure
Four independent expert raters from a specialized suicide and self-harm counseling unit 34 conducted the clinical assessments. The raters underwent comprehensive training on NGASR items through panel ratings and group discussions using non-study cases prior to conducting assessments. Each rater independently evaluated the complete set of counseling transcripts across all NGASR items. Inter-rater agreement was evaluated using Krippendorff´s α 35,36 . To maintain consistency and prevent observer drift, integrity discussions were conducted between rating sessions 37 , allowing raters to share insights and standardize their approach without modifying existing ratings. For analysis purposes, individual ratings were aggregated using majority voting, where agreement from more than 50% of raters established positive cases. Final sum scores and risk level assignments were calculated based on these aggregated ratings, incorporating the differential item weights specified in the NGASR manual.
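The scoring rules described above (majority vote per item, triple weight for five items, four score bands) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the item keys and function names are hypothetical labels chosen here, and the weights and cut-offs are taken only from the description in this section rather than from the NGASR manual itself.

```python
from typing import Dict, List, Tuple

# Hypothetical keys for the 16 NGASR risk factors listed in the text.
NGASR_ITEMS = [
    "hopelessness", "recent_stress", "hallucinations_delusions", "depression",
    "social_withdrawal", "suicidal_intention", "suicide_plan", "family_history",
    "recent_loss", "psychotic_disorder", "widowhood", "previous_attempt",
    "poor_socioeconomic", "substance_abuse", "terminal_illness",
    "multiple_hospitalizations",
]

# The five items the text describes as carrying triple weight.
TRIPLE_WEIGHT = {
    "hopelessness", "depression", "suicide_plan", "recent_loss", "previous_attempt",
}


def majority_vote(votes: List[int]) -> int:
    """Return 1 only if strictly more than 50% of votes are positive (ties -> 0)."""
    return int(2 * sum(votes) > len(votes))


def sum_score(items: Dict[str, int]) -> int:
    """Weighted sum over aggregated binary items (weight 3 or 1)."""
    return sum((3 if name in TRIPLE_WEIGHT else 1) * items[name] for name in NGASR_ITEMS)


def risk_level(score: int) -> str:
    """Map a sum score to the four bands given in the text."""
    if score <= 4:
        return "low"
    if score <= 8:
        return "moderate"
    if score <= 11:
        return "high"
    return "very high"


def score_transcript(ratings: Dict[str, List[int]]) -> Tuple[int, str]:
    """Aggregate per-item votes (human raters or LLM repetitions), then score."""
    aggregated = {name: majority_vote(ratings[name]) for name in NGASR_ITEMS}
    total = sum_score(aggregated)
    return total, risk_level(total)
```

The same function applies to four human votes per item or to thirty LLM repetitions per item, since only the share of positive votes matters.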
2.4 LLM Framework and Implementation
We implemented a framework to reduce LLM hallucination using Mixtral 8x7B, which employs a sparse mixture-of-experts architecture to activate relevant model components for focused processing 38 . The framework converts conversations into numerical embeddings using an instructor-transformer model based on the T5 architecture 39 , enabling similarity comparisons via Euclidean distance 20,40,41 . Our RAG approach anchors LLM responses to conversation context 42 . Implementation parameters included 500-token chunks with 25% overlap, the top 5 conversation chunks, and a 0.95 similarity threshold. We tested temperature settings of 0.0, 0.5, and 1.0 to control output randomness, with lower values producing more deterministic results 43,44 .

The study explored three distinct prompting styles: zero-shot, few-shot, and chain of thought. Zero-shot prompting presented questions directly from the scale manual without examples, relying on the model´s pre-existing knowledge to interpret and rate counseling transcripts. Few-shot prompting enhanced contextual understanding by providing carefully selected positive and negative examples prior to the rating task, while avoiding potential answer bias through example selection 45 . Chain of thought (CoT) prompting encouraged structured clinical reasoning by requiring step-by-step articulation of the assessment process, enabling insight into the model´s decision-making approach. Refer to Figure 2 for example formulations of each prompting style. Each prompting style incorporated a RAG context, role specification, and clear output requirements.
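The retrieval parameters stated above (500-token chunks, 25% overlap, top 5 chunks by Euclidean distance) imply a stride of 375 tokens between chunk starts. The following is a minimal sketch under stated assumptions: tokens are passed in as a pre-tokenized list, embeddings are plain lists of floats, and both function names are hypothetical. The 0.95 similarity threshold is left out because the text does not specify how it is applied.

```python
import math
from typing import List, Sequence


def chunk_tokens(tokens: List[str], chunk_size: int = 500, overlap: float = 0.25) -> List[List[str]]:
    """Split a token list into fixed-size chunks that overlap by the given fraction."""
    if chunk_size <= 0 or not 0 <= overlap < 1:
        raise ValueError("chunk_size must be positive and overlap must be in [0, 1)")
    stride = max(1, int(chunk_size * (1 - overlap)))  # 375 for the stated settings
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the transcript
    return chunks


def top_k_by_distance(query: Sequence[float], chunk_vectors: Sequence[Sequence[float]], k: int = 5) -> List[int]:
    """Return indices of the k chunk embeddings closest to the query (Euclidean distance)."""
    order = sorted(range(len(chunk_vectors)), key=lambda i: math.dist(query, chunk_vectors[i]))
    return order[:k]
```

With a 1,000-token transcript this yields three chunks starting at tokens 0, 375, and 750, so consecutive chunks share 125 tokens.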
The implemented framework can be represented as:

Zero-Shot-Prompt = RAG context + P_role + P_output-specification + P_question
Few-Shot-Prompt = RAG context + P_role + P_output-specification + P_examples + P_question
Chain-of-Thought-Prompt = RAG context + P_role + P_output-specification + P_question + P_CoT-instruction

The LLM generated 4,320 ratings per transcript (16 NGASR items × 3 temperature settings × 3 prompting styles × 30 repetitions). We termed each prompting style and temperature combination an operational configuration. For each configuration, we aggregated individual item ratings through majority voting, requiring >50% agreement to establish positive cases. We then calculated risk levels and sum scores following the NGASR manual´s scoring rules.

2.5 Statistical Analysis
Descriptive Analysis
For descriptive analysis, we characterized sociodemographic characteristics and service usage behaviors using frequencies for categorical variables and means and standard deviations for normally distributed continuous variables. These statistics were stratified by risk level, with differences between risk levels evaluated using Chi-square tests for categorical variables and ANOVAs for normally distributed continuous variables.

Reliability Analysis
We assessed LLM measurement reliability of NGASR risk levels using Krippendorff´s α coefficients, treating each binary risk level as an independent rater. Krippendorff´s α accommodates multiple scale types (binary, ordinal, metric), enabling consistent comparison across NGASR items, risk levels, and sum scores. We used established agreement thresholds: perfect (1), substantial (≥0.80), moderate (0.67-0.79), weak (0.60-0.66), and poor (<0.60). Negative values indicated systematic disagreement. Uncertainty was quantified through bootstrapping (1000 resamples) to compute 95% confidence intervals for α values per operational configuration.
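To make the reliability computation concrete, the sketch below computes Krippendorff´s α for nominal (for example binary) ratings and a bootstrap confidence interval over transcripts. It is a from-scratch illustration, not the Krippendorff package used in the study. The percentile interval and the resampling of whole transcripts are assumptions made here, since the text does not state which bootstrap variant was used, and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Nominal Krippendorff's alpha; each unit holds the ratings given to one case."""
    units = [u for u in units if len(u) >= 2]  # units with one rating are not pairable
    totals: Counter = Counter()
    disagreement = 0.0
    for u in units:
        counts = Counter(u)
        m = len(u)
        totals.update(counts)
        # Ordered pairs of unequal values within the unit, weighted by 1 / (m - 1).
        disagreement += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    n = sum(totals.values())
    expected = n * n - sum(c * c for c in totals.values())
    if expected == 0:
        return float("nan")  # fewer than two categories observed: alpha is undefined
    # alpha = 1 - D_o / D_e with D_o = disagreement / n and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * disagreement / expected


def bootstrap_alpha_ci(
    units: Sequence[Sequence[Hashable]],
    n_resamples: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI, resampling units (transcripts) with replacement."""
    rng = random.Random(seed)
    stats: List[float] = []
    for _ in range(n_resamples):
        sample = [units[rng.randrange(len(units))] for _ in units]
        alpha = krippendorff_alpha_nominal(sample)
        if alpha == alpha:  # drop NaN from degenerate resamples
            stats.append(alpha)
    if not stats:
        return float("nan"), float("nan")
    stats.sort()
    lower = stats[int((1 - level) / 2 * (len(stats) - 1))]
    upper = stats[int((1 + level) / 2 * (len(stats) - 1))]
    return lower, upper
```

A percentile interval can never exceed 1, so intervals with upper bounds above 1 would point to a different interval construction than the one sketched here.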
Observer Agreement Analysis
Using individual item ratings, we calculated sum scores and risk levels for each LLM and human rating. To evaluate validity against human ratings, we employed Krippendorff´s α coefficient with regression bias correction, accounting for nested rater groups of human raters and LLM ratings. This correction adjusts for the fact that overall agreement between groups is limited by within-group agreement levels, providing more accurate estimates of true inter-group agreement. Separate α coefficients were computed for risk levels and sum scores, aggregated per operational configuration. The overall α value was corrected for within-group agreement using:

α_corrected = α_observed + β(α_expected - α_observed)

where α_corrected represents the regression bias corrected Krippendorff´s α, α_observed is the originally calculated α, α_expected is the expected α value under the null hypothesis (typically 0), and β represents the regression coefficient capturing the relationship between within-group and between-group agreement rates. This coefficient essentially determines how much the observed agreement should be adjusted based on within-group rating patterns. We quantified uncertainty through bootstrapping (1000 resamples) to compute 95% confidence intervals for individual α values of risk levels per operational configuration.

Classification Performance Analysis
We derived final ratings through majority voting (>50% agreement) from LLM outputs (30 ratings per item/prompting style/temperature combination) and human ratings (4 per item). NGASR sum scores were calculated and categorized into four risk levels. For each risk level, we computed binary classification metrics by comparing that level against all others combined (e.g., "high risk" vs. "not high risk"). We assessed validity against the human gold standard through balanced accuracy, sensitivity, and specificity.
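The one-vs-rest comparison described above can be sketched as follows. This is an illustrative sketch rather than the study's code: the function name and the example labels are hypothetical, and human consensus labels are treated as the reference.

```python
from typing import Dict, List


def one_vs_rest_metrics(y_true: List[str], y_pred: List[str], level: str) -> Dict[str, float]:
    """Binary metrics for one risk level against all other levels combined."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == level and pred == level:
            tp += 1
        elif truth == level:
            fn += 1  # reference says this level, prediction says another
        elif pred == level:
            fp += 1  # prediction says this level, reference says another
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical labels for six transcripts (not study data).
human = ["low", "low", "high", "high", "moderate", "very high"]
llm = ["low", "high", "high", "low", "moderate", "very high"]
high_metrics = one_vs_rest_metrics(human, llm, "high")
```

Because balanced accuracy averages sensitivity and specificity, a configuration that almost never predicts a level scores close to 0.5 even when its specificity is near 1.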
Balanced accuracy addresses imbalanced sample rates by measuring detection ability for both present and absent risk factors, crucial for rare but critical symptoms. Sensitivity measures the ability to detect present risk factors relative to the positive case base rate, critical for identifying potential dangers. Specificity evaluates correct identification of true negatives compared to the negative case base rate, important for avoiding false alarms that incur costly and unnecessary treatment. Performance above respective base rates indicates meaningful discriminative ability, distinguishing true predictive performance from class distribution effects. We calculated 95% confidence intervals through bootstrapping (1000 resamples) for each operational configuration. Balanced accuracy values exceeding 0.5 indicate above-random performance.

Item Specific Analysis
To evaluate automation potential across risk factors, we conducted item-specific analyses using deterministic model outputs (temperature 0) from different prompting approaches. Item observer agreement per prompting style was evaluated through Krippendorff´s α coefficient with regression bias correction. Final item classifications were derived via majority voting for each prompting style and compared against human consensus ratings. We evaluated classification performance through balanced accuracy, sensitivity, and specificity metrics, comparing these against respective base rates to determine significant improvements over chance-level performance.

Error Analysis
We analyzed cases where LLM ratings diverged from expected clinical reasoning through qualitative assessment. Our examination of chain-of-thought outputs revealed patterns in failed assessments. We analyzed both content and structure of the model´s clinical reasoning process, focusing on deviations from standard clinical judgment.

2.6 Tools and Software
Analyses were conducted using Python 3.8 on a Google Cloud Platform Kubernetes cluster.
A 5-bit quantized Mixtral 8x7B model was deployed on a 24GB L4 GPU machine using Ollama. The workflow utilized LangChain 46 for LLM interaction and retrieval augmentation, Pandas 47 for data manipulation, the Pingouin 48 and Krippendorff 49 packages for statistical calculations, Seaborn 50 and Matplotlib 51 for visualizations, and re for regular expression string matching 52 .

2.7 Ethical Considerations
All methods in this study were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by the Ethics Committee of the International Psychoanalytic University (IPU) Berlin (approval number: 2023_08). Informed consent was obtained from all subjects through krisenchat's terms of service, which explicitly state that user data may be used for research purposes without direct identification of individuals. All personally identifiable information was removed from chat transcripts during preprocessing. The study utilized existing data from the crisis helpline, and participants were not compensated as this was a secondary analysis of routine service data. Research was performed in accordance with the Declaration of Helsinki.

2.8 Data Availability Statement
The datasets generated during and/or analysed during the current study are not publicly available and cannot be shared due to the highly sensitive and confidential nature of crisis helpline chat transcripts from vulnerable individuals, including minors who cannot provide consent for data sharing. These conversations frequently contain personal details and sensitive information regarding mental health and suicidal ideation. This restriction is necessary to protect participant privacy and confidentiality and to comply with ethical guidelines and data protection regulations, including the General Data Protection Regulation (GDPR).
The nature of our Institutional Review Board approval and ethical framework for this research explicitly prohibits any sharing of this data beyond the approved research team. For questions about the methodological approach, the corresponding author J.T at [email protected] may be contacted.

Results
Descriptive Analysis
The analysis included 100 cases stratified by NGASR-assigned risk levels: low (≤4), moderate (5-8), high (9-11), and very high (≥12), randomly sampled from 439 labeled cases. Chi-square tests indicated group differences in age, with very high risk cases showing higher overall age. Refer to Table 1 for more detail. Analysis of demographic and interaction variables across risk levels revealed no significant differences. ANOVA tests yielded F-statistics of 1.088 for age ( p =0.358), 2.170 for counselor messages ( p =0.097), 1.396 for chatter messages ( p =0.249), and 1.634 for session count ( p =0.187), suggesting consistency across risk levels.

Reliability Analysis
Human raters demonstrated varying reliability across risk levels, from high reliability in low-risk cases (α = 0.91 [0.85, 0.97]) to weaker agreement in high-risk assessments (α = 0.63 [0.51, 0.74]). LLM reliability analysis revealed distinct patterns across risk levels and prompting approaches. For low-risk cases, few-shot prompting at temperature 0 achieved the highest reliability (α = 0.98 [0.95, 1.02]), exceeding human reliability. Zero-shot maintained perfect reliability (α = 1.00) for high and very high risk levels, surpassing human agreement (α = 0.63 [0.51, 0.74] and α = 0.76 [0.66, 0.87] respectively). See Table 2 for a detailed breakdown of human observer agreement. Temperature increase markedly affected reliability across prompting styles. Chain-of-thought showed the most pronounced degradation, with low-risk reliability dropping from α = 0.97 [0.93, 1.01] to α = 0.61 [0.50, 0.72] between temperature 0 and 1 (Figure 3).
Few-shot demonstrated more stability, particularly in very high risk cases (α = 0.97 [0.92, 1.01] at temperature 0 to α = 0.80 [0.72, 0.89] at temperature 1). Sum scores showed systematic disagreement in both human (α = -0.02 [-0.16, 0.11]) and LLM ratings, with the effect amplifying at higher temperatures. For all LLM inter-rating reliability and observer agreement values, refer to Table 3.

Observer Agreement Analysis
As highlighted in Figure 4, panel A, few-shot prompting at temperature 0 achieved the highest agreement for low-risk cases (α = 0.78 [0.68, 0.88]), while zero-shot showed the poorest agreement (α = 0.39 [0.24, 0.53]). Higher temperatures minimally affected few-shot performance but degraded chain-of-thought agreement from α = 0.72 [0.62, 0.83] to α = 0.67 [0.54, 0.79]. For moderate risk cases, all prompting styles demonstrated weak agreement, with chain-of-thought and few-shot at temperature 0 performing marginally better (α = 0.33 [0.19, 0.48]). Agreement declined with temperature increases, most notably in chain-of-thought, which dropped to α = 0.20 [0.03, 0.37] at temperature 1. In high-risk evaluations, zero-shot demonstrated the strongest agreement (α = 0.67 [0.55, 0.80]) at temperature 0, maintaining stability across temperatures, while few-shot and chain-of-thought showed marked degradation with increased temperatures, dropping to α = 0.34 [0.15, 0.52] and α = 0.35 [0.17, 0.52] respectively. For very high-risk cases, few-shot at temperature 0 achieved the highest agreement (α = 0.78 [0.67, 0.89]), with all styles maintaining relatively stable performance. Zero-shot demonstrated the most consistent agreement (α = 0.75 [0.64, 0.86]) across temperature settings. Lastly, sum scores revealed systematic disagreement across all configurations, with negative α values deteriorating at higher temperatures.
Few-shot at temperature 0 showed the least disagreement (α = -0.58 [-0.65, -0.50]), while chain-of-thought at temperature 1 demonstrated the strongest disagreement (α = -0.90 [-0.94, -0.86]). See also Table 3.

Classification Performance
The LLM framework demonstrated distinct performance patterns across risk levels, as highlighted in Figure 5, panel A. For low-risk cases, performance was consistently strong (balanced accuracy, BA: 0.71-0.72 [0.67-0.79]), with zero-shot prompting achieving the highest sensitivity (0.93 [0.81-1.00]) despite lower specificity (0.49 [0.42-0.61]). Few-shot prompting provided better balance with high specificity (0.92 [0.90-0.95]), maintaining stable performance across temperature settings. This is particularly valuable because it minimizes false positives, reducing unnecessary clinical interventions while maintaining screening efficiency.

Meanwhile, performance declined substantially for moderate risk cases, approaching random classification. Few-shot prompting showed marginally better results (BA: 0.54 [0.49-0.59]) with balanced sensitivity (0.42 [0.26-0.55]) and specificity (0.67 [0.58-0.74]), while temperature variations had minimal impact on classification accuracy. Near-random classification and lowered sensitivity may raise concerns, as missing these cases could prevent early intervention.

For high-risk cases, few-shot prompting demonstrated superior performance (BA: 0.67 [0.55-0.74]), achieving better sensitivity (0.62 [0.50-0.73]) and specificity (0.71 [0.66-0.82]). In contrast, zero-shot´s poor sensitivity (0.05 [0.00-0.11]) poses substantial clinical risk, despite high specificity (0.93 [0.89-0.97]).

Lastly, for very high-risk cases, few-shot prompting achieved the best balanced accuracy (BA: 0.61 [0.58-0.65]), maintaining moderate sensitivity (0.31 [0.09-0.53]) and high specificity (0.92 [0.86-0.96]). Chain-of-thought showed similar performance (BA: 0.60 [0.52-0.66]) but lower sensitivity (0.30 [0.17-0.43]).
Zero-shot performed worst with perfect specificity (1.00 [1.00-1.00]) but negligible sensitivity (0.06 [0.00-0.10]), making it clinically unsuitable for severe risk assessment where missed cases have the highest potential consequences. See also Table 4 for a detailed breakdown of all values.

Item Specific Analysis
Observer agreement varied across NGASR items at temperature 0, with distinct patterns for different item types (Figure 4, Panel A). Behaviorally-anchored items showed the highest agreement: hearing voices achieved near-perfect agreement (human α = 1.00; chain-of-thought α = 0.95, 95% CI: [0.90-1.00]; few-shot α = 0.91, 95% CI: [0.84-0.98]; zero-shot α = 0.97, 95% CI: [0.93-1.01]). Items requiring clinical inference showed lower agreement: hopelessness assessment demonstrated poor agreement (human α = 0.62; chain-of-thought α = 0.24, 95% CI: [0.06-0.41]; few-shot α = 0.29, 95% CI: [0.12-0.46]; zero-shot α = 0.42, 95% CI: [0.26-0.58]).

Classification metrics revealed similar patterns (Figure 5, Panel B). Behavioral items showed strong performance: hearing voices achieved high balanced accuracy with few-shot prompting (BA = 0.97, 95% CI: [0.94-0.99]). Complex clinical items performed near random: social withdrawal showed BA = 0.62, 95% CI: [0.47-0.77], despite high human reliability (α = 0.92). Few-shot prompting achieved the highest balanced accuracy for suicide ideation (BA = 0.80, 95% CI: [0.69-0.89]).

Sensitivity varied by item type and prompting style. Few-shot excelled with behavioral items (hearing voices: 1.00, 95% CI: [1.00-1.00]), while zero-shot struggled with suicide assessment (suicide plan: 0.04, 95% CI: [0.00-0.11]). All prompting styles maintained high specificity, particularly for observable factors (hearing voices - chain-of-thought: 0.98, 95% CI: [0.94-1.00]; few-shot: 0.93, 95% CI: [0.88-0.98]; zero-shot: 0.99, 95% CI: [0.96-1.00]).
Regarding the trade-off between sensitivity and specificity, zero-shot excelled in specificity but struggled with sensitivity, particularly for suicide-related items. Few-shot achieved the most balanced trade-off, maintaining good sensitivity without sacrificing specificity. Chain-of-thought showed moderate performance in both metrics but with less extreme trade-offs. This suggests that improvements in sensitivity often came at minimal cost to specificity, particularly for few-shot prompting.

Error Analysis
Our error analysis revealed critical inconsistencies and logical failures in clinical reasoning, even under identical conditions. Using chain-of-thought prompting across temperatures, the model not only provided contradictory assessments but also demonstrated fundamental logical errors in clinical judgment. In one striking example, the model concluded: "while there are indications of suicidal thoughts [...] there is no explicit expression of current suicidal ideation" despite previously noting "the patient confirms having a plan for suicide." This represents a severe logical error, as the presence of a suicide plan necessarily implies suicidal ideation. In another case with similar input, the model imposed hallucinated diagnostic criteria: "while the patient frequently discusses their intense suicidal thoughts, they do not express any actual suicidal ideation in terms of having a plan or intent." Yet, given similar input under identical operational conditions, it correctly identified suicidal ideation based solely on thought content: "the patient expresses suicidal ideation with an intensity of 65 out of 100[...]". These inconsistencies and logical failures suggest that despite the appearance of structured clinical reasoning through step-by-step analysis, the model lacks a thorough understanding of the hierarchical and logical relationships between clinical concepts.
Discussion
This study evaluated the performance of an LLM for standardized psychological risk assessments using the Mixtral 8x7B model under a RAG framework. We assessed the LLM´s ability to rate binary items from the Nurses´ Global Assessment of Suicide Risk (NGASR) scale in German crisis text line transcripts, focusing on different prompting styles (zero-shot, few-shot, chain of thought) and temperature settings, which in combination we call operational configurations.

5.1 Principal Results
Our analysis revealed distinct patterns in LLM performance across reliability, observer agreement, and classification metrics. While LLMs demonstrated high internal consistency, particularly at temperature 0, this reliability did not translate to clinical validity. Zero-shot prompting achieved the highest internal consistency but showed poor alignment with human ratings, especially for complex clinical judgments. Few-shot prompting offered better balance, achieving the strongest human-AI agreement for very high risk categories, though agreement remained only moderate overall.

Classification performance highlighted critical limitations in risk assessment. The framework performed best for low-risk cases but approached random classification for moderate risks. Few-shot prompting at temperature 0 provided the most balanced performance for initial screening, while zero-shot showed concerning patterns of high specificity but negligible sensitivity for high-risk cases - a limitation particularly problematic in suicide risk assessment where missing cases could have catastrophic consequences. Notably, sensitivity decreased with increasing risk levels across all prompting styles. While structured prompting improved surface-level metrics, detailed examination revealed persistent issues in clinical reasoning consistency.
Given these limitations, current LLM capabilities fall short of the requirements for fine-grained clinical assessment, necessitating mandatory clinical verification for moderate to high-risk cases and emphasizing that LLMs should augment rather than replace clinical judgment.

Item-level analysis revealed clear performance patterns based on item characteristics. The framework performed well on behaviorally anchored items such as hearing voices but struggled with items requiring complex clinical inference, such as the assessment of hopelessness. Few-shot prompting showed advantages for suicide-related items, though performance remained below human agreement levels. These patterns suggest that LLM effectiveness varies significantly with the type of clinical judgment required, performing best when assessing concrete, observable factors rather than interpretative clinical concepts.

5.2 Merits and Limitations

Our study offered valuable ecological validity by analyzing real clinical data from a German crisis text line, though generalizability is limited by the narrow demographic scope (female youth) and potential language model biases in youth communication patterns. The systematic comparison of prompting styles and temperatures revealed reliability-performance trade-offs, but excluded temporal crisis dynamics and multimodal assessment factors. Our comprehensive evaluation framework included confidence intervals and multiple reliability metrics, though binary classification may oversimplify risk progression. Item-level analysis distinguished between behavioral and interpretative assessments, despite uneven base rates affecting the measurement of discriminative ability. The technical implementation featured state-of-the-art components but faced limitations in embedding quality variability and chunk size optimization. While demonstrating research feasibility, the reliance on high-performance GPUs limits practical scalability.
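To make the reliability metric referred to above concrete, the following sketch computes Krippendorff's α for nominal data from a coincidence matrix, using α = 1 − D_o/D_e. It is an illustration only and rests on several assumptions: it is a hand-written implementation rather than the `krippendorff` package cited in the references, it omits the regression-bias correction and the confidence intervals reported in the tables, and the function name and toy ratings are hypothetical. Returning 1.0 when only one category is observed mirrors the table notes ("perfect agreement coded as 1.0").

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal ratings.

    `units` holds one sequence of ratings per rated unit (for example,
    one per transcript). Units with fewer than two ratings are skipped
    because they contain no pairable values.
    """
    coincidences = Counter()  # o_ck: weighted counts of ordered value pairs
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()  # n_c: marginal total per category
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed and expected disagreement, both scaled by n (the factor cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # a single category only; coded as 1.0 by assumption
    return 1.0 - observed / expected


# Toy data: four units, two raters each, binary item ratings.
toy = [[0, 0], [1, 1], [0, 1], [1, 1]]
print(krippendorff_alpha_nominal(toy))
```

For the toy data, the off-diagonal coincidences sum to 2, the category totals are 3 and 5, and n = 8, so α = 1 − (2 · 7)/30 = 8/15 ≈ 0.53.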
Expert clinical ratings provided high-quality ground truth data, though rater diversity and variations in expertise were not explored. German-specific cultural and linguistic nuances warrant further investigation. Confident but incorrect LLM responses deserve deeper examination, as the nature of and reasons for hallucination were not the focus of this work. Overall, the results reflect one specific implementation choice rather than inherent LLM capabilities, suggesting potential for alternative approaches.

An important limitation of this study relates to the NGASR scale itself, which was not originally designed for youth populations. The scale's applicability to adolescents may be limited by developmental considerations not accounted for in its original validation. Furthermore, several NGASR items provide minimal scoring instructions, creating inherent ambiguity that challenges both human raters and LLMs. Where human raters struggled to achieve consensus (particularly for the moderate and high risk categories), the LLM similarly demonstrated lower performance. This pattern is mathematically expected given that the regression-bias-corrected Krippendorff's α depends on human agreement levels, creating a ceiling effect on potential human-AI agreement. The strong performance observed in the low and very high risk categories, contrasted with poorer results in moderate risk assessment, may therefore reflect inherent psychometric limitations of the scale rather than solely AI capability constraints.

5.3 Comparison with Prior Work

Our study advances the emerging field of LLM applications in psychological assessment through three key contributions: implementation of state-of-the-art prompting frameworks, extension into psychological rather than purely medical assessments, and validation on real-world clinical data. While previous research has demonstrated LLMs' potential in medical contexts, with Singhal et al.
11 achieving notable accuracy on MedQA exam questions using Flan-PaLM, psychological applications present unique challenges requiring specialized approaches. Our investigation bridges the gap between theoretical benchmarks and practical psychological assessments by implementing sophisticated prompting frameworks in mental health contexts.

Recent developments in psychological applications of LLMs have shown promising directions but remained largely experimental. Yang et al. 53 developed the PsyCoT framework for personality trait detection, while Chen et al. 54 focused on cognitive distortion detection through their Diagnosis of Thought (DoT) framework. These approaches demonstrated LLMs' potential for psychological reasoning but were limited to specific domains. Our research extends these efforts by adapting structured prompting techniques to standardized suicide risk assessment, building particularly on Wu et al.'s 55 work on chain-of-thought prompting for diagnostic reasoning.

A crucial distinction of our study lies in its use of authentic clinical data. While previous work, such as Blanco-Cuaresma's 17 analysis of suicide risk in Reddit comments, relied on public social media data, our study utilized real crisis helpline transcripts. This represents a significant advance in ecological validity, as it evaluates LLM performance in the actual context where such systems might be deployed. This clinical dataset allowed us to assess not only technical performance but also practical applicability in authentic healthcare settings. The comprehensive evaluation of diverse prompting styles and hyperparameters on real clinical data offers unique insights into the practical challenges of implementing LLMs in mental health assessment. Our findings contribute a vital understanding of both the potential and the limitations of LLMs in psychological assessment, particularly in high-stakes domains like suicide risk evaluation.

Our findings both confirm and challenge previous research.
While we confirm Yang et al.'s 53 observation that LLMs can engage in psychological reasoning, our error analysis reveals more severe limitations in clinical logic than previously reported. Similarly, while we support Wu et al.'s 55 finding that chain-of-thought prompting can improve reasoning transparency, we found that it actually decreased reliability at higher temperatures, a crucial distinction for clinical applications. Unlike Blanco-Cuaresma's 17 promising results with social media data, our analysis of clinical transcripts showed substantially lower performance, particularly for moderate risk cases, highlighting the challenges of real-world clinical assessment versus public data analysis.

5.4 Clinical Implications

Our findings identify three promising clinical applications for LLMs in psychological assessment. First, LLMs can serve as preliminary screening tools in high-volume clinical settings, supporting initial triage decisions. Second, they can function as decision support systems, providing structured evaluations to complement clinical judgment. Third, LLMs can help standardize assessment approaches across different clinical contexts, improving multi-site consistency.

Current implementations require specific conditions for optimal performance. Temperature settings and prompting styles significantly influence assessment reliability, necessitating careful calibration. LLM performance varies across clinical indicators, performing best with concrete behavioral symptoms rather than complex clinical judgments. Performance on critical risk factors, particularly suicide-related items, remains insufficient for autonomous clinical use.

Advancing clinical viability requires enhanced prompting strategies for consistent reasoning, robust RAG mechanisms for diverse cases, and optimized parameters and validation protocols for human-AI agreement. Implementation must address ethical considerations including informed consent, data privacy, and regulatory compliance.
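Why temperature affects the reliability of repeated ratings can be illustrated with the standard temperature-scaled softmax over next-token scores. The sketch below is a generic illustration and not the study's inference code: the function name and the example scores are hypothetical, and treating temperature 0 as greedy selection of the top-scoring token is a common convention that is assumed here.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw token scores into sampling probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case (assumed convention): all probability mass goes to the
        # top-scoring token, with ties sharing the mass equally.
        top = max(logits)
        winners = [1.0 if x == top else 0.0 for x in logits]
        total = sum(winners)
        return [w / total for w in winners]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
scores = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
```

As the temperature rises, probability mass shifts from the top-scoring token to the alternatives, so repeated runs on the same transcript are more likely to diverge, which is in line with the lower inter-rating reliability observed at temperatures 0.5 and 1.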
While our study demonstrates one possible approach, alternative implementations may yield improved performance. However, any clinical applications must be developed with careful attention to both technical performance and ethical implications, particularly in high-stakes domains like suicide risk assessment.

5.5 Future Directions

Our findings indicate key priorities for advancing LLM applications in mental health assessment. We need mental health-specific LLMs that better capture psychological nuances, supported by open-source development for scientific replication. Current models show basic clinical reasoning capabilities but require specialized architectures and training. When evaluating AI performance in psychological assessment, future research should consider using assessment instruments that have more precise operational definitions and are better validated for youth populations.

Moreover, robust validation frameworks are essential, as our error analysis revealed that standard metrics may mask reasoning failures. Future protocols must detect logical inconsistencies, ensure diagnostic concept hierarchies are understood, and validate criterion consistency. The temporal aspects of assessment also need attention, as current LLMs lack mechanisms to model mental state progression over time.

Practical challenges include developing privacy-preserving clinical datasets for domain adaptation and addressing cultural-linguistic variations in psychological expression. LLM decision interpretability requires investigation, particularly regarding hallucinated criteria.

5.6 Final Conclusion

This study advances our understanding of LLM applications in psychological assessment through systematic evaluation of implementation parameters and real-world clinical data.
While our findings demonstrate potential for supporting specific aspects of clinical work, particularly in initial screening and standardization of assessment procedures, they also reveal fundamental challenges in clinical reasoning that current implementations have yet to overcome. The observed pattern of decreasing sensitivity with increasing risk levels poses particular concerns for high-stakes clinical applications. Our methodological framework, emphasizing comparison against base rates and comprehensive error analysis, provides valuable guidance for future evaluations of AI systems in clinical settings. The stark contrast between surface-level performance metrics and detailed reasoning analysis emphasizes the need for more sophisticated validation approaches in clinical AI research.

Looking forward, these findings suggest that advancing LLM applications in psychological assessment requires not just technical improvements, but a fundamental reconsideration of how we implement and validate AI systems in clinical contexts. While current implementations are not ready for autonomous clinical application, they point toward promising directions for human-AI collaborative systems that leverage the strengths of both automated and human assessment.

Declarations

Conflict of Interest

J.T. is employed by and receives a salary from krisenchat. krisenchat had no impact on the design of this study and did not influence the collection, execution, analyses, interpretation of the data, or the decision to submit the article/contribution for publication. GM received funding from the Stanley Thomas Johnson Stiftung & Gottfried und Julia Bangerter-Rhyner-Stiftung under projects no. PC 28/17 and PC 05/18, from Gesundheitsförderung Schweiz under project no. 18.191/K50001, from the Swiss Heart Foundation under project no. FF21101, from the Research Foundation of the International Psychoanalytic University (IPU) Berlin under projects no.
5087 and 5217, from the Swiss National Science Foundation (SNSF) under project no. 100014_135328, from the German Federal Ministry of Education and Research under budget item 68606 in the context of an evaluation project conducted amongst others in collaboration with krisenchat, from the Hasler Foundation under project no. 23004, in the context of a Horizon Europe project from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00094, and from Wings Health in the context of a proof-of-concept study. GM is a co-founder and shareholder of Therayou AG, active in digital and blended mental healthcare. GM receives royalties from publishing companies as author, including a book published by Springer, and an honorarium from Lundbeck for speaking at a symposium. Furthermore, GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator ('Selbsterfahrungsleiter'), and for postgraduate training of psychotherapists and supervisors. L.K. and Z.E. have no competing interests.

We used artificial intelligence (AI)-based tools, including Claude and ChatGPT, to support manuscript preparation. Further, we used publicly available search technologies, which we recognize likely utilise AI capabilities. We confirm that the contributions of AI were strictly in an assistive capacity. AI was not involved in conceptual tasks. Human oversight was continuously employed to ensure the accuracy of content and address any ethical concerns.

Author Contributions

JT led the project, developed the study concept in collaboration with the co-authors, performed the data analysis, wrote the original manuscript, and created the visualizations. GM supervised the project, contributed to conceptualization, provided methodological guidance, and critically reviewed and edited the manuscript. ZE provided expertise in artificial intelligence methods and contributed to manuscript review and editing.
LK provided methodological expertise in statistical analysis and contributed to manuscript review and editing. All authors contributed to manuscript revision, read, and approved the submitted version.

Funding

There was no funding for this study.

References

1. Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).
2. Vaswani, A. et al. Attention is All you Need. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
3. Bommasani, R. et al. On the Opportunities and Risks of Foundation Models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2022).
4. Wei, J. et al. Emergent Abilities of Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2206.07682 (2022).
5. Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2201.11903 (2023).
6. Zhang, Z. et al. Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents. Preprint at https://doi.org/10.48550/arXiv.2311.11797 (2023).
7. Ke, L., Tong, S., Cheng, P. & Peng, K. Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. Preprint at https://doi.org/10.48550/arXiv.2401.01519 (2024).
8. Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med. 7, 257 (2024).
9. Stade, E. C. et al. Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. Npj Ment. Health Res. 3, 1–12 (2024).
10. Van Veen, D. et al. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Res. Sq. rs.3.rs-3483777 (2023) doi:10.21203/rs.3.rs-3483777/v1.
11. Singhal, K. et al. Large Language Models Encode Clinical Knowledge. Preprint at https://doi.org/10.48550/arXiv.2212.13138 (2022).
12. Er-Rays, Y. & M’dioud, M. ChatGPT in Healthcare Facilities: An Overview and Innovations in Technical Efficiency Analysis. SSRN Scholarly Paper at https://doi.org/10.2139/ssrn.4771070 (2024).
13. Kjell, O. N. E., Sikström, S., Kjell, K. & Schwartz, H. A. Natural language analyzed with AI-based transformers predict traditional subjective well-being measures approaching the theoretical upper limits in accuracy. Sci. Rep. 12, 3918 (2022).
14. Kjell, O. N. E., Kjell, K. & Schwartz, H. A. Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment. Psychiatry Res. 333, 1–12 (2024).
15. Ganesan, A. V., Matero, M., Ravula, A. R., Vu, H. & Schwartz, H. A. Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality. Proc. Conf. Assoc. Comput. Linguist. North Am. Chapter Meet. 2021, 4515–4532 (2021).
16. Levkovich, I., Shinan-Altman, S. & Elyoseph, Z. Can large language models be sensitive to culture suicide risk assessment? J. Cult. Cogn. Sci. (2024) doi:10.1007/s41809-024-00151-9.
17. Blanco-Cuaresma, S. Psychological Assessments with Large Language Models: A Privacy-Focused and Cost-Effective Approach. Preprint at https://doi.org/10.48550/arXiv.2402.03435 (2024).
18. Ji, Z. et al. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 1–38 (2023).
19. Bishop, J. M. Artificial Intelligence Is Stupid and Causal Reasoning Will Not Fix It. Front. Psychol. 11, (2021).
20. Amatriain, X. Measuring and Mitigating Hallucinations in Large Language Models: A Multifaceted Approach. (2024).
21. Adolescent mortality ranking - top 5 causes (country). WHO Data https://platform.who.int/data/maternal-newborn-child-adolescent-ageing/indicator-explorer-new/mca/adolescent-mortality-ranking---top-5-causes-(country).
22. Bernert, R. A. et al. Artificial Intelligence and Suicide Prevention: A Systematic Review of Machine Learning Investigations. Int. J. Environ. Res. Public. Health 17, 5929 (2020).
23. Lejeune, A. et al. Artificial intelligence and suicide prevention: A systematic review. Eur. Psychiatry 65, e19 (2022).
24. Menon, V. & Vijayakumar, L. Artificial intelligence-based approaches for suicide prediction: Hope or hype? Asian J. Psychiatry 88, 103728 (2023).
25. Elyoseph, Z., Levkovich, I., Haber, Y. & Levi-Belz, Y. Using GenAI to Train Mental Health Professionals in Suicide Risk Assessment: Preliminary Findings. (2024). doi:10.1101/2024.07.17.24310579.
26. Shinan-Altman, S., Elyoseph, Z. & Levkovich, I. The impact of history of depression and access to weapons on suicide risk assessment: a comparison of ChatGPT-3.5 and ChatGPT-4. PeerJ 12, (2024).
27. Elyoseph, Z. et al. Applying Language Models for Suicide Prevention: Evaluating News Article Adherence to WHO Reporting Guidelines. (2024). doi:10.21203/rs.3.rs-4180591/v1.
28. Lee, C., Mohebbi, M., O’Callaghan, E. & Winsberg, M. Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study. JMIR Ment. Health 11, e58129 (2024).
29. Chen, Y. et al. Deep Learning and Large Language Models for Audio and Text Analysis in Predicting Suicidal Acts in Chinese Psychological Support Hotlines. Preprint at https://doi.org/10.48550/arXiv.2409.06164 (2024).
30. Baldofski, S. et al. The Impact of a Messenger-Based Psychosocial Chat Counseling Service on Further Help-Seeking Among Children and Young Adults: Longitudinal Study. JMIR Ment. Health 10, e43780 (2023).
31. Cutcliffe, J. R. & Barker, P. The Nurses’ Global Assessment of Suicide Risk (NGASR): developing a tool for clinical practice. J. Psychiatr. Ment. Health Nurs. 11, 393–400 (2004).
32. Kozel, B., Grieser, M., Rieder, P., Seifritz, E. & Abderhalden, C. Nurses’ Global Assessment of Suicide Risk – Skala (NGASR): Die Interrater-Reliabilität eines Instrumentes zur systematisierten pflegerischen Einschätzung der Suizidalität. Z. Für Pflegewissenschaft Psych. Gesundh. 1, 17–26 (2007).
33. Kozel, B., Hegedüs, A., Dassen, T. & Abderhalden, C. Die Kriteriumsvalidität der deutschen Version der Nurses’ Global Assessment of Suicide Risk Scale (NGASR-Scale). in 186–191 (2012).
34. Kohls, E. et al. Suicidal Ideation Among Children and Young Adults in a 24/7 Messenger-Based Psychological Chat Counseling Service. (2022) doi:10.18452/24781.
35. Krippendorff, K. Computing Krippendorff’s Alpha-Reliability. (2011).
36. Krippendorff, K. Reliability in Content Analysis. Hum. Commun. Res. 30, 411–433 (2004).
37. Kazdin, A. E. Artifact, Bias, and Complexity of Assessment: The ABCs of Reliability. J. Appl. Behav. Anal. 10, 141–150 (1977).
38. Jiang, A. Q. et al. Mixtral of Experts. Preprint at https://doi.org/10.48550/arXiv.2401.04088 (2024).
39. Su, C. et al. Machine learning for suicide risk prediction in children and adolescents with electronic health records. Transl. Psychiatry 10, 1–10 (2020).
40. Perković, G., Drobnjak, A. & Botički, I. Hallucinations in LLMs: Understanding and Addressing Challenges. in 2024 47th MIPRO ICT and Electronics Convention (MIPRO) 2084–2088 (2024). doi:10.1109/MIPRO60963.2024.10569238.
41. Hong, G. et al. The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2404.05904 (2024).
42. Gao, Y. et al. Retrieval-Augmented Generation for Large Language Models: A Survey. Preprint at https://doi.org/10.48550/arXiv.2312.10997 (2024).
43. Peeperkorn, M., Kouwenhoven, T., Brown, D. & Jordanous, A. Is Temperature the Creativity Parameter of Large Language Models? Preprint at http://arxiv.org/abs/2405.00492 (2024).
44. Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
45. Min, S. et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Preprint at https://doi.org/10.48550/arXiv.2202.12837 (2022).
46. langchain/docs/docs/introduction.mdx at master · langchain-ai/langchain. GitHub https://github.com/langchain-ai/langchain/blob/master/docs/docs/introduction.mdx.
47. McKinney, W. pandas: a Foundational Python Library for Data Analysis and Statistics. Python High Perform. Sci. Comput. (2011).
48. Installation — pingouin 0.5.5 documentation. https://pingouin-stats.org/build/html/index.html.
49. krippendorff: Fast computation of the Krippendorff’s alpha measure.
50. Waskom, M. L. seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
51. Matplotlib — Visualization with Python. https://matplotlib.org/.
52. re — Regular expression operations. Python documentation https://docs.python.org/3/library/re.html.
53. Yang, T. et al. PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection. Preprint at https://doi.org/10.48550/arXiv.2310.20256 (2023).
54. Chen, Z., Lu, Y. & Wang, W. Y. Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting. Preprint at https://doi.org/10.48550/arXiv.2310.07146 (2023).
55. Wu, C.-K., Chen, W.-L. & Chen, H.-H. Large Language Models Perform Diagnostic Reasoning. Preprint at https://doi.org/10.48550/arXiv.2307.08922 (2023).

Tables

Table 1.
Demographic and Clinical Characteristics of Crisis Helpline Users Stratified by Suicide Risk Level (N=100), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30

| NGASRᵃ Risk Level | Age in years (Mean ± SD) | NGASRᵃ Sum Score (Mean ± SD) | Number of Counselor Messages (Mean ± SD) | Number of Chatter Messages (Mean ± SD) | Number of Counseling Sessions (Mean ± SD) |
|---|---|---|---|---|---|
| Low | 16.8 ± 2.66 | 2.12 ± 1.58 | 143.32 ± 208.68 | 206.52 ± 414.45 | 16.00 ± 26.00 |
| Moderate | 16.0 ± 2.25 | 6.32 ± 1.18 | 403.96 ± 769.45 | 512.00 ± 1071.12 | 40.00 ± 75.59 |
| High | 16.83 ± 2.79 | 10.16 ± 0.86 | 506.50 ± 781.38 | 612.95 ± 1004.71 | 47.62 ± 68.96 |
| Very High | 17.32 ± 2.71 | 14.56 ± 2.39 | 571.20 ± 624.92 | 643.48 ± 723.56 | 49.44 ± 57.74 |

Note: Values presented as Mean ± Standard Deviation (SD) for age, NGASRᵃ sum scores, message counts, and session counts.
ᵃ NGASR = Nurses Global Assessment of Suicide Risk

Table 2. Human Inter-Rater Reliability Analysis Across Items and Risk Levels (N=400 ratings), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30

| Metric Type | Low Risk | Moderate Risk | High Risk | Very High Risk | Sum Score |
|---|---|---|---|---|---|
| Human Reliability | 0.91 [0.85, 0.97] | 0.67 [0.57, 0.78] | 0.63 [0.51, 0.74] | 0.76 [0.66, 0.87] | -0.02 [-0.16, 0.11] |

Note: Values represent Krippendorff's α coefficients shown as Mean with 95% Confidence Intervals in brackets. Analysis based on ratings from 4 independent clinical experts. Perfect agreement cases coded as 1.0. Negative values indicate systematic disagreement.

Table 3. LLMᵃ Inter-Rating Reliability and Observer Agreement of Risk Levels and Sum Score Compared Across Prompting Style and Temperature (N=48,000 per configuration), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30

| Risk Level | Temp | Metric | Chain of Thought | Few-Shot | Zero-Shot |
|---|---|---|---|---|---|
| Low | 0 | LLM Reliability | 0.97 [0.93, 1.01] | 0.98 [0.95, 1.02] | 0.96 [0.91, 1.01] |
| Low | 0 | Observer Agreement | 0.72 [0.62, 0.83] | 0.78 [0.68, 0.88] | 0.39 [0.24, 0.53] |
| Low | 0.5 | LLM Reliability | 0.75 [0.65, 0.85] | 0.93 [0.87, 0.99] | 0.80 [0.70, 0.91] |
| Low | 0.5 | Observer Agreement | 0.69 [0.58, 0.81] | 0.78 [0.68, 0.89] | 0.42 [0.27, 0.57] |
| Low | 1 | LLM Reliability | 0.61 [0.50, 0.72] | 0.87 [0.80, 0.95] | 0.77 [0.66, 0.87] |
| Low | 1 | Observer Agreement | 0.67 [0.54, 0.79] | 0.78 [0.67, 0.88] | 0.41 [0.26, 0.56] |
| Moderate | 0 | LLM Reliability | 0.98 [0.94, 1.01] | 0.97 [0.94, 1.01] | 0.96 [0.91, 1.01] |
| Moderate | 0 | Observer Agreement | 0.33 [0.19, 0.48] | 0.33 [0.18, 0.47] | 0.30 [0.15, 0.45] |
| Moderate | 0.5 | LLM Reliability | 0.40 [0.28, 0.53] | 0.83 [0.74, 0.92] | 0.78 [0.68, 0.88] |
| Moderate | 0.5 | Observer Agreement | 0.23 [0.06, 0.41] | 0.31 [0.16, 0.46] | 0.25 [0.10, 0.40] |
| Moderate | 1 | LLM Reliability | 0.27 [0.15, 0.39] | 0.55 [0.42, 0.69] | 0.68 [0.56, 0.79] |
| Moderate | 1 | Observer Agreement | 0.20 [0.03, 0.37] | 0.30 [0.13, 0.46] | 0.26 [0.10, 0.42] |
| High | 0 | LLM Reliability | 0.94 [0.89, 1.00] | 0.96 [0.92, 1.01] | 1.00 [--, --] |
| High | 0 | Observer Agreement | 0.55 [0.41, 0.70] | 0.53 [0.39, 0.67] | 0.67 [0.55, 0.80] |
| High | 0.5 | LLM Reliability | 0.58 [0.46, 0.70] | 0.80 [0.70, 0.90] | 0.95 [0.90, 1.00] |
| High | 0.5 | Observer Agreement | 0.34 [0.17, 0.51] | 0.47 [0.30, 0.64] | 0.66 [0.53, 0.79] |
| High | 1 | LLM Reliability | 0.50 [0.38, 0.62] | 0.49 [0.35, 0.62] | 0.88 [0.80, 0.96] |
| High | 1 | Observer Agreement | 0.35 [0.17, 0.52] | 0.34 [0.15, 0.52] | 0.65 [0.52, 0.78] |
| Very High | 0 | LLM Reliability | 0.95 [0.91, 1.00] | 0.97 [0.92, 1.01] | 1.00 [--, --] |
| Very High | 0 | Observer Agreement | 0.73 [0.61, 0.85] | 0.78 [0.67, 0.89] | 0.75 [0.64, 0.86] |
| Very High | 0.5 | LLM Reliability | 0.77 [0.68, 0.87] | 0.91 [0.85, 0.98] | 1.00 [--, --] |
| Very High | 0.5 | Observer Agreement | 0.74 [0.62, 0.86] | 0.71 [0.58, 0.83] | 0.75 [0.64, 0.86] |
| Very High | 1 | LLM Reliability | 0.69 [0.58, 0.80] | 0.80 [0.72, 0.89] | 1.00 [--, --] |
| Very High | 1 | Observer Agreement | 0.72 [0.60, 0.84] | 0.69 [0.57, 0.81] | 0.75 [0.64, 0.86] |
| Sum Score | 0 | LLM Reliability | 0.78 [0.66, 0.89] | 0.86 [0.76, 0.95] | 0.90 [0.81, 0.98] |
| Sum Score | 0 | Observer Agreement | -0.59 [-0.71, -0.47] | -0.58 [-0.65, -0.50] | -0.40 [-0.58, -0.22] |
| Sum Score | 0.5 | LLM Reliability | -0.18 [-0.30, -0.06] | 0.41 [0.25, 0.56] | 0.34 [0.19, 0.50] |
| Sum Score | 0.5 | Observer Agreement | -0.90 [-0.94, -0.85] | -0.76 [-0.83, -0.69] | -0.64 [-0.76, -0.51] |
| Sum Score | 1 | LLM Reliability | -0.36 [-0.41, -0.32] | -0.12 [-0.26, 0.01] | 0.11 [-0.04, 0.26] |
| Sum Score | 1 | Observer Agreement | -0.90 [-0.94, -0.86] | -0.85 [-0.94, -0.77] | -0.74 [-0.83, -0.64] |

Note: Values shown as Mean with 95% Confidence Intervals in brackets. LLM Reliability represents agreement among LLM ratings (llm_α); Observer Agreement represents regression-bias corrected agreement between human and LLM ratings (corrected_α). All metrics calculated using Krippendorff's α with perfect agreement coded as 1.0. Negative values indicate systematic disagreement.
ᵃ LLM = Large Language Model

Table 4. Balanced Accuracy, Sensitivity and Specificity per Operational Configuration Across Risk Levels (N=48,000 per configuration), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30

| Risk Level | Temp | Metric | Chain of Thought | Few-Shot | Zero-Shot |
|---|---|---|---|---|---|
| Low | 0 | Balanced Accuracy | 0.69 [0.62-0.79] | 0.71 [0.64-0.75] | 0.71 [0.67-0.79] |
| Low | 0 | Sensitivity | 0.50 [0.20-0.66] | 0.50 [0.35-0.77] | 0.93 [0.81-1.00] |
| Low | 0 | Specificity | 0.88 [0.84-0.90] | 0.92 [0.90-0.95] | 0.49 [0.42-0.61] |
| Low | 0.5 | Balanced Accuracy | 0.71 [0.60-0.86] | 0.68 [0.55-0.79] | 0.71 [0.67-0.77] |
| Low | 0.5 | Sensitivity | 0.56 [0.21-0.78] | 0.43 [0.16-0.63] | 0.93 [0.83-1.00] |
| Low | 0.5 | Specificity | 0.85 [0.81-0.92] | 0.92 [0.87-0.98] | 0.49 [0.41-0.56] |
| Low | 1 | Balanced Accuracy | 0.66 [0.57-0.74] | 0.69 [0.56-0.76] | 0.72 [0.70-0.78] |
| Low | 1 | Sensitivity | 0.56 [0.37-0.84] | 0.43 [0.30-0.56] | 0.93 [0.76-1.00] |
| Low | 1 | Specificity | 0.76 [0.71-0.83] | 0.95 [0.90-0.97] | 0.52 [0.46-0.60] |
| Moderate | 0 | Balanced Accuracy | 0.53 [0.41-0.67] | 0.54 [0.49-0.59] | 0.44 [0.33-0.51] |
| Moderate | 0 | Sensitivity | 0.40 [0.29-0.58] | 0.42 [0.26-0.55] | 0.25 [0.15-0.47] |
| Moderate | 0 | Specificity | 0.66 [0.59-0.69] | 0.67 [0.58-0.74] | 0.63 [0.56-0.69] |
| Moderate | 0.5 | Balanced Accuracy | 0.51 [0.43-0.58] | 0.56 [0.49-0.65] | 0.44 [0.40-0.60] |
| Moderate | 0.5 | Sensitivity | 0.44 [0.36-0.54] | 0.46 [0.31-0.61] | 0.25 [0.08-0.34] |
| Moderate | 0.5 | Specificity | 0.58 [0.53-0.72] | 0.66 [0.59-0.73] | 0.63 [0.58-0.69] |
| Moderate | 1 | Balanced Accuracy | 0.49 [0.40-0.56] | 0.55 [0.48-0.57] | 0.43 [0.34-0.45] |
| Moderate | 1 | Sensitivity | 0.32 [0.29-0.49] | 0.46 [0.37-0.57] | 0.25 [0.12-0.47] |
| Moderate | 1 | Specificity | 0.66 [0.60-0.76] | 0.64 [0.56-0.70] | 0.61 [0.54-0.73] |
| High | 0 | Balanced Accuracy | 0.51 [0.42-0.58] | 0.59 [0.47-0.72] | 0.49 [0.45-0.52] |
| High | 0 | Sensitivity | 0.23 [0.09-0.33] | 0.48 [0.33-0.65] | 0.05 [0.00-0.11] |
| High | 0 | Specificity | 0.80 [0.78-0.86] | 0.70 [0.62-0.74] | 0.93 [0.89-0.97] |
| High | 0.5 | Balanced Accuracy | 0.47 [0.43-0.53] | 0.66 [0.52-0.70] | 0.49 [0.44-0.50] |
| High | 0.5 | Sensitivity | 0.14 [0.05-0.17] | 0.57 [0.42-0.76] | 0.05 [0.00-0.12] |
| High | 0.5 | Specificity | 0.80 [0.76-0.86] | 0.74 [0.66-0.82] | 0.93 [0.88-0.97] |
| High | 1 | Balanced Accuracy | 0.49 [0.39-0.54] | 0.67 [0.55-0.74] | 0.48 [0.44-0.55] |
| High | 1 | Sensitivity | 0.23 [0.05-0.27] | 0.62 [0.50-0.73] | 0.05 [0.00-0.22] |
| High | 1 | Specificity | 0.74 [0.69-0.82] | 0.71 [0.66-0.82] | 0.91 [0.85-0.95] |
| Very High | 0 | Balanced Accuracy | 0.60 [0.52-0.66] | 0.61 [0.58-0.65] | 0.53 [0.52-0.58] |
| Very High | 0 | Sensitivity | 0.30 [0.17-0.43] | 0.31 [0.09-0.53] | 0.06 [0.00-0.10] |
| Very High | 0 | Specificity | 0.89 [0.81-0.92] | 0.92 [0.86-0.96] | 1.00 [1.00-1.00] |
| Very High | 0.5 | Balanced Accuracy | 0.58 [0.55-0.66] | 0.60 [0.58-0.67] | 0.53 [0.52-0.58] |
| Very High | 0.5 | Sensitivity | 0.27 [0.15-0.44] | 0.28 [0.12-0.43] | 0.06 [0.03-0.15] |
| Very High | 0.5 | Specificity | 0.89 [0.84-0.95] | 0.92 [0.88-0.94] | 1.00 [1.00-1.00] |
| Very High | 1 | Balanced Accuracy | 0.61 [0.55-0.64] | 0.61 [0.56-0.65] | 0.53 [0.50-0.56] |
| Very High | 1 | Sensitivity | 0.27 [0.15-0.37] | 0.28 [0.28-0.46] | 0.06 [0.00-0.11] |
| Very High | 1 | Specificity | 0.95 [0.91-0.98] | 0.93 [0.88-0.95] | 1.00 [1.00-1.00] |

Note: Values shown as Mean with 95% Confidence Intervals in brackets. Balanced Accuracy represents the mean of sensitivity and specificity. Best values per metric within each prompting style and temperature combination are shown in bold. Temperature affects model output randomness (0 = deterministic, 1 = most random).
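The classification metrics reported in Table 4 can be illustrated with a short sketch that derives sensitivity, specificity, and balanced accuracy from binary labels. This is a minimal example under stated assumptions: the function name and the toy labels are hypothetical, the one-vs-rest framing (1 = transcript belongs to the risk category in question) is assumed for illustration, and the study's own computation, including its confidence intervals, is not reproduced here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, balanced accuracy, and plain accuracy for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / len(y_true),
    }


# Toy data: 4 transcripts in the target risk category, 6 outside it.
# The hypothetical rater finds only one of the four positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

In the toy data, plain accuracy is 0.70 although only one of four positive cases is detected (sensitivity 0.25, specificity 1.00, balanced accuracy 0.625). The same arithmetic applies to the zero-shot row for very high risk at temperature 0, where (0.06 + 1.00) / 2 = 0.53.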
J.T. is employed by and receives a salary from krisenchat. krisenchat had no impact on the design of this study and did not influence the collection, execution, analyses, interpretation of the data, or the decision to submit the article/contribution for publication. L.K. and Z.E. have no competing interests. GM received funding from the Stanley Thomas Johnson Stiftung & Gottfried und Julia Bangerter-Rhyner-Stiftung under projects no. PC 28/17 and PC 05/18, from Gesundheitsförderung Schweiz under project no. 18.191/K50001, from the Swiss Heart Foundation under project no. FF21101, from the Research Foundation of the International Psychoanalytic University (IPU) Berlin under projects no. 5087 and 5217, from the Swiss National Science Foundation (SNSF) under project no. 100014_135328, from the German Federal Ministry of Education and Research under budget item 68606 in the context of an evaluation project conducted amongst others in collaboration with Krisenchat, from the Hasler Foundation under project no. 23004, in the context of a Horizon Europe project from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00094, and from Wings Health in the context of a proof-of-concept study. GM is a co-founder and shareholder of Therayou AG, active in digital and blended mental healthcare. GM receives royalties from publishing companies as author, including a book published by Springer, and an honorarium from Lundbeck for speaking at a symposium. Furthermore, GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator ('Selbsterfahrungsleiter'), and for postgraduate training of psychotherapists and supervisors. We used artificial intelligence (AI)-based tools, including Claude and ChatGPT, to support manuscript preparation. Further, we used publicly available search technologies, which we recognize likely utilise AI capabilities. 
We confirm that the contributions of AI were strictly in an assistive capacity. AI was not involved in conceptual tasks. Human oversight was continuously employed to ensure the accuracy of content and address any ethical concerns. Other authors have no competing interest.

Supplementary Files
MultimediaAppendixLargeLanguageModelsinAutomatedSuicideRiskFactorMonitoringAComparativeStudyofAIandHumanRatingsUsingLLMBasedChatAgentsonanObservationalSuicideScale.docx

Editorial decision: Revision requested 10 Jun, 2025
Reviews received at journal 29 May, 2025
Reviews received at journal 25 May, 2025
Reviews received at journal 21 May, 2025
Reviewers agreed at journal 09 May, 2025
Reviewers agreed at journal 09 May, 2025
Reviewers agreed at journal 08 May, 2025
Reviewers agreed at journal 08 May, 2025
Reviewers invited by journal 08 May, 2025
Editor assigned by journal 08 May, 2025
Editor invited by journal 22 Mar, 2025
Submission checks completed at journal 21 Mar, 2025
First submitted to journal 12 Mar, 2025
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-6210376","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":454340811,"identity":"3212c485-1d07-4938-b25e-4a3af84955e1","order_by":0,"name":"Julia Thomas","email":"","orcid":"","institution":"University of Basel","correspondingAuthor":false,"prefix":"","firstName":"Julia","middleName":"","lastName":"Thomas","suffix":""},{"id":454340812,"identity":"85be21b8-5264-496d-a469-681cb33686fa","order_by":1,"name":"Zohar Elyoseph","email":"","orcid":"","institution":"University of Haifa","correspondingAuthor":false,"prefix":"","firstName":"Zohar","middleName":"","lastName":"Elyoseph","suffix":""},{"id":454340813,"identity":"98039e55-13ef-4a60-815a-75d5d91bdae0","order_by":2,"name":"Lars Kuchinke","email":"","orcid":"","institution":"International Psychoanalytic University (IPU) Berlin","correspondingAuthor":false,"prefix":"","firstName":"Lars","middleName":"","lastName":"Kuchinke","suffix":""},{"id":454340814,"identity":"ea103ac9-8ce7-4228-b371-0c2098e28794","order_by":3,"name":"Gunther 
Meinlschmidt","email":"","orcid":"","institution":"Trier University","correspondingAuthor":true,"prefix":"","firstName":"Gunther","middleName":"","lastName":"Meinlschmidt","suffix":""}],"badges":[],"createdAt":"2025-03-12 08:53:28","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6210376/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6210376/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41598-025-22402-7","type":"published","date":"2025-11-10T15:57:08+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":82588213,"identity":"6c77f496-b1a3-4565-a2de-3d877504289d","added_by":"auto","created_at":"2025-05-13 07:27:18","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":69702,"visible":true,"origin":"","legend":"\u003cp\u003eMethodological Framework: Data Processing Pipeline for Human-LLMa Comparison in NGASRb Suicide Risk Assessment, assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003cp\u003ea LLM = Large Language Model\u003c/p\u003e\n\u003cp\u003eb NGASR = Nurses Global Assessment Scale of 
Suicide\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/2183bb9b296853f936f5ac83.png"},{"id":82588216,"identity":"4048a6d6-e022-42ed-aade-d013877d2a9b","added_by":"auto","created_at":"2025-05-13 07:27:18","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":206444,"visible":true,"origin":"","legend":"\u003cp\u003eVignettes of Different Prompting Styles for LLMa-based Suicide Risk Assessment: A) Zero-shot, B) Chain-of-thought, and C) Few-shot Prompting Examples, showing template structure and example interactions for each approach (N=3 prompting styles), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003cp\u003ea LLM = Large Language Model\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/e758050babdfabdd7a32cd18.png"},{"id":82588214,"identity":"49f12b92-da8f-4700-9016-0d2ce60afe66","added_by":"auto","created_at":"2025-05-13 07:27:18","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":68457,"visible":true,"origin":"","legend":"\u003cp\u003eComparative Analysis of LLMa Inter-Rating Reliability and Human-to-LLMa Observer Agreement Across Temperature Settings (0, 0.5, 1) and Prompting Styles (zero-shot, few-shot, chain-of-thought), measured using Krippendorff's α and regression bias corrected Krippendorff's α (N=48,000 per configuration), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003cp\u003ea LLM = Large Language 
Model\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/09268b69ee67aa1124c24cbc.png"},{"id":82588215,"identity":"701491ee-42e9-4909-8c23-d3586d55ab88","added_by":"auto","created_at":"2025-05-13 07:27:18","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":116028,"visible":true,"origin":"","legend":"\u003cp\u003eRegression Bias Corrected Observer Agreement Values Comparing Human (n=4) and LLMa Ratings (N=30) Across Prompting Styles at Temperature 0, Panel A: Aggregated per Item, Panel B: Aggregated per NGASRb Risk Level (N=2,700 each), measured with Krippendorff's α, values shown as Mean and 95% Confidence Interval, assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003cp\u003ea LLM = Large Language Model\u003c/p\u003e\n\u003cp\u003eb NGASR = Nurses Global Assessment of Suicide Risk Scale\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/e8620ffeeda2450940b45f66.png"},{"id":82588217,"identity":"68b64e97-880c-4120-ad94-b3cc6ffd2a79","added_by":"auto","created_at":"2025-05-13 07:27:18","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":104486,"visible":true,"origin":"","legend":"\u003cp\u003eBalanced Accuracy Values Comparing Human (n=4) and LLMa Ratings (N=30) Across Prompting Styles at Temperature 0, Panel A: Aggregated per Item, Panel B: Aggregated per NGASRb Risk Level (N=2,700 each), values shown as Mean and 95% Confidence Interval, assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003cp\u003ea LLM = Large Language Model\u003c/p\u003e\n\u003cp\u003eb NGASR = Nurses Global Assessment of Suicide 
Risk Scale\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/1c94e2012f39c90ee92be0c5.png"},{"id":96104964,"identity":"3015ebba-cc8d-4699-85f9-45d89145115f","added_by":"auto","created_at":"2025-11-17 16:05:11","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1774694,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/04ab1f5d-99fc-4646-9ebc-a26354040b35.pdf"},{"id":82588218,"identity":"025bf184-21ef-4c21-b7d6-1f6a146020a9","added_by":"auto","created_at":"2025-05-13 07:27:18","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":819852,"visible":true,"origin":"","legend":"","description":"","filename":"MultimediaAppendixLargeLanguageModelsinAutomatedSuicideRiskFactorMonitoringAComparativeStudyofAIandHumanRatingsUsingLLMBasedChatAgentsonanObservationalSuicideScale.docx","url":"https://assets-eu.researchsquare.com/files/rs-6210376/v1/c43c0da00e3c7cb5779c23be.docx"}],"financialInterests":"Competing interest reported. J.T is employed and receives a salary from krisenchat. krisenchat had no impact on the design of this study and did not influence the collection, execution, analyses, interpretation of the data, or the decision to submit the article/contribution for publication.\n\nL.K. and Z.E. have no competing interests.\n\nGM received funding from the Stanley Thomas Johnson Stiftung \u0026 Gottfried und Julia Bangerter-Rhyner-Stiftung under projects no. PC 28/17 and PC 05/18, from Gesundheitsförderung Schweiz under project no. 18.191/K50001, from the Swiss Heart Foundation under project no. FF21101, from the Research Foundation of the International Psychoanalytic University (IPU) Berlin under projects no. 5087 and 5217, from the Swiss National Science Foundation (SNSF) under project no. 
100014_135328, from the German Federal Ministry of Education and Research under budget item 68606 in the context of an evaluation project conducted amongst others in collaboration with Krisenchat, from the Hasler Foundation under project no. 23004, in the context of a Horizon Europe project from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00094, and from Wings Health in the context of a proof-of-concept study. GM is a co-founder and shareholder of Therayou AG, active in digital and blended mental healthcare. GM receives royalties from publishing companies as author, including a book published by Springer, and an honorarium from Lundbeck for speaking at a symposium. Furthermore, GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator ('Selbsterfahrungsleiter'), and for postgraduate training of psychotherapists and supervisors.\nWe used artificial intelligence (AI)-based tools, including Claude and ChatGPT, to support manuscript preparation. Further, we used publicly available search technologies, which we recognize likely utilise AI capabilities. We confirm that the contributions of AI were strictly in an assistive capacity. AI was not involved in conceptual tasks. Human oversight was continuously employed to ensure the accuracy of content and address any ethical concerns.\n\nOther authors have no competing interest.","formattedTitle":"Automated Suicide Risk Factor Monitoring in Crisis Text Line Users: Comparative Study of AI and Human Ratings Using Large Language Models","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLarge Language Models (LLM) are neural networks that predict text sequences using conditional word probabilities \u003csup\u003e1\u003c/sup\u003e. Through self-supervised learning, they process language by optimizing billions of parameters for text prediction \u003csup\u003e2\u003c/sup\u003e. 
These \u0026ldquo;foundational models\u0026rdquo; execute various tasks based on textual instructions or \u0026ldquo;prompts\u0026rdquo; \u003csup\u003e3\u003c/sup\u003e. LLMs demonstrate emergent capabilities in processing and reasoning by identifying complex word associations and developing implicit knowledge through vast training \u003csup\u003e4\u0026ndash;6\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eLLMs show promise in clinical psychology due to their language processing capabilities \u003csup\u003e7\u003c/sup\u003e. They support medical information retrieval \u003csup\u003e8\u003c/sup\u003e, treatment decisions \u003csup\u003e9\u003c/sup\u003e, clinical summarization \u003csup\u003e10\u003c/sup\u003e, and patient education \u003csup\u003e11\u003c/sup\u003e. Their extensive training enables access to medical and psychological knowledge \u003csup\u003e11\u003c/sup\u003e. AI-powered applications offer 24/7 availability, streamlined processing \u003csup\u003e12\u003c/sup\u003e, and reduced administrative burden \u003csup\u003e3\u003c/sup\u003e. LLMs\u0026acute; contextual embeddings capture language nuances \u003csup\u003e13,14\u003c/sup\u003eand individual usage patterns \u003csup\u003e15\u003c/sup\u003e, while demonstrating cultural sensitivity in assessments\u0026nbsp;\u003csup\u003e16\u003c/sup\u003e.\u003cbr\u003e\u0026nbsp;\u003cbr\u003eAlthough not explicitly designed for psychological assessments, LLMs can be adapted through techniques like structured evaluation-based scoring \u003csup\u003e17\u003c/sup\u003e. However, ensuring high-quality benchmarks for clinical decision-making remains challenging due to \u0026ldquo;hallucination\u0026rdquo; - generating plausible but incorrect outputs \u003csup\u003e18\u003c/sup\u003e. 
This phenomenon, stemming from LLMs\u0026acute; probabilistic nature and lack of intrinsic truth understanding \u003csup\u003e19\u003c/sup\u003e, poses a significant obstacle to achieving clinical-grade accuracy and reliability in LLM-based assessments\u0026nbsp;\u003csup\u003e20\u003c/sup\u003e. This holds especially true in high risk domains of psychology and medicine.\u003cbr\u003e\u0026nbsp;\u003cbr\u003eOne of these domains is suicide prevention. Suicide remains a leading global mortality cause \u003csup\u003e21\u003c/sup\u003e, presenting an urgent need for cost-effective tools for prevention and monitoring\u0026nbsp;\u003csup\u003e22\u0026ndash;24\u003c/sup\u003e. Crisis text lines have demonstrated potential in suicide prevention by offering accessible support, making them an ideal testing ground for AI applications. The integration of LLM systems into mental health care holds particular promise for suicide prevention, where timely interventions can save lives.\u003cbr\u003e\u0026nbsp;\u003cbr\u003eRecent studies demonstrate LLMs\u0026acute; potential in psychiatric risk assessment. GPT-4 matched mental health professionals\u0026acute; assessment capabilities \u003csup\u003e16,25\u003c/sup\u003e, showed enhanced risk factor detection \u003csup\u003e26\u003c/sup\u003e, and analyzed suicide-related media content effectively \u003csup\u003e27\u003c/sup\u003e. GPT-4 also achieved 0.6 precision in suicide plan prediction versus clinicians\u0026acute; 0.7, with higher sensitivity (0.62 vs 0.53;\u003csup\u003e28\u003c/sup\u003e). LLM analysis of crisis hotlines achieved 76% F1 score, outperforming manual assessments and traditional deep learning \u003csup\u003e29\u003c/sup\u003e. However, validation studies with clinical data remain limited.\u003c/p\u003e\n\u003cp\u003eWhile these studies demonstrate promising potential, critical gaps remain in understanding how to achieve reliable and valid LLM-based clinical assessments. 
First, existing research has not systematically examined how different LLM configurations affect assessment reliability and validity. Second, while various prompting strategies exist, their comparative effectiveness for clinical assessment remains untested. Third, the impact of temperature settings on clinical judgment reliability is unexplored, particularly for high-stakes decisions. Finally, no studies have conducted item-level analyses to identify which clinical assessment components are most suitable for LLM evaluation. To address these gaps, we investigated retrieval-augmented generation (RAG) LLM agents for structured psychological suicide assessments by measuring agreement between human and LLM ratings. We examined 1) the impact of various prompting styles (zero-shot, few-shot, and chain-of-thought) on reliability and validity, 2) observer agreement and classification performance relative to human expert raters across different operational settings, and 3) item-specific metrics, to assess which individual items were most amenable to automated assessment.\u003c/p\u003e"},{"header":"Methods","content":"\u003ch2\u003e2.1 \u0026nbsp; \u0026nbsp;Study Design\u003c/h2\u003e\n\u003cp\u003eWe present the study design in Figure 1. The study analyzed chat transcripts from the German crisis text line krisenchat. Four expert raters independently scored 16 items of the NGASR scale (Cutcliffe \u0026amp; Barker, 2004) to assess suicide risk. An LLM agent generated corresponding ratings using varied temperature values and prompting styles. We compared human and AI evaluations through interrater reliability, observer agreement, and classification metrics across operational settings.\u003c/p\u003e\n\u003ch2\u003e2.2 Data Preparation\u003c/h2\u003e\n\u003cp\u003eThis study analyzed chat transcripts from krisenchat, a German preclinical crisis intervention service for individuals up to 25 years old \u003csup\u003e30\u003c/sup\u003e. 
The data comprised counseling sessions conducted between 2021-11-30 and 2022-04-30, with transcripts representing complete counseling histories.\u003c/p\u003e\n\u003cp\u003eSample Selection and Stratification\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFrom an initial pool of 439 labeled cases, we selected 100 cases using stratified random sampling to ensure balanced representation across NGASR-assessed risk levels. The sample was equally distributed with 25 cases in each risk category: low (\u0026lt; 4), moderate (5-8), high (9-12), and very high (\u0026gt; 12). Study participation was restricted to female participants with a minimum age of 14 years who were seeking help for themselves, excluding cases of help-seeking for others as well as male and diverse gender cases.\u003c/p\u003e\n\u003cp\u003eData Processing\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTo maintain internal validity, transcripts were preserved without modifications. Risk levels were determined using majority-voted NGASR sum scores from four independent clinical experts. For binary classifications (presence/absence of risk factors), a threshold of greater than 50% agreement among human raters or LLM ratings was used to establish positive items; a 50:50 split resulted in a negative item.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003e2.3 Measures\u003c/h2\u003e\n\u003cp\u003eThe NGASR scale, developed by Cutcliffe and Barker \u003csup\u003e31\u003c/sup\u003e and translated into German by Kozel et al. \u003csup\u003e32,33\u003c/sup\u003e, is a structured 16-item questionnaire assessing evidence-based suicide risk factors, not individual suicide probability. 
The scale encompasses a comprehensive range of risk factors: hopelessness, recent stress events, hallucinations/delusions, depression, social withdrawal, suicidal intention, suicide plans, family psychiatric/suicide history, recent losses, psychotic disorder, widowhood, previous attempts, poor socioeconomic conditions, substance abuse, terminal illness, and multiple hospitalizations. Five items - hopelessness, depression, suicidal plans, recent losses, and previous attempts - carry triple weight in scoring due to their elevated predictive value. Total scores indicate risk levels categorized as low (4 or below), moderate (5-8), high (9-11), and very high (12 and above). The German validation study demonstrated strong psychometric properties, with median item-wise observer agreements of 0.64 in Cohen\u0026acute;s Kappa (K) and 0.85 in Gwet\u0026acute;s AC1(AC1), while sum score agreements reached 0.90 and 0.91 in absolute agreement of Intra-Class-Correlation (ICC) and consistency, respectively.\u003c/p\u003e\n\u003cp\u003eRating Procedure\u003c/p\u003e\n\u003cp\u003eFour independent expert raters from a specialized suicide and self-harm counseling unit \u003csup\u003e34\u003c/sup\u003econducted the clinical assessments. The raters underwent comprehensive training on NGASR items through panel ratings and group discussions using non-study cases prior to conducting assessments. Each rater independently evaluated the complete set of counseling transcripts across all NGASR items. Inter-rater agreement was evaluated using Krippendorff\u0026acute;s \u0026alpha; \u003csup\u003e35,36\u003c/sup\u003e. To maintain consistency and prevent observer drift, integrity discussions were conducted between rating sessions \u003csup\u003e37\u003c/sup\u003e, allowing raters to share insights and standardize their approach without modifying existing ratings. 
For analysis purposes, individual ratings were aggregated using majority voting, where agreement from more than 50% of raters established positive cases. Final sum scores and risk level assignments were calculated based on these aggregated ratings, incorporating the differential item weights specified in the NGASR manual.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003e2.4 LLM Framework and Implementation\u003c/h2\u003e\n\u003cp\u003eWe implemented a framework to reduce LLM hallucination using Mixtral 8x7B, which employs a sparse mixture-of-experts architecture to activate relevant model components for focused processing \u003csup\u003e38\u003c/sup\u003e. The framework converts conversations into numerical embeddings using an instructor-transformer model based on the T5 architecture \u003csup\u003e39\u003c/sup\u003e, enabling similarity comparisons via Euclidean distance \u003csup\u003e20,40,41\u003c/sup\u003e. Our RAG approach anchors LLM responses to conversation context \u003csup\u003e42\u003c/sup\u003e. Implementation parameters included 500-token chunks with 25% overlap, retrieval of the top 5 conversation chunks, and a 0.95 similarity threshold. We tested temperature settings of 0.0, 0.5, and 1.0 to control output randomness, with lower values producing more deterministic results \u003csup\u003e43,44\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eThe study explored three distinct prompting styles: zero-shot, few-shot, and chain of thought. Zero-shot prompting presented questions directly from the scale manual without examples, relying on the model\u0026acute;s pre-existing knowledge to interpret and rate counseling transcripts. Few-shot prompting enhanced contextual understanding by providing carefully selected positive and negative examples prior to the rating task, while avoiding potential answer bias through example selection \u003csup\u003e45\u003c/sup\u003e. 
Chain of thought (CoT) prompting encouraged structured clinical reasoning by requiring step-by-step articulation of the assessment process, enabling insight into the model\u0026acute;s decision-making approach. Refer to (Figure 2) for exemplary prompting style formulations.\u003c/p\u003e\n\u003cp\u003eEach prompting style incorporated a RAG context, role specification, and clear output requirements. The implemented framework can be represented as:\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eZero-Shot-Prompt = RAG\u003csub\u003e\u0026nbsp;context\u003c/sub\u003e + P\u003csub\u003erole\u0026nbsp;\u003c/sub\u003e+ P\u003csub\u003eoutput specification\u003c/sub\u003e + P\u003csub\u003equestion\u003c/sub\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eFew-Shot-Prompt= RAG\u003csub\u003e\u0026nbsp;context\u003c/sub\u003e + P\u003csub\u003erole\u0026nbsp;\u003c/sub\u003e+ P\u003csub\u003eoutput specification\u003c/sub\u003e + P\u003csub\u003eexamples\u003c/sub\u003e+ P\u003csub\u003equestion\u003c/sub\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eChain-of-Thought-Prompt = RAG\u003csub\u003e\u0026nbsp;context\u003c/sub\u003e + P\u003csub\u003erole\u0026nbsp;\u003c/sub\u003e+ P\u003csub\u003eoutput specification\u003c/sub\u003e + P\u003csub\u003equestion\u0026nbsp;\u003c/sub\u003e+ P\u003csub\u003eCOT-Instruction\u003c/sub\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eThe LLM generated 4,320 ratings per transcript (16 NGASR items \u0026times; 3 temperature settings \u0026times; 3 prompting styles \u0026times; 30 repetitions). We termed each prompting style and temperature combination an \u003cem\u003eoperational configuration\u003c/em\u003e. For each configuration, we aggregated individual item ratings through majority voting, requiring \u0026gt;50% agreement to establish positive cases. 
We then calculated risk levels and sum scores following the NGASR manual\u0026acute;s scoring rules.\u003c/p\u003e\n\u003ch2\u003e2.5 Statistical Analysis\u003c/h2\u003e\n\u003cp\u003eDescriptive Analysis\u003c/p\u003e\n\u003cp\u003eFor descriptive analysis, we characterized sociodemographic characteristics and service usage behaviors using frequencies for categorical variables, mean and standard deviation for normally distributed continuous variables. These statistics were stratified by risk level, with differences between risk levels evaluated using Chi-square tests for categorical variables and ANOVAs for normally distributed continuous variables.\u003c/p\u003e\n\u003cp\u003eReliability Analysis\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe assessed LLM measurement reliability of NGASR risk levels using Krippendorff\u0026acute;s \u0026alpha; coefficients, treating each binary risk level as an independent rater. Krippendorff\u0026acute;s \u0026alpha; accommodates multiple scale types (binary, ordinal, metric), enabling consistent comparison across NGASR items, risk levels, and sum scores. We used established agreement thresholds: perfect (1), substantial (\u0026ge;0.80), moderate (0.67-0.79), weak (0.60-0.66), and poor (\u0026lt;0.60). Negative values indicated systematic disagreement. Uncertainty was quantified through bootstrapping (1000 resamples) to compute 95% confidence intervals for \u0026alpha; values per operational configuration.\u003c/p\u003e\n\u003cp\u003eObserver Agreement Analysis\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eUsing individual item ratings, we calculated sum scores and risk levels for each LLM and human rating. To evaluate validity against human ratings, we employed Krippendorff\u0026acute;s \u0026alpha; coefficient with regression bias correction, accounting for nested rater groups of human raters and LLM ratings. 
This correction adjusts for the fact that overall agreement between groups is limited by within-group agreement levels, providing more accurate estimates of true inter-group agreement. Separate \u0026alpha; coefficients were computed across risk levels, and sum scores aggregated per operational configuration. The overall \u0026alpha; value was corrected for within-group agreement using:\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u0026alpha;_corrected = \u0026alpha;_observed + \u0026beta;(\u0026alpha;_expected - \u0026alpha;_observed)\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003ewhere \u0026alpha;_corrected represents the regression-bias-corrected Krippendorff\u0026acute;s \u0026alpha;, \u0026alpha;_observed is the originally calculated \u0026alpha;, \u0026alpha;_expected is the expected \u0026alpha; value under the null hypothesis (typically 0), and \u0026beta; represents the regression coefficient capturing the relationship between within-group and between-group agreement rates. This coefficient essentially determines how much the observed agreement should be adjusted based on within-group rating patterns. We quantified uncertainty through bootstrapping (1000 resamples) to compute 95% confidence intervals for individual \u0026alpha; values of risk levels per operational configuration.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClassification Performance Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe derived final ratings through majority voting (\u0026gt;50% agreement) from LLM outputs (30 ratings per item/chain/temperature combination) and human ratings (4 per item). NGASR sum scores were calculated and categorized into four risk levels. For each risk level, we computed binary classification metrics by comparing that level against all others combined (e.g., \u0026quot;high risk\u0026quot; vs. 
\u0026quot;not high risk\u0026quot;).\u003c/p\u003e\n\u003cp\u003eWe assessed validity against the human gold standard through balanced accuracy, sensitivity, and specificity. Balanced accuracy addresses imbalanced sample rates by measuring detection ability for both present and absent risk factors, crucial for rare but critical symptoms. Sensitivity measures the ability to detect present risk factors relative to the positive-case base rate, critical for identifying potential dangers. Specificity evaluates correct identification of true negatives relative to the negative-case base rate, important for avoiding false alarms and the costly, unnecessary treatment they can incur.\u003c/p\u003e\n\u003cp\u003ePerformance above respective base rates indicates meaningful discriminative ability, distinguishing true predictive performance from class distribution effects. We calculated 95% confidence intervals through bootstrapping (1000 resamples) for each operational configuration. Balanced accuracy values exceeding 0.5 indicate above-chance performance.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eItem Specific Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo evaluate automation potential across risk factors, we conducted item-specific analyses using deterministic model outputs (temperature 0) from different prompting approaches. Item observer agreement per prompting style was evaluated through Krippendorff\u0026acute;s \u0026alpha; coefficient with regression bias correction.\u003c/p\u003e\n\u003cp\u003eFinal item classifications were derived via majority voting for each prompting style and compared against human consensus ratings. 
We evaluated classification performance through balanced accuracy, sensitivity, and specificity metrics, comparing these against respective base rates to determine significant improvements over chance-level performance.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eError Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe analyzed cases where LLM ratings diverged from expected clinical reasoning through qualitative assessment. Our examination of chain-of-thought outputs revealed patterns in failed assessments. We analyzed both content and structure of the model\u0026apos;s clinical reasoning process, focusing on deviations from standard clinical judgment.\u003c/p\u003e\n\u003ch2\u003e2.6 \u0026nbsp; \u0026nbsp;Tools and Software\u003c/h2\u003e\n\u003cp\u003eAnalyses were conducted using Python 3.8 on a Google Cloud Platform Kubernetes cluster. A 5-bit quantized Mixtral7x8b model was deployed on a 24GB L4 GPU machine using Ollama. The workflow utilized LangChain\u003csup\u003e46\u003c/sup\u003e for LLM interaction and retrieval augmentation, Pandas\u003csup\u003e47\u003c/sup\u003e for data manipulation, the Pingouin\u003csup\u003e48\u003c/sup\u003e and Krippendorff\u003csup\u003e49\u003c/sup\u003e packages for statistical calculations, Seaborn\u003csup\u003e50\u003c/sup\u003e and Matplotlib\u003csup\u003e51\u003c/sup\u003e for visualizations, and the re module for regular expression string matching\u003csup\u003e52\u003c/sup\u003e.\u003c/p\u003e\n\u003ch2\u003e2.7 \u0026nbsp; \u0026nbsp;Ethical Considerations\u003c/h2\u003e\n\u003cp\u003eAll methods in this study were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by the Ethics Committee of the International Psychoanalytic University (IPU) Berlin (approval number: 2023_08). 
Informed consent was obtained from all subjects through Krisenchat\u0026apos;s terms of service, which explicitly state that user data may be used for research purposes without direct identification of individuals. All personally identifiable information was removed from chat transcripts during preprocessing. The study utilized existing data from the crisis helpline, and participants were not compensated as this was a secondary analysis of routine service data. Research was performed in accordance with the Declaration of Helsinki.\u003c/p\u003e\n\u003ch2\u003e2.8 \u0026nbsp; \u0026nbsp;Data Availability Statement\u003c/h2\u003e\n\u003cp\u003eThe datasets generated during and/or analysed during the current study are not publicly available and cannot be shared due to the highly sensitive and confidential nature of crisis helpline chat transcripts from vulnerable individuals, including minors who cannot provide consent for data sharing. These conversations frequently contain personal details and sensitive information regarding mental health and suicidal ideation. This restriction is necessary to protect participant privacy and confidentiality and to comply with ethical guidelines and data protection regulations, including the General Data Protection Regulation (GDPR). The nature of our Institutional Review Board approval and ethical framework for this research explicitly prohibits any sharing of this data beyond the approved research team. For questions about the methodological approach, the corresponding author J.T at
[email protected] may be contacted.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eDescriptive Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe analysis included 100 cases stratified by NGASR-assigned risk levels: low (\u0026lt; 4), moderate (5-8), high (9-12), and very high (\u0026gt;12), randomly sampled from 439 labeled cases. Chi-square tests indicated group differences in age, with very high risk cases showing higher overall age. See Table 1 for more detail. Analysis of demographic and interaction variables across risk levels revealed no significant differences. ANOVA tests yielded F-statistics of 1.088 for age (\u003cem\u003ep\u003c/em\u003e=0.358), 2.170 for counselor messages (\u003cem\u003ep\u003c/em\u003e=0.097), 1.396 for chatter messages (\u003cem\u003ep\u003c/em\u003e=0.249), and 1.634 for session count (\u003cem\u003ep\u003c/em\u003e=0.187), suggesting consistency across risk levels.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eReliability Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHuman raters demonstrated varying reliability across risk levels, from high reliability in low-risk cases (α = 0.91 [0.85, 0.97]) to weaker agreement in high-risk assessments (α = 0.63 [0.51, 0.74]). LLM reliability analysis revealed distinct patterns across risk levels and prompting approaches. For low-risk cases, few-shot prompting at temperature 0 achieved the highest reliability (α = 0.98 [0.95, 1.02]), exceeding human reliability. Zero-shot maintained perfect reliability (α = 1.00) for high and very high risk levels, surpassing human agreement (α = 0.63 [0.51, 0.74] and α = 0.76 [0.66, 0.87] respectively). See Table 2 for a detailed breakdown of human observer agreement.\u003c/p\u003e\n\u003cp\u003eTemperature increase markedly affected reliability across prompting styles. 
Chain-of-thought showed the most pronounced degradation, with low-risk reliability dropping from α = 0.97 [0.93, 1.01] to α = 0.61 [0.50, 0.72] between temperature 0 and 1 (Figure 3). Few-shot demonstrated more stability, particularly in very high risk cases (α = 0.97 [0.92, 1.01] at temperature 0 to α = 0.80 [0.72, 0.89] at temperature 1). Sum scores showed systematic disagreement in both human (α = -0.02 [-0.16, 0.11]) and LLM ratings, with the effect amplifying at higher temperatures. For all LLM inter-rating reliability and observer agreement values, refer to Table 3.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eObserver Agreement Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn the observer agreement analysis (Figure 4, panel A), few-shot prompting at temperature 0 achieved the highest agreement for low-risk cases (α = 0.78 [0.68, 0.88]), while zero-shot showed the poorest agreement (α = 0.39 [0.24, 0.53]). Higher temperatures minimally affected few-shot performance but degraded chain-of-thought agreement from α = 0.72 [0.62, 0.83] to α = 0.67 [0.54, 0.79]. For moderate risk cases, all prompting styles demonstrated weak agreement, with chain-of-thought and few-shot at temperature 0 performing marginally better (α = 0.33 [0.19, 0.48]). Agreement declined with temperature increases, most notably in chain-of-thought, which dropped to α = 0.20 [0.03, 0.37] at temperature 1. In high-risk evaluations, zero-shot demonstrated the strongest agreement (α = 0.67 [0.55, 0.80]) at temperature 0, maintaining stability across temperatures, while few-shot and chain-of-thought showed marked degradation with increased temperatures, dropping to α = 0.34 [0.15, 0.52] and α = 0.35 [0.17, 0.52] respectively. For very high-risk cases, few-shot at temperature 0 achieved the highest agreement (α = 0.78 [0.67, 0.89]), with all styles maintaining relatively stable performance. 
Zero-shot demonstrated the most consistent agreement (α = 0.75 [0.64, 0.86]) across temperature settings. Lastly, sum scores revealed systematic disagreement across all configurations, with negative α values deteriorating at higher temperatures. Few-shot at temperature 0 showed the least disagreement (α = -0.58 [-0.65, -0.50]), while chain-of-thought at temperature 1 demonstrated the strongest disagreement (α = -0.90 [-0.94, -0.86]). See also Table 3.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClassification Performance\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe LLM framework demonstrated distinct performance patterns across risk levels (Figure 5, panel A). For low-risk cases, performance was consistently strong (BA: 0.71-0.72 [0.67-0.79]), with zero-shot prompting achieving the highest sensitivity (0.93 [0.81-1.00]) despite lower specificity (0.49 [0.42-0.61]). Few-shot prompting provided better balance with high specificity (0.92 [0.90-0.95]), maintaining stable performance across temperature settings. This is particularly valuable because it minimizes false positives, reducing unnecessary clinical interventions while maintaining screening efficiency. Meanwhile, performance declined substantially for moderate risk cases, approaching random classification. Few-shot prompting showed marginally better results (BA: 0.54 [0.49-0.59]) with balanced sensitivity (0.42 [0.26-0.55]) and specificity (0.67 [0.58-0.74]), while temperature variations had minimal impact on classification accuracy. Near-random classification and lowered sensitivity raise concerns, as missing these cases could prevent early intervention. For high-risk cases, few-shot prompting demonstrated superior performance (BA: 0.67 [0.55-0.74]), achieving better sensitivity (0.62 [0.50-0.73]) and specificity (0.71 [0.66-0.82]). In contrast, zero-shot's poor sensitivity (0.05 [0.00-0.11]) poses substantial clinical risk, despite high specificity (0.93 [0.89-0.97]). 
Lastly, for very high-risk cases, few-shot prompting achieved the best balanced accuracy (BA: 0.61 [0.58-0.65]), maintaining moderate sensitivity (0.31 [0.09-0.53]) and high specificity (0.92 [0.86-0.96]). Chain-of-thought showed similar performance (BA: 0.60 [0.52-0.66]) but lower sensitivity (0.30 [0.17-0.43]). Zero-shot performed worst with perfect specificity (1.00 [1.00-1.00]) but negligible sensitivity (0.06 [0.00-0.10]), making it clinically unsuitable for severe risk assessment where missed cases have the highest potential consequences. See Table 4 for a detailed breakdown of all values.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eItem Specific Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eObserver agreement varied across NGASR items at temperature 0, with distinct patterns for different item types (Figure 4, Panel A). Behaviorally-anchored items showed highest agreement: hearing voices achieved near-perfect agreement (human α = 1.00, chain-of-thought α = 0.95 95% CI: [0.90-1.00], few-shot α = 0.91 95% CI: [0.84-0.98], zero-shot α = 0.97 95% CI: [0.93-1.01]). Items requiring clinical inference showed lower agreement: hopelessness assessment demonstrated poor agreement (human α = 0.62, chain-of-thought α = 0.24 95% CI: [0.06-0.41], few-shot α = 0.29 95% CI: [0.12-0.46], zero-shot α = 0.42 95% CI: [0.26-0.58]).\u003c/p\u003e\n\u003cp\u003eClassification metrics revealed similar patterns (Figure 5, Panel B). Behavioral items showed strong performance: hearing voices achieved high balanced accuracy with few-shot prompting (BA = 0.97 95% CI: [0.94-0.99]). Complex clinical items performed near random: social withdrawal showed BA = 0.62 95% CI: [0.47-0.77] despite high human reliability (α = 0.92). Few-shot prompting achieved highest balanced accuracy for suicide ideation (BA = 0.80 95% CI: [0.69-0.89]).\u003c/p\u003e\n\u003cp\u003eSensitivity varied by item type and prompting style. 
Few-shot excelled with behavioral items (hearing voices: 1.00 95% CI: [1.00-1.00]), while zero-shot struggled with suicide assessment (suicide plan: 0.04 95% CI: [0.00-0.11]). All prompting styles maintained high specificity, particularly for observable factors (hearing voices - chain-of-thought: 0.98 95% CI: [0.94-1.00], few-shot: 0.93 95% CI: [0.88-0.98], zero-shot: 0.99 95% CI: [0.96-1.00]).\u003c/p\u003e\n\u003cp\u003eRegarding the tradeoff between sensitivity and specificity, zero-shot excelled in specificity but struggled with sensitivity, particularly for suicide-related items. Few-shot achieved the most balanced trade-off, maintaining good sensitivity without sacrificing specificity. Chain-of-thought showed moderate performance in both metrics but with less extreme trade-offs. This suggests that improvements in sensitivity often came at minimal cost to specificity, particularly for few-shot prompting.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eError Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOur error analysis revealed critical inconsistencies and logical failures in clinical reasoning, even under identical conditions. Using chain-of-thought prompting across temperatures, the model not only provided contradictory assessments but also demonstrated fundamental logical errors in clinical judgment. In one striking example, the model concluded: \"while there are indications of suicidal thoughts [...] there is no explicit expression of current suicidal ideation\" despite previously noting \"the patient confirms having a plan for suicide.\" This represents a severe logical error, as the presence of a suicide plan necessarily implies suicidal ideation. 
In another case with similar input, the model imposed hallucinated diagnostic criteria: \"while the patient frequently discusses their intense suicidal thoughts, they do not express any actual suicidal ideation in terms of having a plan or intent.\" Yet, given similar input under identical operational conditions, it correctly identified suicidal ideation based solely on thought content: \"the patient expresses suicidal ideation with an intensity of 65 out of 100[...]\" These inconsistencies and logical failures suggest that despite the appearance of structured clinical reasoning through step-by-step analysis, the model lacks a thorough understanding of the hierarchical and logical relationships between clinical concepts.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study evaluated the performance of an LLM for standardized psychological risk assessments using the Mixtral7x8b model under a RAG framework. We assessed the LLM\u0026apos;s ability to rate binary items from the Nurses\u0026apos; Global Assessment of Suicide Risk (NGASR) scale in German crisis text line transcripts, focusing on different prompting styles (zero-shot, few-shot, chain-of-thought) and temperature settings, which in combination we call operational configurations.\u003c/p\u003e\n\u003ch2\u003e5.1 Principal Results\u003c/h2\u003e\n\u003cp\u003eOur analysis revealed distinct patterns in LLM performance across reliability, observer agreement, and classification metrics. While LLMs demonstrated high internal consistency, particularly at temperature 0, this reliability did not translate to clinical validity. Zero-shot prompting achieved the highest internal consistency but showed poor alignment with human ratings, especially for complex clinical judgments. 
Few-shot prompting offered better balance, achieving strongest human-AI agreement for very high risk categories, though agreement remained only moderate overall.\u003c/p\u003e\n\u003cp\u003eClassification performance highlighted critical limitations in risk assessment. The framework performed best for low-risk cases but approached random classification for moderate risks. Few-shot prompting at temperature 0 provided the most balanced performance for initial screening, while zero-shot showed concerning patterns of high specificity but negligible sensitivity for high-risk cases - a limitation particularly problematic in suicide risk assessment where missing cases could have catastrophic consequences. Notably, sensitivity decreased with increasing risk levels across all prompting styles. While structured prompting improved surface-level metrics, detailed examination revealed persistent issues in clinical reasoning consistency. Given these limitations, current LLM capabilities fall short of requirements for fine-grained clinical assessment, necessitating mandatory clinical verification for moderate to high-risk cases and emphasizing that LLMs should augment rather than replace clinical judgment.\u003c/p\u003e\n\u003cp\u003eItem-level analysis revealed clear performance patterns based on item characteristics. The framework performed well on behaviorally-anchored items like hearing voices but struggled with items requiring complex clinical inference such as hopelessness assessment. Few-shot prompting showed advantages for suicide-related items, though performance remained below human agreement levels. 
These patterns suggest that LLM effectiveness varies significantly with the type of clinical judgment required, performing best when assessing concrete, observable factors rather than interpretative clinical concepts.\u003c/p\u003e\n\u003ch2\u003e5.2 Merits and Limitations\u003c/h2\u003e\n\u003cp\u003eOur study offered valuable ecological validity by analyzing real clinical data from a German crisis text line, though generalizability is limited by the narrow demographic scope (female youth) and potential language model biases in youth communication patterns.\u003c/p\u003e\n\u003cp\u003eThe systematic comparison of prompting styles and temperatures revealed reliability-performance trade-offs, but excluded temporal crisis dynamics and multimodal assessment factors. Our comprehensive evaluation framework included confidence intervals and multiple reliability metrics, though binary classification may oversimplify risk progression. Item-level analysis distinguished between behavioral and interpretative assessments, despite uneven base rates affecting discriminative ability measurement.\u003c/p\u003e\n\u003cp\u003eThe technical implementation featured state-of-the-art components but faced limitations in embedding quality variability and chunk size optimization. While demonstrating research feasibility, the reliance on high-performance GPUs limits practical scalability. Expert clinical ratings provided quality ground truth data, though rater diversity and expertise variations were not explored. German-specific cultural and linguistic nuances warrant further investigation.\u003c/p\u003e\n\u003cp\u003eThe occurrence of confident but incorrect LLM responses deserves deeper examination, as the nature of and reasons for hallucination were not the focus of this work. 
Overall, results reflect one specific implementation choice rather than inherent LLM capabilities, suggesting potential for alternative approaches.\u003c/p\u003e\n\u003cp\u003eAn important limitation of this study relates to the NGASR scale itself, which was not originally designed for youth populations. The scale\u0026apos;s applicability to adolescents may be limited by developmental considerations not accounted for in its original validation. Furthermore, several NGASR items provide minimal scoring instructions, creating inherent ambiguity that challenges both human raters and LLMs. Where human raters struggled to achieve consensus (particularly for moderate and high risk categories), the LLM similarly demonstrated lower performance. This pattern is mathematically expected given that regression bias corrected Krippendorff\u0026apos;s \u0026alpha; is dependent on human agreement levels, creating a ceiling effect on potential human-AI agreement. The strong performance observed in low and very high risk categories, contrasted with poorer results in moderate risk assessment, may therefore reflect inherent psychometric limitations of the scale rather than solely AI capability constraints.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e5.3 Comparison with Prior Work\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOur study advances the emerging field of LLM applications in psychological assessment through three key contributions: implementation of state-of-the-art prompting frameworks, extension into psychological rather than purely medical assessments, and validation on real-world clinical data. While previous research has demonstrated LLMs\u0026apos; potential in medical contexts, with Singhal et al.\u003csup\u003e11\u003c/sup\u003e achieving notable accuracy on MedQA exam questions using Flan-PaLM, psychological applications present unique challenges requiring specialized approaches. 
Our investigation bridges the gap between theoretical benchmarks and practical psychological assessments by implementing sophisticated prompting frameworks in mental health contexts. Recent developments in psychological applications of LLMs have shown promising directions but remained largely experimental. Yang et al.\u003csup\u003e53\u003c/sup\u003e developed the PsyCoT framework for personality trait detection, while Chen et al.\u003csup\u003e54\u003c/sup\u003e focused on cognitive distortion detection through their Diagnosis of Thought (DoT) framework. These approaches demonstrated LLMs\u0026apos; potential for psychological reasoning but were limited to specific domains. Our research extends these efforts by adapting structured prompting techniques to standardized suicide risk assessment, building particularly on Wu et al.\u0026apos;s\u003csup\u003e55\u003c/sup\u003e work on chain-of-thought prompting for diagnostic reasoning. A crucial distinction of our study lies in its use of authentic clinical data. While previous work, such as Blanco-Cuaresma\u0026apos;s\u003csup\u003e17\u003c/sup\u003e analysis of suicide risk in Reddit comments, relied on public social media data, our study utilized real crisis helpline transcripts. This represents a significant advance in ecological validity, as it evaluates LLM performance in the actual context where such systems might be deployed. This clinical dataset allowed us to assess not only technical performance but also practical applicability in authentic healthcare settings. The comprehensive evaluation of diverse prompting styles and hyperparameters on real clinical data offers unique insights into the practical challenges of implementing LLMs in mental health assessment. 
Our findings contribute vital understanding of both the potential and limitations of LLMs in psychological assessment, particularly in high-stakes domains like suicide risk evaluation.\u003c/p\u003e\n\u003cp\u003eOur findings both confirm and challenge previous research. While we confirm Yang et al.\u0026apos;s\u003csup\u003e53\u003c/sup\u003e observation that LLMs can engage in psychological reasoning, our error analysis reveals more severe limitations in clinical logic than previously reported. Similarly, while we support Wu et al.\u0026apos;s\u003csup\u003e55\u003c/sup\u003e finding that chain-of-thought prompting can improve reasoning transparency, we found it actually decreased reliability at higher temperatures - a crucial distinction for clinical applications. Unlike Blanco-Cuaresma\u0026apos;s\u003csup\u003e17\u003c/sup\u003e promising results with social media data, our analysis of clinical transcripts showed substantially lower performance, particularly for moderate risk cases, highlighting the challenges of real-world clinical assessment versus public data analysis.\u003c/p\u003e\n\u003ch2\u003e5.4 Clinical Implications\u003c/h2\u003e\n\u003cp\u003eOur findings identify three promising clinical applications for LLMs in psychological assessment. First, LLMs can serve as preliminary screening tools in high-volume clinical settings, supporting initial triage decisions. Second, they can function as decision support systems, providing structured evaluations to complement clinical judgment. Third, LLMs can help standardize assessment approaches across different clinical contexts, improving multi-site consistency.\u003c/p\u003e\n\u003cp\u003eCurrent implementations require specific conditions for optimal performance. Temperature settings and prompting styles significantly influence assessment reliability, necessitating careful calibration. 
LLM performance varies across clinical indicators, performing best with concrete behavioral symptoms rather than complex clinical judgments. Performance on critical risk factors, particularly suicide-related items, remains insufficient for autonomous clinical use.\u003c/p\u003e\n\u003cp\u003eAdvancing clinical viability requires enhanced prompting strategies for consistent reasoning, robust RAG mechanisms for diverse cases, and optimized parameters and validation protocols for human-AI agreement. Implementation must address ethical considerations including informed consent, data privacy, and regulatory compliance. While our study demonstrates one possible approach, alternative implementations may yield improved performance. However, any clinical applications must be developed with careful attention to both technical performance and ethical implications, particularly in high-stakes domains like suicide risk assessment.\u003c/p\u003e\n\u003ch2\u003e5.5 Future Directions\u003c/h2\u003e\n\u003cp\u003eOur findings indicate key priorities for advancing LLM applications in mental health assessment. We need mental health-specific LLMs that better capture psychological nuances, supported by open-source development for scientific replication. Current models show basic clinical reasoning capabilities but require specialized architectures and training. Future research should consider using assessment instruments with more precise operational definitions and better validated for youth populations when evaluating AI performance in psychological assessment. Moreover, robust validation frameworks are essential, as our error analysis revealed that standard metrics may mask reasoning failures. Future protocols must detect logical inconsistencies, ensure diagnostic concept hierarchies are understood, and validate criterion consistency. 
The temporal aspects of assessment also need attention, as current LLMs lack mechanisms to model mental state progression over time.\u003c/p\u003e\n\u003cp\u003ePractical challenges include developing privacy-preserving clinical datasets for domain adaptation and addressing cultural-linguistic variations in psychological expression. LLM decision interpretability requires investigation, particularly regarding hallucinated criteria.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e5.6 Final Conclusion\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study advances our understanding of LLM applications in psychological assessment through systematic evaluation of implementation parameters and real-world clinical data. While our findings demonstrate potential for supporting specific aspects of clinical work, particularly in initial screening and standardization of assessment procedures, they also reveal fundamental challenges in clinical reasoning that current implementations have yet to overcome. The observed pattern of decreasing sensitivity with increasing risk levels poses particular concerns for high-stakes clinical applications.\u003c/p\u003e\n\u003cp\u003eOur methodological framework, emphasizing comparison against base rates and comprehensive error analysis, provides valuable guidance for future evaluations of AI systems in clinical settings. The stark contrast between surface-level performance metrics and detailed reasoning analysis emphasizes the need for more sophisticated validation approaches in clinical AI research.\u003c/p\u003e\n\u003cp\u003eLooking forward, these findings suggest that advancing LLM applications in psychological assessment requires not just technical improvements, but fundamental reconsideration of how we implement and validate AI systems in clinical contexts. 
While current implementations are not ready for autonomous clinical application, they point toward promising directions for human-AI collaborative systems that leverage the strengths of both automated and human assessment.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConflict of Interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eJ.T is employed by and receives a salary from krisenchat. krisenchat had no impact on the design of this study and did not influence the collection, execution, analyses, interpretation of the data, or the decision to submit the article/contribution for publication.\u003c/p\u003e\n\u003cp\u003eGM received funding from the Stanley Thomas Johnson Stiftung \u0026amp; Gottfried und Julia Bangerter-Rhyner-Stiftung under projects no. PC 28/17 and PC 05/18, from Gesundheitsf\u0026ouml;rderung Schweiz under project no. 18.191/K50001, from the Swiss Heart Foundation under project no. FF21101, from the Research Foundation of the International Psychoanalytic University (IPU) Berlin under projects no. 5087 and 5217, from the Swiss National Science Foundation (SNSF) under project no. 100014_135328, from the German Federal Ministry of Education and Research under budget item 68606 in the context of an evaluation project conducted amongst others in collaboration with Krisenchat, from the Hasler Foundation under project no. 23004, in the context of a Horizon Europe project from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00094, and from Wings Health in the context of a proof-of-concept study. GM is a co-founder and shareholder of Therayou AG, active in digital and blended mental healthcare. GM receives royalties from publishing companies as author, including a book published by Springer, and an honorarium from Lundbeck for speaking at a symposium. 
Furthermore, GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator (\u0026apos;Selbsterfahrungsleiter\u0026apos;), and for postgraduate training of psychotherapists and supervisors.\u003c/p\u003e\n\u003cp\u003eL.K. and Z.E. have no competing interests.\u003c/p\u003e\n\u003cp\u003eWe used artificial intelligence (AI)-based tools, including Claude and ChatGPT, to support manuscript preparation. Further, we used publicly available search technologies, which we recognize likely utilise AI capabilities. We confirm that the contributions of AI were strictly in an assistive capacity. AI was not involved in conceptual tasks. Human oversight was continuously employed to ensure the accuracy of content and address any ethical concerns.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eJT led the project, developed the study concept in collaboration with the co-authors, performed the data analysis, wrote the original manuscript, and created the visualizations. GM supervised the project, contributed to conceptualization, provided methodological guidance, and critically reviewed and edited the manuscript. ZE provided expertise in artificial intelligence methods and contributed to manuscript review and editing. LK provided methodological expertise in statistical analysis and contributed to manuscript review and editing. All authors contributed to manuscript revision, read, and approved the submitted version.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThere was no funding for this study.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eSartori, G. \u0026amp; Orr\u0026ugrave;, G. Language models and psychological sciences. \u003cem\u003eFront. 
Psychol.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e14\u003c/strong\u003e, 1279317 (2023).\u003c/li\u003e\n \u003cli\u003eVaswani, A. \u003cem\u003eet al.\u003c/em\u003e Attention is All you Need. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 30 (Curran Associates, Inc., 2017).\u003c/li\u003e\n \u003cli\u003eBommasani, R. \u003cem\u003eet al.\u003c/em\u003e On the Opportunities and Risks of Foundation Models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2022).\u003c/li\u003e\n \u003cli\u003eWei, J. \u003cem\u003eet al.\u003c/em\u003e Emergent Abilities of Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2206.07682 (2022).\u003c/li\u003e\n \u003cli\u003eWei, J. \u003cem\u003eet al.\u003c/em\u003e Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2201.11903 (2023).\u003c/li\u003e\n \u003cli\u003eZhang, Z. \u003cem\u003eet al.\u003c/em\u003e Igniting Language Intelligence: The Hitchhiker\u0026rsquo;s Guide From Chain-of-Thought Reasoning to Language Agents. Preprint at https://doi.org/10.48550/arXiv.2311.11797 (2023).\u003c/li\u003e\n \u003cli\u003eKe, L., Tong, S., Cheng, P. \u0026amp; Peng, K. Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. Preprint at https://doi.org/10.48550/arXiv.2401.01519 (2024).\u003c/li\u003e\n \u003cli\u003eWiest, I. C. \u003cem\u003eet al.\u003c/em\u003e Privacy-preserving large language models for structured medical information retrieval. \u003cem\u003eNPJ Digit. Med.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e7\u003c/strong\u003e, 257 (2024).\u003c/li\u003e\n \u003cli\u003eStade, E. C. \u003cem\u003eet al.\u003c/em\u003e Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. \u003cem\u003eNpj Ment. 
Health Res.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e3\u003c/strong\u003e, 1\u0026ndash;12 (2024).\u003c/li\u003e\n \u003cli\u003eVan Veen, D. \u003cem\u003eet al.\u003c/em\u003e Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. \u003cem\u003eRes. Sq.\u003c/em\u003e rs.3.rs-3483777 (2023) doi:10.21203/rs.3.rs-3483777/v1.\u003c/li\u003e\n \u003cli\u003eSinghal, K. \u003cem\u003eet al.\u003c/em\u003e Large Language Models Encode Clinical Knowledge. Preprint at https://doi.org/10.48550/arXiv.2212.13138 (2022).\u003c/li\u003e\n \u003cli\u003eEr-Rays, Y. \u0026amp; M\u0026rsquo;dioud, M. ChatGPT in Healthcare Facilities: An Overview and Innovations in Technical Efficiency Analysis. SSRN Scholarly Paper at https://doi.org/10.2139/ssrn.4771070 (2024).\u003c/li\u003e\n \u003cli\u003eKjell, O. N. E., Sikstr\u0026ouml;m, S., Kjell, K. \u0026amp; Schwartz, H. A. Natural language analyzed with AI-based transformers predict traditional subjective well-being measures approaching the theoretical upper limits in accuracy. \u003cem\u003eSci. Rep.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e12\u003c/strong\u003e, 3918 (2022).\u003c/li\u003e\n \u003cli\u003eKjell, O. N. E., Kjell, K. \u0026amp; Schwartz, H. A. Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment. \u003cem\u003ePsychiatry Res.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e333\u003c/strong\u003e, 1\u0026ndash;12 (2024).\u003c/li\u003e\n \u003cli\u003eGanesan, A. V., Matero, M., Ravula, A. R., Vu, H. \u0026amp; Schwartz, H. A. Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality. \u003cem\u003eProc. Conf. Assoc. Comput. Linguist. North Am. Chapter Meet.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e2021\u003c/strong\u003e, 4515\u0026ndash;4532 (2021).\u003c/li\u003e\n \u003cli\u003eLevkovich, I., Shinan-Altman, S. \u0026amp; Elyoseph, Z. 
Can large language models be sensitive to culture suicide risk assessment? \u003cem\u003eJ. Cult. Cogn. Sci.\u003c/em\u003e (2024) doi:10.1007/s41809-024-00151-9.\u003c/li\u003e\n \u003cli\u003eBlanco-Cuaresma, S. Psychological Assessments with Large Language Models: A Privacy-Focused and Cost-Effective Approach. Preprint at https://doi.org/10.48550/arXiv.2402.03435 (2024).\u003c/li\u003e\n \u003cli\u003eJi, Z. \u003cem\u003eet al.\u003c/em\u003e Survey of Hallucination in Natural Language Generation. \u003cem\u003eACM Comput. Surv.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e55\u003c/strong\u003e, 1\u0026ndash;38 (2023).\u003c/li\u003e\n \u003cli\u003eBishop, J. M. Artificial Intelligence Is Stupid and Causal Reasoning Will Not Fix It. \u003cem\u003eFront. Psychol.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e11\u003c/strong\u003e, (2021).\u003c/li\u003e\n \u003cli\u003eAmatriain, X. Measuring and Mitigating Hallucinations in Large Language Models: A Multifaceted Approach. (2024).\u003c/li\u003e\n \u003cli\u003eAdolescent mortality ranking - top 5 causes (country). \u003cem\u003eWHO Data\u0026nbsp;\u003c/em\u003ehttps://platform.who.int/data/maternal-newborn-child-adolescent-ageing/indicator-explorer-new/mca/adolescent-mortality-ranking---top-5-causes-(country).\u003c/li\u003e\n \u003cli\u003eBernert, R. A. \u003cem\u003eet al.\u003c/em\u003e Artificial Intelligence and Suicide Prevention: A Systematic Review of Machine Learning Investigations. \u003cem\u003eInt. J. Environ. Res. Public. Health\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e17\u003c/strong\u003e, 5929 (2020).\u003c/li\u003e\n \u003cli\u003eLejeune, A. \u003cem\u003eet al.\u003c/em\u003e Artificial intelligence and suicide prevention: A systematic review. \u003cem\u003eEur. Psychiatry\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e65\u003c/strong\u003e, e19 (2022).\u003c/li\u003e\n \u003cli\u003eMenon, V. \u0026amp; Vijayakumar, L. 
Artificial intelligence-based approaches for suicide prediction: Hope or hype? \u003cem\u003eAsian J. Psychiatry\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e88\u003c/strong\u003e, 103728 (2023).\u003c/li\u003e\n \u003cli\u003eElyoseph, Z., Levkovich, I., Haber, Y. \u0026amp; Levi-Belz, Y. \u003cem\u003eUsing GenAI to Train Mental Health Professionals in Suicide Risk Assessment: Preliminary Findings\u003c/em\u003e. (2024). doi:10.1101/2024.07.17.24310579.\u003c/li\u003e\n \u003cli\u003eShinan-Altman, S., Elyoseph, Z. \u0026amp; Levkovich, I. The impact of history of depression and access to weapons on suicide risk assessment: a comparison of ChatGPT-3.5 and ChatGPT-4. \u003cem\u003ePeerJ\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e12\u003c/strong\u003e, (2024).\u003c/li\u003e\n \u003cli\u003eElyoseph, Z. \u003cem\u003eet al. Applying Language Models for Suicide Prevention: Evaluating News Article Adherence to WHO Reporting Guidelines\u003c/em\u003e. (2024). doi:10.21203/rs.3.rs-4180591/v1.\u003c/li\u003e\n \u003cli\u003eLee, C., Mohebbi, M., O\u0026rsquo;Callaghan, E. \u0026amp; Winsberg, M. Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study. \u003cem\u003eJMIR Ment. Health\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e11\u003c/strong\u003e, e58129 (2024).\u003c/li\u003e\n \u003cli\u003eChen, Y. \u003cem\u003eet al.\u003c/em\u003e Deep Learning and Large Language Models for Audio and Text Analysis in Predicting Suicidal Acts in Chinese Psychological Support Hotlines. Preprint at https://doi.org/10.48550/arXiv.2409.06164 (2024).\u003c/li\u003e\n \u003cli\u003eBaldofski, S. \u003cem\u003eet al.\u003c/em\u003e The Impact of a Messenger-Based Psychosocial Chat Counseling Service on Further Help-Seeking Among Children and Young Adults: Longitudinal Study. \u003cem\u003eJMIR Ment. 
Health\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, e43780 (2023).\u003c/li\u003e\n \u003cli\u003eCutcliffe, J. R. \u0026amp; Barker, P. The Nurses\u0026rsquo; Global Assessment of Suicide Risk (NGASR): developing a tool for clinical practice. \u003cem\u003eJ. Psychiatr. Ment. Health Nurs.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e11\u003c/strong\u003e, 393\u0026ndash;400 (2004).\u003c/li\u003e\n \u003cli\u003eKozel, B., Grieser, M., Rieder, P., Seifritz, E. \u0026amp; Abderhalden, C. Nurses\u0026rsquo; Global Assessment of Suicide Risk \u0026ndash; Skala (NGASR): Die Interrater-Reliabilit\u0026auml;t eines Instrumentes zur systematisierten pflegerischen Einsch\u0026auml;tzung der Suizidalit\u0026auml;t. \u003cem\u003eZ. F\u0026uuml;r Pflegewissenschaft Psych. Gesundh.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e1\u003c/strong\u003e, 17\u0026ndash;26 (2007).\u003c/li\u003e\n \u003cli\u003eKozel, B., Heged\u0026uuml;s, A., Dassen, T. \u0026amp; Abderhalden, C. Die Kriteriumsvalidit\u0026auml;t der deutschen Version der Nurses\u0026rsquo; Global Assessment of Suicide Risk Scale (NGASR-Scale). in 186\u0026ndash;191 (2012).\u003c/li\u003e\n \u003cli\u003eKohls, E. \u003cem\u003eet al.\u003c/em\u003e Suicidal Ideation Among Children and Young Adults in a 24/7 Messenger-Based Psychological Chat Counseling Service. (2022) doi:10.18452/24781.\u003c/li\u003e\n \u003cli\u003eKrippendorff, K. Computing Krippendorff\u0026rsquo;s Alpha-Reliability. (2011).\u003c/li\u003e\n \u003cli\u003eKrippendorff, K. Reliability in Content Analysis. \u003cem\u003eHum. Commun. Res.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e30\u003c/strong\u003e, 411\u0026ndash;433 (2004).\u003c/li\u003e\n \u003cli\u003eKazdin, A. E. Artifact, Bias, and Complexity of Assessment: The ABCs of Reliability. \u003cem\u003eJ. Appl. Behav. Anal.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, 141\u0026ndash;150 (1977).\u003c/li\u003e\n \u003cli\u003eJiang, A. Q. 
\u003cem\u003eet al.\u003c/em\u003e Mixtral of Experts. Preprint at https://doi.org/10.48550/arXiv.2401.04088 (2024).\u003c/li\u003e\n \u003cli\u003eSu, C. \u003cem\u003eet al.\u003c/em\u003e Machine learning for suicide risk prediction in children and adolescents with electronic health records. \u003cem\u003eTransl. Psychiatry\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, 1\u0026ndash;10 (2020).\u003c/li\u003e\n \u003cli\u003ePerković, G., Drobnjak, A. \u0026amp; Botički, I. Hallucinations in LLMs: Understanding and Addressing Challenges. in \u003cem\u003e2024 47th MIPRO ICT and Electronics Convention (MIPRO)\u003c/em\u003e 2084\u0026ndash;2088 (2024). doi:10.1109/MIPRO60963.2024.10569238.\u003c/li\u003e\n \u003cli\u003eHong, G. \u003cem\u003eet al.\u003c/em\u003e The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2404.05904 (2024).\u003c/li\u003e\n \u003cli\u003eGao, Y. \u003cem\u003eet al.\u003c/em\u003e Retrieval-Augmented Generation for Large Language Models: A Survey. Preprint at https://doi.org/10.48550/arXiv.2312.10997 (2024).\u003c/li\u003e\n \u003cli\u003ePeeperkorn, M., Kouwenhoven, T., Brown, D. \u0026amp; Jordanous, A. Is Temperature the Creativity Parameter of Large Language Models? Preprint at http://arxiv.org/abs/2405.00492 (2024).\u003c/li\u003e\n \u003cli\u003eAckley, D. H., Hinton, G. E. \u0026amp; Sejnowski, T. J. A learning algorithm for Boltzmann machines. \u003cem\u003eCogn. Sci.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e9\u003c/strong\u003e, 147\u0026ndash;169 (1985).\u003c/li\u003e\n \u003cli\u003eMin, S. \u003cem\u003eet al.\u003c/em\u003e Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Preprint at https://doi.org/10.48550/arXiv.2202.12837 (2022).\u003c/li\u003e\n \u003cli\u003elangchain/docs/docs/introduction.mdx at master \u0026middot; langchain-ai/langchain. 
\u003cem\u003eGitHub\u0026nbsp;\u003c/em\u003ehttps://github.com/langchain-ai/langchain/blob/master/docs/docs/introduction.mdx.\u003c/li\u003e\n \u003cli\u003eMcKinney, W. pandas: a Foundational Python Library for Data Analysis and Statistics. \u003cem\u003ePython High Perform. Sci. Comput.\u003c/em\u003e (2011).\u003c/li\u003e\n \u003cli\u003eInstallation \u0026mdash; pingouin 0.5.5 documentation. https://pingouin-stats.org/build/html/index.html.\u003c/li\u003e\n \u003cli\u003ekrippendorff: Fast computation of the Krippendorff\u0026rsquo;s alpha measure.\u003c/li\u003e\n \u003cli\u003eWaskom, M. L. seaborn: statistical data visualization. \u003cem\u003eJ. Open Source Softw.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e6\u003c/strong\u003e, 3021 (2021).\u003c/li\u003e\n \u003cli\u003eMatplotlib \u0026mdash; Visualization with Python. https://matplotlib.org/.\u003c/li\u003e\n \u003cli\u003ere \u0026mdash; Regular expression operations. \u003cem\u003ePython documentation\u0026nbsp;\u003c/em\u003ehttps://docs.python.org/3/library/re.html.\u003c/li\u003e\n \u003cli\u003eYang, T. \u003cem\u003eet al.\u003c/em\u003e PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection. Preprint at https://doi.org/10.48550/arXiv.2310.20256 (2023).\u003c/li\u003e\n \u003cli\u003eChen, Z., Lu, Y. \u0026amp; Wang, W. Y. Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting. Preprint at https://doi.org/10.48550/arXiv.2310.07146 (2023).\u003c/li\u003e\n \u003cli\u003eWu, C.-K., Chen, W.-L. \u0026amp; Chen, H.-H. Large Language Models Perform Diagnostic Reasoning. 
Preprint at https://doi.org/10.48550/arXiv.2307.08922 (2023).\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003e\u003cstrong\u003eTable 1.\u003c/strong\u003e Demographic and Clinical Characteristics of Crisis Helpline Users Stratified by Suicide Risk Level (N=100), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"696\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eNGASR\u003csup\u003ea\u003c/sup\u003e Risk Level\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eAge in years (Mean \u0026plusmn; SD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eNGASR\u003csup\u003ea\u003c/sup\u003e Sum Score (Mean \u0026plusmn; SD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eNumber of Counselor Messages (Mean \u0026plusmn; SD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eNumber of Chatter Messages (Mean \u0026plusmn; SD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eNumber of Counseling Sessions (Mean \u0026plusmn; SD)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e16.8 \u0026plusmn; 2.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e2.12 \u0026plusmn; 1.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n 
\u003cp\u003e143.32 \u0026plusmn; 208.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e206.52 \u0026plusmn; 414.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e16.00 \u0026plusmn; 26.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e16.0 \u0026plusmn; 2.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e6.32 \u0026plusmn; 1.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e403.96 \u0026plusmn; 769.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e512.00 \u0026plusmn; 1071.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e40.00 \u0026plusmn; 75.59\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e16.83 \u0026plusmn; 2.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e10.16 \u0026plusmn; 0.86\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e506.50 \u0026plusmn; 781.38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e612.95 \u0026plusmn; 1004.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e47.62 \u0026plusmn; 68.96\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eVery High\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e17.32 \u0026plusmn; 2.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e14.56 \u0026plusmn; 2.39\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e571.20 \u0026plusmn; 624.92\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e643.48 \u0026plusmn; 723.56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e49.44 \u0026plusmn; 57.74\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cem\u003eNote\u003c/em\u003e: Values presented as Mean \u0026plusmn; Standard Deviation (SD) for age, NGASR\u003csup\u003ea\u003c/sup\u003e sum scores, message counts, and session counts. 
\u003csup\u003ea\u003c/sup\u003e NGASR = Nurses\u0026rsquo; Global Assessment of Suicide Risk\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2.\u003c/strong\u003e Human Inter-Rater Reliability Analysis Across Items and Risk Levels (N=400 ratings), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"839\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.8057%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric Type\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.8057%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eLow Risk\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6865%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eModerate Risk\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.5673%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eHigh Risk\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.5673%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eVery High Risk\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.5673%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSum Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.8057%;\"\u003e\n \u003cp\u003eHuman Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.8057%;\"\u003e\n \u003cp\u003e0.91 95% CI: [0.85, 0.97]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6865%;\"\u003e\n \u003cp\u003e0.67 95% CI: [0.57, 0.78]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.5673%;\"\u003e\n \u003cp\u003e0.63 95% CI: [0.51, 0.74]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 16.5673%;\"\u003e\n \u003cp\u003e0.76 95% CI: [0.66, 0.87]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.5673%;\"\u003e\n \u003cp\u003e-0.02 95% CI: [-0.16, 0.11]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cem\u003eNote\u003c/em\u003e: Values are Krippendorff\u0026apos;s \u0026alpha; coefficients with 95% confidence intervals. Analysis is based on ratings from four independent clinical experts. Cases of perfect agreement are coded as 1.0. Negative values indicate systematic disagreement.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3.\u003c/strong\u003e LLM\u003csup\u003ea\u003c/sup\u003e Inter-Rating Reliability and Observer Agreement of Risk Levels and Sum Score Compared Across Prompting Style and Temperature (N=48,000 per configuration), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"853\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRisk Level\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTemperature\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eChain of Thought\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFew-Shot\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eZero-Shot\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.97 95% CI: [0.93, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.98 95% CI: [0.95, 1.02]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.96 95% CI: [0.91, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.72 95% CI: [0.62, 0.83]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.78 95% CI: [0.68, 0.88]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.39 95% CI: [0.24, 0.53]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM 
Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.75 95% CI: [0.65, 0.85]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.93 95% CI: [0.87, 0.99]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.80 95% CI: [0.70, 0.91]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.69 95% CI: [0.58, 0.81]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.78 95% CI: [0.68, 0.89]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.42 95% CI: [0.27, 0.57]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.61 95% CI: [0.50, 0.72]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.87 95% CI: [0.80, 0.95]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.77 95% CI: [0.66, 0.87]\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.67 95% CI: [0.54, 0.79]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.78 95% CI: [0.67, 0.88]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.41 95% CI: [0.26, 0.56]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.98 95% CI: [0.94, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.97 95% CI: [0.94, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.96 95% CI: [0.91, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.33 95% CI: [0.19, 0.48]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.33 95% CI: [0.18, 0.47]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.30 95% CI: [0.15, 0.45]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.40 95% CI: [0.28, 0.53]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.83 95% CI: [0.74, 0.92]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.78 95% CI: [0.68, 0.88]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.23 95% CI: [0.06, 0.41]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.31 95% CI: [0.16, 0.46]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.25 95% CI: [0.10, 0.40]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.27 95% CI: [0.15, 0.39]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.55 95% CI: [0.42, 0.69]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.68 95% CI: [0.56, 0.79]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.20 95% CI: [0.03, 0.37]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.30 95% CI: [0.13, 0.46]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.26 95% CI: [0.10, 0.42]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.94 95% CI: 
[0.89, 1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.96 95% CI: [0.92, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1.00 95% CI: [--, --]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.55 95% CI: [0.41, 0.70]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.53 95% CI: [0.39, 0.67]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.67 95% CI: [0.55, 0.80]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.58 95% CI: [0.46, 0.70]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.80 95% CI: [0.70, 0.90]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.95 95% CI: [0.90, 1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n 
\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.34 95% CI: [0.17, 0.51]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.47 95% CI: [0.30, 0.64]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.66 95% CI: [0.53, 0.79]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.50 95% CI: [0.38, 0.62]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.49 95% CI: [0.35, 0.62]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.88 95% CI: [0.80, 0.96]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.35 95% CI: [0.17, 0.52]\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.34 95% CI: [0.15, 0.52]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.65 95% CI: [0.52, 0.78]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eVery High\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.95 95% CI: [0.91, 1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.97 95% CI: [0.92, 1.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1.00 95% CI: [--, --]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.73 95% CI: [0.61, 0.85]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.78 95% CI: [0.67, 0.89]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.75 95% CI: [0.64, 0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.77 95% CI: [0.68, 0.87]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.91 95% CI: [0.85, 0.98]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1.00 95% CI: [--, --]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.74 95% CI: [0.62, 0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.71 95% CI: [0.58, 0.83]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.75 95% CI: [0.64, 0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.69 95% CI: [0.58, 0.80]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n 
\u003cp\u003e0.80 95% CI: [0.72, 0.89]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1.00 95% CI: [--, --]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.72 95% CI: [0.60, 0.84]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.69 95% CI: [0.57, 0.81]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.75 95% CI: [0.64, 0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eSum Score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.78 95% CI: [0.66, 0.89]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.86 95% CI: [0.76, 0.95]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.90 95% CI: [0.81, 0.98]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n 
\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.59 95% CI: [-0.71, -0.47]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.58 95% CI: [-0.65, -0.50]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.40 95% CI: [-0.58, -0.22]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.18 95% CI: [-0.30, -0.06]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.41 95% CI: [0.25, 0.56]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.34 95% CI: [0.19, 0.50]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.90 95% CI: [-0.94, -0.85]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.76 95% CI: [-0.83, 
-0.69]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.64 95% CI: [-0.76, -0.51]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eLLM Reliability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.36 95% CI: [-0.41, -0.32]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.12 95% CI: [-0.26, 0.01]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e0.11 95% CI: [-0.04, 0.26]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003eObserver Agreement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.90 95% CI: [-0.94, -0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.85 95% CI: [-0.94, -0.77]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6667%;\"\u003e\n \u003cp\u003e-0.74 95% CI: [-0.83, -0.64]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cem\u003eNote:\u0026nbsp;\u003c/em\u003eValues shown as Mean with 95% Confidence Intervals. 
LLM Reliability represents agreement among LLM ratings (llm_\u0026alpha;); Observer Agreement represents regression-bias-corrected agreement between human and LLM ratings (corrected_\u0026alpha;). All metrics were calculated using Krippendorff\u0026apos;s \u0026alpha;, with perfect agreement coded as 1.0. Negative values indicate systematic disagreement.\u003c/p\u003e\n\u003cp\u003ea LLM = Large Language Model\u003c/p\u003e\n\u003cp\u003eTable 4. Balanced Accuracy, Sensitivity and Specificity per Operational Configuration Across Risk Levels (N=48,000 per configuration), assessing the presence of suicide risk factors based on n=100 session transcripts of German Youth Crisis Helpline Users between 2021-11-30 and 2022-04-30\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"909\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRisk Level\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTemp\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eChain of Thought\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFew-Shot\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eZero-Shot\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eLow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n 
\u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.69 95% CI: [0.62-0.79]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.71 95% CI: [0.64-0.75]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.71 95% CI: [0.67-0.79]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.50 95% CI: [0.20-0.66]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.50 95% CI: [0.35-0.77]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.93 95% CI: [0.81-1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.88 95% CI: [0.84-0.90]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.92 95% CI: [0.90-0.95]\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.49 95% CI: [0.42-0.61]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.71 95% CI: [0.60-0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.68 95% CI: [0.55-0.79]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.71 95% CI: [0.67-0.77]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.56 95% CI: [0.21-0.78]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.43 95% CI: [0.16-0.63]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.93 95% CI: [0.83-1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" 
style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.85 95% CI: [0.81-0.92]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.92 95% CI: [0.87-0.98]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.49 95% CI: [0.41-0.56]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.66 95% CI: [0.57-0.74]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.69 95% CI: [0.56-0.76]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.72 95% CI: [0.70-0.78]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.56 95% CI: [0.37-0.84]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.43 95% CI: [0.30-0.56]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.93 95% 
CI: [0.76-1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.76 95% CI: [0.71-0.83]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.95 95% CI: [0.90-0.97]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.52 95% CI: [0.46-0.60]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.53 95% CI: [0.41-0.67]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.54 95% CI: [0.49-0.59]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.44 95% CI: [0.33-0.51]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.40 95% CI: [0.29-0.58]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.42 95% CI: [0.26-0.55]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.25 95% CI: [0.15-0.47]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.66 95% CI: [0.59-0.69]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.67 95% CI: [0.58-0.74]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.63 95% CI: [0.56-0.69]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.51 95% CI: [0.43-0.58]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.56 95% CI: [0.49-0.65]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.44 95% CI: [0.40-0.60]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd 
valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.44 95% CI: [0.36-0.54]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.46 95% CI: [0.31-0.61]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.25 95% CI: [0.08-0.34]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.58 95% CI: [0.53-0.72]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.66 95% CI: [0.59-0.73]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.63 95% CI: [0.58-0.69]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.49 95% CI: 
[0.40-0.56]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.55 95% CI: [0.48-0.57]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.43 95% CI: [0.34-0.45]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.32 95% CI: [0.29-0.49]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.46 95% CI: [0.37-0.57]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.25 95% CI: [0.12-0.47]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.66 95% CI: [0.60-0.76]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.64 95% CI: [0.56-0.70]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.61 95% CI: [0.54-0.73]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.51 95% CI: [0.42-0.58]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.59 95% CI: [0.47-0.72]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.49 95% CI: [0.45-0.52]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.23 95% CI: [0.09-0.33]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.48 95% CI: [0.33-0.65]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.05 95% CI: [0.00-0.11]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.80 95% CI: [0.78-0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
16.6117%;\"\u003e\n \u003cp\u003e0.70 95% CI: [0.62-0.74]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.93 95% CI: [0.89-0.97]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.47 95% CI: [0.43-0.53]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.66 95% CI: [0.52-0.70]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.49 95% CI: [0.44-0.50]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.14 95% CI: [0.05-0.17]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.57 95% CI: [0.42-0.76]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.05 95% CI: [0.00-0.12]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n 
\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.80 95% CI: [0.76-0.86]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.74 95% CI: [0.66-0.82]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.93 95% CI: [0.88-0.97]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.49 95% CI: [0.39-0.54]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.67 95% CI: [0.55-0.74]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.48 95% CI: [0.44-0.55]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.23 95% CI: [0.05-0.27]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.62 95% CI: [0.50-0.73]\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.05 95% CI: [0.00-0.22]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.74 95% CI: [0.69-0.82]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.71 95% CI: [0.66-0.82]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.91 95% CI: [0.85-0.95]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eVery High\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.60 95% CI: [0.52-0.66]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.61 95% CI: [0.58-0.65]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.53 95% CI: [0.52-0.58]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.30 95% CI: [0.17-0.43]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.31 95% CI: [0.09-0.53]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.06 95% CI: [0.00-0.10]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.89 95% CI: [0.81-0.92]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.92 95% CI: [0.86-0.96]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e1.00 95% CI: [1.00-1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.58 95% CI: [0.55-0.66]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.60 95% CI: [0.58-0.67]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.53 95% CI: 
[0.52-0.58]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.27 95% CI: [0.15-0.44]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.28 95% CI: [0.12-0.43]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.06 95% CI: [0.03-0.15]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.89 95% CI: [0.84-0.95]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.92 95% CI: [0.88-0.94]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e1.00 95% CI: [1.00-1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eBalanced Accuracy\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.61 95% CI: [0.55-0.64]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.61 95% CI: [0.56-0.65]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.53 95% CI: [0.50-0.56]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.27 95% CI: [0.15-0.37]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.28 95% CI: [0.28-0.46]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.06 95% CI: [0.00-0.11]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.7217%;\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.95 95% CI: [0.91-0.98]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e0.93 95% CI: [0.88-0.95]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 16.6117%;\"\u003e\n \u003cp\u003e1.00 95% CI: [1.00-1.00]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cem\u003eNote:\u0026nbsp;\u003c/em\u003eValues shown as Mean with 95% Confidence Intervals. Balanced Accuracy represents the mean of sensitivity and specificity. Best values per metric within each prompting style and temperature combination are shown in bold. Temperature affects model output randomness (0 = deterministic, 1 = most random).\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"natural language processing, retrieval augmentation, machine learning, psychometry, benchmarking","lastPublishedDoi":"10.21203/rs.3.rs-6210376/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6210376/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground\u003c/strong\u003e: Large Language Models' (LLMs) potential for psychological diagnostics requires systematic evaluation.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eObjective\u003c/strong\u003e: To investigate conditions for reliable and valid psychological assessments, focusing on suicide risk evaluation in clinical data by comparing LLM-generated ratings with human expert ratings across configurations.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods\u003c/strong\u003e: We analyzed 100 youth crisis conversation transcripts rated by four experts using the Nurses Global Assessment of Suicide Scale (NGASR). Using Mixtral-8x7b-Instruct, we generated ratings across three temperature settings and prompting styles (zero-shot, few-shot, chain-of-thought). 
Across configurations we compared a) inter-rating reliability for AI-generated NGASR risk and sum scores, b) LLM-to-human observer agreement regarding sum score, risk category, and item, using Krippendorff's α, c) classification metrics of risk categories and individual items against human ratings.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults\u003c/strong\u003e: LLM configuration strongly influenced assessment reliability. Zero-shot prompting at temperature 0 yielded perfect inter-rating reliability (α=1.00, 95% CI: [1-1] for high \u0026amp; very high risk), while few-shot prompting showed best human-AI agreement for very high risk (α=0.78, 95% CI: [0.67-0.89]) and strongest classification performance (balanced accuracy 0.54-0.71). Lower temperatures consistently improved reliability and accuracy. However, critical clinical items showed poor validity.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDiscussion\u003c/strong\u003e: Our findings establish optimal conditions (zero temperature, task-specific prompting) for LLM-based psychological assessment. 
However, inconsistent clinical item performance and only moderate to-human agreement limit LLMs to initial screening rather than detailed assessment, requiring careful parameter control and validation.\u003c/p\u003e","manuscriptTitle":"Automated Suicide Risk Factor Monitoring in Crisis Text Line Users: Comparative Study of AI and Human Ratings Using Large Language Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-13 07:27:13","doi":"10.21203/rs.3.rs-6210376/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-06-10T16:04:05+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-29T19:38:47+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-25T17:00:58+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-21T08:58:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"148960425618300237814440080070986748420","date":"2025-05-09T12:58:21+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"198616586878310287447959706354477849797","date":"2025-05-09T08:30:16+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"340186919615314683414047919658649075285","date":"2025-05-08T20:32:28+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"56388866891441873652346940108183731196","date":"2025-05-08T20:22:38+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-05-08T18:30:55+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-05-08T18:27:45+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-03-22T07:17:17+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-03-21T04:32:32+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific 
Reports","date":"2025-03-12T08:43:33+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"b74d5735-032f-4e83-9be2-ffef5fa01da1","owner":[],"postedDate":"May 13th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":48313866,"name":"Biological sciences/Psychology"},{"id":48313867,"name":"Health sciences/Health care"}],"tags":[],"updatedAt":"2025-11-17T15:59:41+00:00","versionOfRecord":{"articleIdentity":"rs-6210376","link":"https://doi.org/10.1038/s41598-025-22402-7","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-11-10 15:57:08","publishedOnDateReadable":"November 10th, 2025"},"versionCreatedAt":"2025-05-13 07:27:13","video":"","vorDoi":"10.1038/s41598-025-22402-7","vorDoiUrl":"https://doi.org/10.1038/s41598-025-22402-7","workflowStages":[]},"version":"v1","identity":"rs-6210376","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6210376","identity":"rs-6210376","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
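The table note defines balanced accuracy as the mean of sensitivity and specificity. As a minimal sketch of that arithmetic, the Python snippet below recomputes it for the three temperature-1 cells that immediately precede the "Very High" block in the table. The function name `balanced_accuracy` and the `rows` layout are illustrative assumptions, not part of the authors' analysis code. Because the published values are rounded to two decimals, a small tolerance is used when comparing against them.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity, as defined in the table note."""
    return (sensitivity + specificity) / 2.0


# (sensitivity, specificity, reported balanced accuracy), copied from the
# temperature-1 rows that precede the "Very High" block; one tuple per
# table column. The tuple layout is a hypothetical convenience format.
rows = [
    (0.23, 0.74, 0.49),
    (0.62, 0.71, 0.67),
    (0.05, 0.91, 0.48),
]

# Absolute differences between recomputed and reported values.
differences = []
for sens, spec, reported in rows:
    computed = balanced_accuracy(sens, spec)
    differences.append(abs(computed - reported))
    print(f"sens={sens:.2f} spec={spec:.2f} "
          f"computed={computed:.3f} reported={reported:.2f}")
```

A balanced accuracy near 0.50 paired with high specificity and very low sensitivity, as in the third tuple, indicates that the classifier rarely assigns the risk category at all rather than that it discriminates well.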