A hybrid-reasoner LLM framework toward real-world clinical decision-making support in acute ischemic stroke

Research Square

Bicong Yan, Ruipeng Zhang, Yanfeng Fan, Ying Li, Li Chen, Xinyu Song, and 8 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7998391/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with outcomes critically dependent on timely and accurate treatment. Yet prognosis is undermined by uneven expertise, resource shortages, and diagnostic delays. Large language models (LLMs), through rapid and accurate interpretation, may bridge this gap and improve care delivery.
Here, we developed a hybrid-reasoner framework that augments LLMs with structured clinical reasoning to reliably support time-critical, guideline-concordant AIS emergency decision-making, particularly where timely and accurate care is needed in resource-limited settings. We present the first multicenter, multisource, cross-scenario evaluation encompassing both retrospective and prospective real-world clinical cases, as well as literature-derived cases. Across model scales, framework augmentation yielded consistent and substantial gains in treatment accuracy, with an average improvement of 18.9% over standalone LLMs. Safety evaluation showed that the augmented DeepSeek-R1-671B achieved low hallucination (10.9%) and omission (14.7%) rates and a high overall safety score (4.36/5). Notably, human–AI interaction experiments revealed that junior and non-specialist physicians benefited most, narrowing expertise gaps. Collectively, these findings demonstrate that hybrid-reasoner augmented LLMs enhance accuracy, safety, and guideline-concordant decision-making in AIS. This study marks the transition from technical optimization to real-world translation, laying the groundwork for lightweight, safe, and equitable integration of LLMs into stroke center networks, telemedicine, and resource-limited settings.

Subject areas: Biological sciences/Computational biology and bioinformatics; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research

Keywords: acute ischemic stroke; large language model; decision support; clinical safety

Introduction

Stroke remains the second leading cause of both disability and death worldwide, with its greatest burden borne by low- and middle-income countries 1, 2.
Acute ischemic stroke (AIS) requires ultra-rapid, precise decision-making within narrow therapeutic windows to optimize outcomes, yet stroke care resources remain profoundly uneven, reinforcing inequities in access to timely treatment. Global disparities in expertise and diagnostic efficiency lead to delays, misdiagnoses, and suboptimal care. These gaps are driven by socioeconomic inequality, geographic barriers, and weak health systems in developing regions, while even high-income countries face workforce shortages that restrict timely access to care 3–6. At the same time, emergency departments worldwide are increasingly overwhelmed 5, placing physicians under intense pressure to integrate multimodal clinical and imaging information with therapeutic indications and contraindications. This process is inherently prone to error and inefficiency, ultimately worsening outcomes 7. These systemic challenges not only compromise individual outcomes but also reinforce inequities in stroke care, making equitable access a pressing global priority. Although the World Stroke Organization Global Declaration on Stroke calls for equitable, evidence-based stroke systems worldwide, efforts to expand manpower and training have so far yielded only modest and uneven progress 8, 9.

Large language models (LLMs) integrated into clinical routines hold the potential to transform doctor–patient interactions 10, offering a compelling solution to these critical challenges. LLMs have demonstrated strong capabilities in diagnostic reasoning, differential generation, and information synthesis 11–15. However, the critical question remains: can LLMs reliably perform in the high-stakes, time-pressured reality of emergency stroke care, thereby advancing equitable access and real-world clinical translation? Current evidence is insufficient.
Most studies remain benchmark- or theory-driven, focused on tasks such as MedQA (United States Medical Licensing Examination) or other benchmark datasets that fail to reflect real-world, guideline-based clinical workflows 16, 17. This gap leaves LLM research promising in principle but unproven in practice, underscoring the urgent need for systematic, real-world evaluation of LLMs in AIS care.

Our framework augments LLMs with structured clinical reasoning to enhance guideline-concordant therapy selection, support accurate decision-making in time-critical emergency AIS care, and mitigate the effects of uneven resources and expertise that contribute to disparities in stroke outcomes. Our contribution is to move beyond technical benchmarks by presenting the first multicenter, multisource evaluation of LLMs on real-world AIS cases, strengthened through a hybrid-reasoner framework. Importantly, we also designed a human–AI interaction experiment across physicians of varying seniority and specialties, demonstrating that LLM assistance narrows expertise gaps and enables less experienced clinicians to deliver care that is closer to specialist-level decision-making. Collectively, these advances mark an important step toward safe, lightweight, and equitable deployment of LLMs in AIS care, establishing a foundation for their integration into stroke center networks, telemedicine pathways, and underserved health systems.

Results

Study design

The overall study design and data sources are summarized in Fig. 1. This schematic highlights the integration of multicenter retrospective and prospective real-world clinical cases with PubMed case reports, the application of the hybrid-reasoner framework, and the evaluation of both model performance and human–AI interaction.
Patient Characteristics

We analyzed 2,081 cases (mean age, 68.0 ± 13.6 years; 761 women): 1,937 clinical cases collected retrospectively and prospectively between January 2018 and May 2025, and 144 PubMed case reports identified from January 2024 to January 2025 (Table 1). At Center A, 1,228 cases were screened retrospectively and 1,055 were included (Group A), whereas at Center B, 938 were screened and 721 were included (Group B). In addition, 213 prospective cases were screened at Center A between February and May 2025, of which 161 were included (Group D). Of 327 PubMed case reports retrieved, 144 met the inclusion criteria (Group C). Detailed inclusion and exclusion criteria are shown in Supplementary Figure S1.

Table 1. Characteristics of enrolled patients.

| Variables | Group A (Center A) | Group B (Center B) | Group C (PubMed cases) | Group D (Center A) | Overall |
|---|---|---|---|---|---|
| Cases (n) | 1055 | 721 | 144 | 161 | 2081 |
| Age (years, mean ± SD) | 69.86 ± 13.26 | 66.55 ± 11.76 | 56.46 ± 18.73 | 72.32 ± 12.22 | 67.99 ± 13.62 |
| Male/Female (n, %) | 665/390 (63.03/36.97%) | 485/236 (67.27/32.73%) | 71/71 # (50/50%) | 97/64 (60.25/39.75%) | 1318/761 (63.40/36.60%) |
| Disease categories (n, %) | | | | | |
| AIS | 998 (94.60%) | 610 (84.60%) | 127 (89.44%) | 159 (98.76%) | 1894 (91.10%) |
| Cerebral hemorrhage | 26 (2.46%) | 64 (8.88%) | 1 (0.70%) | 1 (0.62%) | 92 (4.43%) |
| Epilepsy | 8 (0.76%) | 18 (2.50%) | 1 (0.65%) | 0 | 27 (1.18%) |
| Arterial aneurysm | 4 (0.38%) | 3 (0.42%) | 0 | 1 (0.62%) | 8 (0.38%) |
| TIA | 9 (0.85%) | 22 (3.05%) | 4 (2.82%) | 0 | 35 (1.68%) |
| Other non-AIS diseases | 10 (0.95%) | 4 (0.55%) | 11 (7.75%) | 0 | 25 (1.20%) |
| AIS treatment (n, %) | | | | | |
| Thrombolysis | 135 (13.54%) | 135 (22.13%) | 17 (13.39%) | 20 (12.58%) | 307 (16.16%) |
| Endovascular thrombectomy* | 297 (29.79%) | 125 (20.49%) | 44 (34.56%) | 27 (16.98%) | 493 (26.04%) |
| Standard medical management | 565 (56.67%) | 350 (57.38%) | 66 (51.97%) | 112 (70.44%) | 1093 (57.74%) |
| TOAST diagnosis of AIS (n, %) | | | | | |
| Large artery atherosclerosis | 760 (76.23%) | 494 (80.98%) | 41 (32.80%) | 99 (62.26%) | 1394 (73.72%) |
| Cardioembolism | 98 (9.83%) | 40 (6.56%) | 28 (22.40%) | 21 (13.21%) | 187 (9.89%) |
| Small vessel occlusion | 95 (9.53%) | 37 (6.07%) | 6 (4.8%) | 37 (23.27%) | 175 (9.25%) |
| Other causes | 26 (2.61%) | 15 (2.46%) | 45 (36.0%) | 1 (0.63%) | 87 (4.60%) |
| Cryptogenic | 18 (1.81%) | 24 (3.93%) | 5 (4.0%) | 1 (0.63%) | 48 (2.54%) |

This table summarizes the baseline characteristics of patients across study groups. Group A and Group B represent retrospective datasets from Center A and Center B, respectively; Group C includes case reports extracted from PubMed; and Group D represents prospective cases enrolled at Center A. The "Overall" column presents aggregated data across all groups. Variables include demographic and clinical characteristics collected at baseline.
* Including bridging therapy (thrombolysis + endovascular thrombectomy).
# Two cases with an absence of gender reporting.
AIS, Acute Ischemic Stroke; TOAST, Trial of Org 10172 in Acute Stroke Treatment; TIA, Transient Ischemic Attack.

Baseline disparities and framework-induced improvements

Performance varied markedly by model scale. Figure 2a,b and Supplementary Figure S2 illustrate the interactive interfaces for the standalone LLM and the hybrid-reasoner augmented LLM. Standalone larger-scale LLMs consistently outperformed smaller ones in both treatment recommendation and TOAST classification (Supplementary Table S1). For example, GPT-OSS-120B achieved 0.737 accuracy for treatment recommendation, compared with 0.546 for Baichuan-M1-14B and 0.433 for GPT-OSS-20B (all adjusted P < 0.0001). Similar gaps were observed for TOAST classification, with DeepSeek-R1 surpassing all smaller models.
These results establish model scale as a key determinant of baseline performance.

Augmentation with the hybrid-reasoner framework substantially improved outcomes across all models (Figs. 2c–h and 3a; Table 2; Supplementary Figure S3), with an average improvement of 18.9% over standalone LLMs. In Group A, Baichuan-M1-14B accuracy increased by 25.8% (F1 +15.1%), while DeepSeek-R1-671B improved more modestly (+23.3%/+12.7%). For TOAST classification, GPT-OSS-20B accuracy rose by 39.0% (F1 +4.9%), compared with only 2.5% (F1 −2.1%) for GPT-OSS-120B. Notably, GPT-OSS-20B exhibited unstable outputs with fluctuating gains. Collectively, these findings show that while scale remains critical for baseline accuracy, the hybrid-reasoner framework reduces scale-related disparities, enabling smaller models to approach the clinical utility of their larger counterparts.

Table 2. Accuracy and F1 scores of LLMs in treatment recommendation and TOAST classification.

| Model | Metric | Standalone LLM | Hybrid-reasoner framework augmented LLM | P value |
|---|---|---|---|---|
| Baichuan-M1-14B | Accuracy of treatment | 0.570 (0.547, 0.591) | 0.695 (0.675, 0.715) | < 0.001 |
| Baichuan-M1-14B | F1 score of treatment | 0.726 (0.707, 0.743) | 0.820 (0.806, 0.834) | < 0.001 |
| Baichuan-M1-14B | Accuracy of TOAST | 0.623 (0.602, 0.645) | 0.657 (0.637, 0.677) | 0.10 |
| Baichuan-M1-14B | F1 score of TOAST | 0.365 (0.334, 0.396) | 0.294 (0.273, 0.362) | < 0.001 |
| GPT-OSS-20B | Accuracy of treatment | 0.464 (0.441, 0.486) | 0.541 (0.519, 0.56) | < 0.001 |
| GPT-OSS-20B | F1 score of treatment | 0.634 (0.612, 0.654) | 0.702 (0.684, 0.718) | < 0.001 |
| GPT-OSS-20B | Accuracy of TOAST | 0.530 (0.507, 0.553) | 0.623 (0.602, 0.645) | < 0.001 |
| GPT-OSS-20B | F1 score of TOAST | 0.261 (0.24, 0.282) | 0.265 (0.244, 0.329) | < 0.001 |
| Qwen2.5-32B | Accuracy of treatment | 0.593 (0.574, 0.615) | 0.693 (0.674, 0.713) | < 0.001 |
| Qwen2.5-32B | F1 score of treatment | 0.745 (0.729, 0.762) | 0.819 (0.805, 0.832) | < 0.001 |
| Qwen2.5-32B | Accuracy of TOAST | 0.645 (0.622, 0.666) | 0.706 (0.686, 0.726) | < 0.001 |
| Qwen2.5-32B | F1 score of TOAST | 0.316 (0.288, 0.343) | 0.358 (0.328, 0.386) | < 0.001 |
| DeepSeek-R1-Distill-Qwen-32B | Accuracy of treatment | 0.599 (0.577, 0.62) | 0.734 (0.716, 0.754) | < 0.001 |
| DeepSeek-R1-Distill-Qwen-32B | F1 score of treatment | 0.749 (0.732, 0.765) | 0.847 (0.834, 0.86) | < 0.001 |
| DeepSeek-R1-Distill-Qwen-32B | Accuracy of TOAST | 0.643 (0.622, 0.663) | 0.689 (0.669, 0.708) | < 0.001 |
| DeepSeek-R1-Distill-Qwen-32B | F1 score of TOAST | 0.689 (0.669, 0.708) | 0.360 (0.339, 0.38) | < 0.001 |
| GPT-OSS-120B | Accuracy of treatment | 0.724 (0.706, 0.744) | 0.825 (0.808, 0.84) | < 0.001 |
| GPT-OSS-120B | F1 score of treatment | 0.84 (0.828, 0.853) | 0.904 (0.894, 0.913) | < 0.001 |
| GPT-OSS-120B | Accuracy of TOAST | 0.690 (0.668, 0.71) | 0.711 (0.69, 0.732) | < 0.001 |
| GPT-OSS-120B | F1 score of TOAST | 0.444 (0.412, 0.475) | 0.436 (0.407, 0.462) | < 0.001 |
| DeepSeek-R1-671B | Accuracy of treatment | 0.685 (0.667, 0.704) | 0.830 (0.813, 0.847) | < 0.001 |
| DeepSeek-R1-671B | F1 score of treatment | 0.813 (0.8, 0.827) | 0.907 (0.897, 0.917) | < 0.001 |
| DeepSeek-R1-671B | Accuracy of TOAST | 0.680 (0.659, 0.701) | 0.758 (0.738, 0.777) | < 0.001 |
| DeepSeek-R1-671B | F1 score of TOAST | 0.460 (0.429, 0.492) | 0.494 (0.465, 0.522) | < 0.001 |

The table reports accuracy and F1 scores for standalone and framework-augmented LLMs, with corresponding P values for paired comparisons. Values are presented as point estimates, with 95% confidence intervals shown in parentheses.

Framework-driven gains and cross-group validation

Generalized linear mixed model analysis in Group A confirmed that the hybrid-reasoner framework was independently associated with higher accuracy in AIS treatment recommendations (odds ratio = 1.72, P < 0.001) and increased selection of appropriate reperfusion strategies, thereby improving guideline concordance (Fig. 3b). Validation in Groups B, C, and D demonstrated consistent performance gains across all groups (Figs. 2c–h and 3a; Supplementary Tables S2–S4). The strongest results were observed in Group D, where framework-augmented DeepSeek-R1-671B achieved the highest accuracy in both treatment recommendation and TOAST classification. In Group C, GPT-4o gained accuracy in treatment recommendation (+14.9%, 0.750 vs. 0.653) but showed a decline in TOAST classification (−6.7%, 0.486 vs.
0.521), reflecting the predominance of "other causes" and cryptogenic subtypes. Collectively, these findings confirm that the framework provides reliable gains across heterogeneous cohorts.

Qualified LLM Reasoning: Ensuring Safe Clinical Deployment

To look beyond accuracy, we visualized step-by-step reasoning traces for typical success and failure cases (Fig. 4a,b). Because accuracy alone provides an incomplete view of clinical applicability, we also assessed output safety: the framework yielded higher safety scores (4.36 vs. 4.02) and lower hallucination (3.1% vs. 4.7%) and omission rates (10.3% vs. 16.6%) than the standalone model (Fig. 4c–e; Supplementary Table S5). These findings demonstrate that structured reasoning improves reliability and mitigates the risk of unsafe content entering clinical workflows.

LLM Support Benefits Less Experienced Physicians

Integration of LLM support substantially enhanced physician performance, with the most pronounced gains observed in less experienced doctors (0.667 to 0.833 in junior specialists; 0.600 to 0.846 in non-specialists). Physicians also rated the assistance positively (mean score 3.849/5). Improvements in senior and specialist physicians were more modest, reflecting a ceiling effect at higher baseline performance levels (Fig. 5).

Discussion

This study marks a critical step toward the clinical translation of LLMs in AIS care. We present the first multicenter, multisource, cross-scenario evaluation of retrospective and prospective real-world clinical cases and literature-derived cases, showing that LLMs augmented with our hybrid-reasoner framework improve guideline-concordant therapy selection in emergency AIS care and help reduce disparities in treatment delivery.
Framework-augmented LLMs produced more instruction-adherent, lower-error reasoning outputs and, in human–AI interaction experiments, narrowed expertise gaps by most benefiting junior and non-specialist physicians. Collectively, these findings highlight the potential of LLM-integrated systems to promote guideline-based practice, bridge expertise gaps, support stroke centers and telemedicine networks, and extend equitable decision support to low-resource settings. Taken together, our findings highlight a clear shift from performance gains to real-world application, supporting the practical and equitable clinical adoption of LLMs.

Global inequities in stroke care remain profound, with low- and middle-income countries constrained by limited resources and specialist expertise, while high-income countries face overcrowded emergency services and workforce pressures. Existing evaluations of LLMs have been largely confined to simplified benchmark tasks that fail to capture the real-world complexity of disease management 16, 20, 21. To address this gap, we applied LLMs to real-world clinical and literature-derived cases, thereby simulating authentic clinical scenarios and reflecting heterogeneous contexts. Performance declined in complex settings, partly due to limitations in handling long token inputs 18, 19. To address this challenge, our hybrid-reasoner framework incorporated a purpose-designed summarization agent that extracted salient features from clinical narratives, improving accuracy across diverse scenarios. Augmented LLMs achieved accuracies ranging from 0.541 to 0.830 across model scales, a level comparable to Q&A-style or simulated patient scenarios 13, 20 and consistently higher than standalone LLMs.

This study marks an initial step toward the clinical translation of LLMs in AIS, aiming to reduce inequitable AIS care delivery.
Prior work suggests that chain-of-thought (CoT) prompting and fixed-answer formats can partially mitigate these deficits 21, 22. Our hybrid-reasoner framework introduced a workflow-oriented, guideline-concordant structure that improved accuracy and maintained consistency with clinical recommendations for feasible clinical deployment. Unlike approaches that depend on ever-larger proprietary models 23, our framework demonstrates that carefully designed, lightweight optimization can deliver substantial gains, lowering barriers to adoption and aligning with the goal of equitable access.

Our evaluation offers one of the most comprehensive simulations of clinical practice to date, establishing a foundation for the deployment of LLMs in AIS workflows. While the promise is substantial, deployment at scale carries risks of unintended harmful consequences 24, 25. By integrating multicenter, multi-tier hospitals, retrospective and prospective real-world cases, literature-derived high-difficulty cases, and human–AI interactions across physicians of different levels and specialties, our evaluation captured the heterogeneity and complexity of AIS care. Notably, LLM assistance benefited less experienced physicians most, narrowing expertise gaps across experience, specialty, and geography, and aligning with prior reports of near expert-level performance 26, 27. In alignment with the World Stroke Organization's Global Stroke Declaration, which underscores that quality stroke care should be universal 28, 29, our results advance the case for AI-enabled strategies to promote equity in global stroke systems.

This study systematically assessed safety, a critical prerequisite for clinical deployment. Because LLMs predict the next token without verifying evidence, they remain prone to hallucinations that can erode trust and generate harmful or misleading recommendations 30, 31.
In our evaluation, hallucinations and omissions were not eliminated but occurred at relatively low frequencies, with the hybrid-reasoner DeepSeek-R1-671B achieving a hallucination rate of 10.9%, an omission rate of 14.7%, and an overall clinical safety score of 4.36/5. These findings indicate that while safety concerns remain, the error rates are within a range that may be acceptable for decision-support use, supporting the feasibility of cautious clinical integration 32. Hybrid-reasoner augmented LLMs further demonstrated stronger instruction adherence, lower hallucination and omission rates, and more reliable structured outputs, thereby addressing a key barrier to real-world deployment.

This study has several limitations. Most importantly, the evaluation did not involve real-time deployment in emergency-room workflows. Although diverse clinical scenarios were simulated, the absence of bedside testing limits assessment of usability, clinician trust, and patient outcomes in live practice. In addition, findings are restricted to AIS patients treated in accordance with guideline-based care, and generalizability to other emergency conditions requires further validation. Finally, rapid iteration of commercial LLMs and restricted access to proprietary systems may affect reproducibility and long-term stability, underscoring the need for open, continuously benchmarked platforms.

This work represents a milestone in advancing the clinical translation of LLMs for AIS, providing empirical evidence for their safe, lightweight, and feasible deployment. Beyond stroke, it establishes a reproducible paradigm for real-world evaluation of medical foundation models. Future efforts should focus on real-world prospective validation, expansion to other acute care domains, and development of open-source, lightweight optimizations to lower adoption barriers.
By laying the groundwork for sustainable clinical AI, our study contributes to the long-term goal of reducing global disparities in stroke care and building next-generation healthcare infrastructures.

Methods

Ethics statement

This study was conducted in accordance with relevant ethical guidelines and regulations and with the Declaration of Helsinki. Approval was obtained from the Institutional Review Board of the Medical Faculty of Ethics Committee of Shanghai Sixth People's Hospital (approval no. 2024-KY-203). The study was registered in the Chinese Clinical Trial Registry (ChiCTR2400092800, http://www.chictr.org.cn/) on November 22, 2024. Informed consent was obtained from all participants.

Data collection

We retrospectively collected clinical cases from two tertiary care centers: 1,228 cases from Center A (Group A, January 2018–January 2025, tertiary grade A hospital) and 938 cases from Center B (Group B, May 2018–March 2025, tertiary grade B hospital). All cases were de-identified encounters of patients diagnosed with acute cerebrovascular disease. In addition, 327 stroke case reports were retrieved from PubMed between January 2024 and January 2025 (Group C). For prospective validation, 213 patients were consecutively enrolled at Center A between February and May 2025 (Group D).

The inclusion criteria were: (1) patients aged ≥ 18 years who were clinically suspected of having acute cerebrovascular disease; and (2) evaluation and management undertaken in participating hospitals, with neuroimaging performed on admission. The exclusion criteria were: (1) refusal or discontinuation of treatment (e.g., due to financial constraints, perceived risks, or transfer to another facility), or inability to provide informed consent for standardized, guideline-concordant therapy; (2) incomplete clinical information (e.g.,
missing chief complaint, auxiliary examinations); (3) urgent conditions requiring interventions more immediate than stroke; (4) non-acute cerebrovascular admissions in which stroke was identified only during hospitalization; and (5) patients in the chronic phase of cerebrovascular disease. Guideline adherence of all patient treatments was assessed by expert evaluation. Detailed inclusion and exclusion criteria for each group are provided in Supplementary Figure S1.

Experimental setups

Patient Cases

To ensure patient privacy, all personally identifiable information was removed. Each case was formatted as a single paragraph containing all or a subset of the following elements: patient age and sex, chief complaint, current symptoms, medical history (including illnesses and medications), relevant family history, physical examination findings, laboratory test results, and imaging reports.

Evaluation of Standalone LLMs

To evaluate the capacity of LLMs to generate treatment recommendations for AIS and to classify stroke subtypes according to the TOAST system, each model was asked to provide, for every case, a treatment recommendation and the corresponding TOAST classification. Seven models were tested: Baichuan-M1-14B 33, GPT-OSS-20B 34, Qwen2.5-32B 35, DeepSeek-R1-Distill-Qwen2.5-32B 36, GPT-OSS-120B 34, DeepSeek-R1-671B 36, and GPT-4o 37 (Supplementary Table S6). Retrospective, real-world clinical cases from Group A were used for this assessment. Outputs were produced in free-text format without predefined options, reflecting the probabilistic nature of clinical reasoning. All LLMs were evaluated in single-turn interactions using their default parameter configurations, without additional manual tuning.

Model inference was performed with the official vLLM framework, providing an optimized environment for efficient large-scale deployment. All models were executed on a cluster of eight NVIDIA H20-141GB GPUs using the official Docker release.
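The single-paragraph case format and the single-turn, free-text setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Case` fields mirror the elements listed under Patient Cases, but the class, the function names, and the prompt wording are hypothetical and are not the study's own prompts, which the authors state are provided in their repository. No inference library is called here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Case:
    """Hypothetical container for one de-identified case.

    Every element is optional because a case may contain
    "all or a subset" of the listed elements.
    """
    age_and_sex: Optional[str] = None
    chief_complaint: Optional[str] = None
    current_symptoms: Optional[str] = None
    medical_history: Optional[str] = None
    family_history: Optional[str] = None
    physical_examination: Optional[str] = None
    laboratory_results: Optional[str] = None
    imaging_reports: Optional[str] = None


def format_case(case: Case) -> str:
    """Join the available elements into a single paragraph."""
    parts = []
    for f in fields(case):
        value = getattr(case, f.name)
        if value:  # skip elements that are missing for this case
            label = f.name.replace("_", " ").capitalize()
            parts.append(f"{label}: {value.strip().rstrip('.')}.")
    return " ".join(parts)


def build_prompt(case: Case) -> str:
    """Build one single-turn, free-text prompt (wording is an assumption)."""
    return (
        "You are assisting with an emergency stroke assessment.\n"
        f"Case: {format_case(case)}\n"
        "Give one treatment recommendation and the TOAST classification, "
        "in free text."
    )
```

The resulting string would be sent once per case to each model with its default generation parameters; how that call is made depends on the serving stack and is deliberately left out.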
For version control, vLLM v0.8.4 was applied to all models except GPT-OSS-20B and GPT-OSS-120B, which were deployed under v0.10.1, thereby ensuring reproducibility and transparency across experiments.

To ensure consistent and reproducible evaluation, an automated grader agent was employed to quantify accuracy across all LLMs. The grader agent, instantiated using DeepSeek-R1, operated in two sequential steps. First, it extracted and categorized each output into one of three formats: (1) single treatment recommendation/diagnosis, (2) multiple decisions, or (3) no decision. Second, extracted responses were compared against ground-truth references, with treatment mapped to four predefined categories: Thrombolysis, Mechanical Thrombectomy, Standard Medical Therapy, and Non-Acute Ischemic Stroke or Non-Stroke Conditions. This standardized pipeline minimized subjectivity and ensured consistency across experiments.

Hybrid-reasoner Framework

In addition to standalone evaluations, we implemented a hybrid-reasoner framework to examine whether structured reasoning and constrained outputs could enhance model performance in AIS-specific tasks. The framework integrates three components: (1) a workflow-oriented summary agent that extracts disease-relevant evidence from lengthy clinical narratives, (2) a guideline-concordant reasoning-path CoT module that enforces structured diagnostic steps, and (3) a clinically inspired multiple-choice constraint mechanism that standardizes outputs within evidence-based decision boundaries.

Summarization agent. To mitigate performance degradation caused by lengthy case inputs and to ensure the extraction of salient details, a workflow-oriented summary agent was implemented. Using DeepSeek-R1 with few-shot prompting, this agent generated structured case summaries for downstream reasoning. Example prompts are shown in Fig. 2a,b, and further details are provided in Supplementary Figure S2.

Reasoning-path CoT.
The guideline-concordant, physician-style reasoning-path CoT module was designed as a concise sequence of four critical diagnostic steps, mirroring clinical decision trees. It was developed collaboratively with neurologists, interventional radiologists, and emergency physicians by restructuring existing clinical guidelines.

Multiple-choice constraint. For clinically inspired final decision-making, both standalone and framework-augmented models were evaluated using four predefined categories as answer options: Thrombolysis, Mechanical Thrombectomy, Standard Medical Therapy, and Non-Acute Ischemic Stroke or Non-Stroke Conditions. This constraint not only improved consistency but also aligned model outputs with clinically interpretable categories.

Comprehensive Multicenter, Multisource, Cross-scenario Evaluation

Validation using PubMed case reports and external cohorts

To clinically validate the framework, we analyzed real-world cases from Groups B and D using the six open-source LLMs. To further test generalizability across model families, we evaluated the same six open-source LLMs and an additional closed-source LLM in Group C (PubMed case reports). The PubMed case-report dataset is publicly accessible. The objective was to assess the generalizability of the hybrid-reasoner framework.

Prospective clinical validation with multi-level and multi-specialty physicians

To evaluate the clinical impact of LLM assistance, we conducted a prospective study at Center A. Twelve physicians with varying levels of AIS expertise were recruited from four regions (Hunan, Guangdong, Jilin, and Shanghai). Participants included 2 junior oncologists and 3 interventional neurologists ( 10 years of experience). The study followed a prospective design and ran between February and May 2025 (approval no. 2024-KY-203).
Consecutive patients who met the inclusion and exclusion criteria were randomly assigned at the time of enrollment to either LLM-assisted analysis (With AI) or standard review without LLM (Without AI). In the AI-assisted arm, the best-performing model generated treatment recommendations, TOAST classifications, and a structured reasoning process, which were combined with each patient's admission history. These case packages, with or without AI reasoner information, were then sequentially and randomly allocated to physicians of different seniority levels (junior, senior, and expert) and specialties (stroke specialists and non-specialists) for independent interpretation. Physician decisions were subsequently compared with the treatments patients ultimately received. The study design thus enabled a comparison of physician performance with and without AI assistance, allowing stratified analysis across different levels of clinical training and specialty background. Sample size estimation was based on the proportion of AIS patients and average weekly visits at the study site.

Outcome Assessment

Ground truth was defined as the treatment recommendation and corresponding TOAST classification documented in the clinical record, with independent verification for guideline concordance. The primary outcome was the overall therapeutic and TOAST diagnostic accuracy of LLMs. Secondary outcomes encompassed safety-related assessments, including instruction adherence, structured clinical safety evaluation (scored on a 1–5 Likert scale, anchored as follows: 1 = harmful/ineffective, 2 = minimal value, 3 = limited value [occasionally useful], 4 = useful [improves decision-making or efficiency], 5 = highly useful/transformative [consistently improves decision quality and/or workflow efficiency]), and the incidence of omissions and hallucinations.

Statistical Analysis

All analyses were performed in Python (v3.10). Statistical significance was set at two-sided p < 0.05.
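For paired correct/incorrect outcomes on the same cases (for example, a standalone and a framework-augmented model graded on identical inputs), a two-sided comparison at this threshold can be made with McNemar's test. The sketch below is a minimal standard-library illustration, not the authors' analysis code: it uses the exact binomial form of the test, which is an assumption because the text does not say whether the exact or chi-square variant was used, and the function names are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(standalone_ok: Sequence[bool],
                      augmented_ok: Sequence[bool]) -> Tuple[int, int]:
    """Count discordant pairs for two graders' results on the same cases.

    b: standalone correct, augmented wrong
    c: standalone wrong, augmented correct
    """
    if len(standalone_ok) != len(augmented_ok):
        raise ValueError("paired inputs must have the same length")
    b = sum(bool(s) and not a for s, a in zip(standalone_ok, augmented_ok))
    c = sum(bool(a) and not s for s, a in zip(standalone_ok, augmented_ok))
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    Under the null hypothesis, each discordant pair is equally likely
    to fall in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy data: 10 cases fixed by augmentation, none broken, 5 correct in both.
standalone = [False] * 10 + [True] * 5
augmented = [True] * 15
b, c = discordant_counts(standalone, augmented)
p_value = mcnemar_exact(b, c)
significant = p_value < 0.05  # two-sided threshold stated in the text
```

Only the discordant pairs carry information in this test; cases that both configurations get right, or both get wrong, do not affect the p-value.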
Binary and continuous variables were summarized descriptively. Accuracy differences between standalone and framework-augmented LLMs were tested with McNemar's test, and F1 score differences were estimated by bootstrap resampling (1,000 iterations) with 95% confidence intervals. Paired ordinal outcomes were compared using the Wilcoxon signed-rank test. To account for repeated measures, a generalized linear mixed-effects model with a binomial link was fitted to estimate the independent effect of framework augmentation on decision accuracy in the Group A dataset 38. Event rates, including omission and hallucination frequencies, and differences between AI-assisted and non-assisted arms were analyzed using χ² or Fisher's exact tests, as appropriate.

Declarations

Competing interests

The authors declare no competing interests.

Correspondence and requests for materials

Correspondence and requests for materials should be addressed to Yuehua Li.

Funding

This work was supported by the National Natural Science Foundation of China (No. 8225024), Key R&D sub project of the Ministry of Science and Technology (No. 2023YFF1204804), Shanghai Pudong New Area Science and Technology Commission Project (No. PKJ2023-Y53), Shanghai Jiaotong University, Medicine and engineering interdisciplinary program (No. YG2024LC08), and Shanghai key discipline of medical imaging (No. 2017ZZ02005).
Author Contribution

Bicong Yan: study conception and design, study implementation, data collection, data screening, manuscript drafting and revision.
Ruipeng Zhang: technical support, code development, program execution, protocol optimization, data analysis, manuscript revision.
Yanfeng Fan: program execution, data collection, data screening and verification, data evaluation.
Ying Li: data collection, data verification, data evaluation.
Li Chen: data screening, data evaluation, protocol optimization and protocol assessment.
Xinyu Song: statistical guidance, data evaluation.
Yixiao Tang, Yifan Tu, Zhongzheng Cao: data collection and clinical evaluation.
Li Shen: statistical guidance.
Mengfei Wang: data screening, technical support for coding.
Zhuo Li: data screening, technical support for coding.
Yijia Xiong: data collection, data evaluation.
Yuehua Li: project supervision, critical guidance, and funding acquisition (corresponding author).

Acknowledgement

This study was conducted over an extended period and required substantial human and material resources. We thank all patients and their families for their participation, as well as the open-source community for making LLMs publicly available. We are grateful to our collaborators for assistance with data collection and coding, and to the biostatistics experts for their valuable guidance in statistical methodology. We also acknowledge the strong support in clinical validation from Tao Wang, Hongmei Song, Daqian Zhang, Yingying Lu, Tonglei Fang, Xingxing Sun, Lu Fei, Yixiao Tang, Yifan Tu, Zhongzheng Cao, Fasheng Peng, Mengfan Yan, and Yuxiang Zhou.

Data Availability

The raw data supporting the findings of this study are available from the corresponding author upon reasonable request. For the PubMed case cohort, the original data are not directly shared in this work; instead, we provide the references to the corresponding open-access publications in our released code repository, from which the source data can be obtained.
Code availability

All code for this study is publicly available. The source code for model deployment, inference scripts, and trained model weights will be released at https://anonymous.4open.science/r/HR-LLM-Stroke. The following LLMs were evaluated: Baichuan-M1-14B, GPT-OSS-20B, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, GPT-OSS-120B, DeepSeek-R1-671B, and GPT-4o. All prompts used in this work are also included in https://anonymous.4open.science/r/HR-LLM-Stroke.

References

1. Vollset, S. E. et al. Burden of disease scenarios for 204 countries and territories, 2022–2050: a forecasting analysis for the Global Burden of Disease Study 2021. The Lancet 403, 2204–2256 (2024).
2. Feigin, V. L. et al. World Stroke Organization: Global Stroke Fact Sheet 2025. International Journal of Stroke 20, 132–144 (2025).
3. Shen, Y.-C., Sarkar, N. & Hsia, R. Y. Structural Inequities for Historically Underserved Communities in the Adoption of Stroke Certification in the United States. JAMA Neurol 79, 777 (2022).
4. Pandian, J. D. et al. Stroke systems of care in low-income and middle-income countries: challenges and opportunities. The Lancet 396, 1443–1451 (2020).
5. Avasarala, J. & Wesley, K. Optimization of acute stroke care in the emergency department: a call for better utilization of healthcare resources amid growing shortage of neurologists in the United States. CNS Spectr. 23, 248–250 (2018).
6. Wang, H. et al. Burden of cardiovascular disease among the Western Pacific region and its association with human resources for health, 1990–2021: a systematic analysis of the Global Burden of Disease Study 2021. The Lancet Regional Health - Western Pacific 51, 101195 (2024).
7. Nasreldein, A. et al. Pre- and in-hospital delays in the use of thrombolytic therapy for patients with acute ischemic stroke in rural and urban Egypt. Front. Neurol. 13, 1070523 (2023).
8. Feigin, V. L. et al. Global burden of stroke and risk factors in 188 countries, during 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013. The Lancet Neurology 15, 913–924 (2016).
9. World Stroke Organization. Global Declaration on Stroke: Commitments for Facing Stroke. (2023). https://www.world-stroke.org/news-and-blog/news/global-declaration-on-stroke-commitments-for-facing-stroke.
10. Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat Med 31, 932–942 (2025).
11. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA 330, 78 (2023).
12. Cabral, S. et al. Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians. JAMA Intern Med 184, 581 (2024).
13. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
14. McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
15. Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open 7, e2440969 (2024).
16. Goh, E. et al. Publisher Correction: GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med 31, 1370–1370 (2025).
17. Kottlors, J. et al. Large Language Models–Supported Thrombectomy Decision-Making in Acute Ischemic Stroke Based on Radiology Reports: Feasibility Qualitative Study. J Med Internet Res 27, e48328 (2025).
18. Li, T., Zhang, G., Do, Q. D., Yue, X. & Chen, W. Long-context LLMs Struggle with Long In-context Learning. Preprint at https://doi.org/10.48550/ARXIV.2404.02060 (2024).
19. Zhang, G. et al. Leveraging long context in retrieval augmented language models for medical question answering. npj Digit. Med. 8, (2025).
20. Wang, C. et al. Patient Triage and Guidance in Emergency Departments Using Large Language Models: Multimetric Study. J Med Internet Res 27, e71613 (2025).
21. Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat Med 31, 77–86 (2025).
22. Zhong, W. et al. Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study. J Med Internet Res 27, e69929–e69929 (2025).
23. Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med 30, 2886–2896 (2024).
24. Habib, A. R., Lin, A. L. & Grant, R. W. The Epic Sepsis Model Falls Short—The Importance of External Validation. JAMA Intern Med 181, 1040 (2021).
25. Wong, A. et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Intern Med 181, 1065 (2021).
26. Owens, D. et al. Accuracy of Large Language Models to Identify Stroke Subtypes Within Unstructured Electronic Health Record Data. Stroke (2025) doi:10.1161/strokeaha.125.051993.
27. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 30, 1134–1142 (2024).
28. Feigin, V. L. et al. World Stroke Organization: Global Stroke Fact Sheet 2025. International Journal of Stroke 20, 132–144 (2025).
29. Lee, J. T. et al. Evaluation of performance of generative large language models for stroke care. npj Digit. Med. 8, 481 (2025).
30. Beutel, G., Geerits, E. & Kielstein, J. T. Artificial hallucination: GPT on LSD? Crit Care 27, (2023).
31. Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024).
32. Williams, C. Y. K. et al. Evaluating large language models for drafting emergency department encounter summaries. PLOS Digit Health 4, e0000899 (2025).
33. Wang, B. et al. Baichuan-m1: Pushing the medical capability of large language models. arXiv preprint arXiv:2502.12671 (2025).
34. Agarwal, S. et al. gpt-oss-120b & gpt-oss-20b Model Card. arXiv preprint arXiv:2508.10925 (2025).
35. Yang, A. et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115 (2024).
36. Guo, D. et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
37. Hurst, A. et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024).
38. Hedeker, D. A mixed-effects multinomial logistic regression model. Statistics in Medicine 22, 1433–1446 (2003).

Supplementary Files

Supplementarymaterials.docx
Figure 1. Study framework. LLMs augmented with the hybrid-reasoner framework were benchmarked against standalone LLMs and quantitatively validated across retrospective, prospective, multi-center, and PubMed datasets. Safety analyses encompassed harmful content, hallucinations, and numerical robustness, while clinical validation assessed human–AI interactions by comparing physicians’ decisions across experience levels, specialties, and geographic locations.

Figure 2. Interfaces and performance of standalone vs hybrid-reasoner framework-augmented LLMs. (a–b), Interaction interface and workflow: constrained inputs use <think>, <treatment>, and <diagnosis> tags. (c–h), Accuracy of AIS treatment recommendation for six LLMs (Baichuan-M1-14B, GPT-OSS-20B, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, GPT-OSS-120B, DeepSeek-R1-671B) across Groups A–D; paired bars compare standalone with hybrid-reasoner, showing consistent gains. AIS, acute ischemic stroke; TOAST, Trial of ORG 10172 in Acute Stroke Treatment.

Figure 3. F1 score comparison of LLMs in AIS treatment recommendation (a) and treatment recommendation flows (b). (a), F1 scores are shown for AIS treatment recommendation across four evaluation groups (A–D) and six LLMs of increasing scale (Baichuan-M1-14B, GPT-OSS-20B, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, GPT-OSS-120B, and DeepSeek-R1-671B). Within each LLM, paired dots represent standalone performance (orange) and hybrid-reasoner framework-augmented performance (blue), with error bars indicating confidence intervals. As a metric balancing precision and recall, the F1 score highlights the enhanced diagnostic reliability of the hybrid-reasoner framework. (b), Sankey diagram comparing treatment recommendation flows between standalone and hybrid-reasoner framework-augmented LLMs, showing that framework use shifted decisions toward reperfusion therapies, improving concordance with guideline-recommended and clinically chosen treatments. Asterisks denote within-group differences between conditions (*P<0.05; **P<0.01; ***P<0.001; two-sided paired tests).

Figure 4. Case examples and safety profile of LLM outputs. (a–b), Representative cases. (a), Correct recommendation with faithful reasoning and no hallucination/omission. (b), Incorrect recommendation: relevant details were identified but hallucinated content led to an erroneous conclusion. (c–e), Safety metrics across six LLMs. (c), Harmfulness ratings on a three-point scale (1 = not harmful; 3 = highly harmful) shown as a concentric doughnut with overlaid points. (d), Instruction-following compliance shown as a bar plot of adherence proportions. (e), Incidences of hallucination and omission shown as line plots. Collectively, panels c–e indicate generally high instruction adherence alongside acceptable hallucination/omission rates.

Figure 5. Human–AI evaluation across physician groups. (a), Treatment-recommendation accuracy without AI (pink) and with AI (blue); bars show means with 95% confidence intervals (CIs). (b), Individual physicians’ treatment F1 scores; lines connect the same physician from “without AI” to “with AI”. (c), TOAST classification accuracy by physician group, as in panel (a). (d), Individual physicians’ TOAST F1 scores, as in panel (b). (e), Perceived effectiveness of AI-assisted reasoning on a 5-point Likert scale (5 = highly useful/transformative; 1 = harmful/ineffective), stratified by physician group. (f), Radar plots summarizing accuracy profiles for treatment recommendation (left) and TOAST classification (right), with and without AI. Physician groups: junior specialist, junior non-specialist, senior specialist, senior non-specialist, expert specialist.
ischemic stroke","fulltext":[{"header":"Introduction","content":"\u003cp\u003eStroke remains the second leading cause of both disability and death worldwide, with its greatest burden borne by low- and middle-income countries\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Acute ischemic stroke (AIS) requires ultra-rapid, precise decision-making within narrow therapeutic windows to optimize outcomes, yet stroke care resources remain profoundly uneven, reinforcing inequities in access to timely treatment. Global disparities in expertise and diagnostic efficiency lead to delays, misdiagnoses, and suboptimal care gaps driven by socioeconomic inequality, geographic barriers, and weak health systems in developing regions, while even high-income countries face workforce shortages that restrict timely access to care\u003csup\u003e\u003cspan additionalcitationids=\"CR4 CR5\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. At the same time, emergency departments worldwide are increasingly overwhelmed\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e, placing physicians under intense pressure to integrate multimodal clinical and imaging information with therapeutic indications and contraindications. This process is inherently prone to error and inefficiency, ultimately worsening outcomes\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. These systemic challenges not only compromise individual outcomes but also reinforce inequities in stroke care, making equitable access a pressing global priority. 
As underscored by the World Stroke Organization Global Declaration on Stroke, which calls for equitable, evidence-based stroke systems worldwide, efforts to expand manpower and training have so far yielded only modest and uneven progress\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eLarge language models (LLMs) integrated into clinical routines hold the potential to transform doctor\u0026ndash;patient interactions\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e, offering a compelling solution to these critical challenges. LLMs have demonstrated strong capabilities in diagnostic reasoning, differential generation, and information synthesis\u003csup\u003e\u003cspan additionalcitationids=\"CR12 CR13 CR14\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. However, the critical question remains: Can LLMs reliably perform in the high-stakes, time-pressured reality of emergency stroke care, thereby advancing equitable access and real-world clinical translation? Current evidence is insufficient. Most studies remain benchmark- or theory-driven\u0026mdash;focused on tasks such as MedQA(United States Medical Licensing Examination) or other benchmark datasets that fail to reflect real-world, guideline-based clinical workflows\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. 
This gap leaves LLM research promising in principle but unproven in practice, underscoring the urgent need for systematic, real-world evaluation of LLMs in AIS care.\u003c/p\u003e\u003cp\u003eOur framework augments LLMs with structured clinical reasoning to enhance guideline-concordant therapy selection, support accurate decision-making in time-critical emergency AIS care, and mitigate the effects of uneven resources and expertise that contribute to disparities in stroke outcomes. Our contribution is to move beyond technical benchmarks by presenting the first multicenter, multisource evaluation of LLMs on real-world AIS cases, strengthened through a hybrid-reasoner framework. Importantly, we also design a human\u0026ndash;AI interaction experiment across physicians of varying seniority and specialties, demonstrating that LLM assistance narrows expertise gaps and enables less experienced clinicians to deliver care that is closer to specialist-level decision-making. Collectively, these advances mark an important step toward safe, lightweight, and equitable deployment of LLMs in AIS care, establishing a foundation for their integration into stroke center networks, telemedicine pathways, and underserved health systems.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eStudy design\u003c/h2\u003e\u003cp\u003eThe overall study design and data sources are summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. 
This schematic highlights the integration of multicenter retrospective and prospective real-world clinical cases with PubMed case reports, the application of the hybrid-reasoner framework, and the evaluation of both model performance and human\u0026ndash;AI interaction.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003ePatient Characteristics\u003c/h3\u003e\n\u003cp\u003eWe analyzed 2,081 clinical cases (mean age, 68.0\u0026thinsp;\u0026plusmn;\u0026thinsp;13.6 years; 761 women) collected retrospectively and prospectively between January 2018 and May 2025, together with 144 PubMed case reports identified from January 2024 to January 2025 (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). At Center A, 1,228 cases were screened retrospectively and 1,055 were included (Group A), whereas at Center B, 938 were screened and 721 were included (Group B). In addition, 213 prospective cases were screened at Center A between February and May 2025, of which 161 were included (Group D). Of 327 PubMed case reports retrieved, 144 met the inclusion criteria. 
Detailed inclusion and exclusion criteria are shown in \u003cb\u003eSupplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eCharacteristics of enrolled patients.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eVariables\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGroup A\u003c/p\u003e\u003cp\u003e(Center A)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eGroup B\u003c/p\u003e\u003cp\u003e(Center B)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGroup C\u003c/p\u003e\u003cp\u003e(PubMed cases)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eGroup D\u003c/p\u003e\u003cp\u003e(Center A)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eOverall\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCase 
(n)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1055\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e721\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e144\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e161\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e2081\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAge (years\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e69.86\u0026thinsp;\u0026plusmn;\u0026thinsp;13.26\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e66.55\u0026thinsp;\u0026plusmn;\u0026thinsp;11.76\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e56.46\u0026thinsp;\u0026plusmn;\u0026thinsp;18.73\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e72.32\u0026thinsp;\u0026plusmn;\u0026thinsp;12.22\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e67.99\u0026thinsp;\u0026plusmn;\u0026thinsp;13.62\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMale/Female (n, %)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e665/390 (63.03/36.97%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e485/236 (67.27/32.73%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e71/71\u003csup\u003e#\u003c/sup\u003e\u003c/p\u003e\u003cp\u003e(50/50%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e97/64 (60.25/39.75%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1318/761 
(63.40/36.60%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eDisease categories (n, %)\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAIS\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e998 (94.60%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e610 (84.60%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e127 (89.44%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e159 (98.76%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1894 (91.10%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCerebral hemorrhage\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e26 (2.46%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e64 (8.88%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1 (0.70%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1 (0.62%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e92 (4.43%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEpilepsy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e8 
(0.76%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e18 (2.50%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1 (0.65%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e27 (1.18%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eArterial aneurysm\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e4 (0.38%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e3 (0.42%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1 (0.62%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e8 (0.38%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTIA\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e9 (0.85%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e22 (3.05%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4 (2.82%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e35 (1.68%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOther non-AIS diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e10 (0.95%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4 (0.55%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e11 
(7.75%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e25 (1.20%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eAIS treatment (n, %)\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eThrombolysis\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e135 (13.54%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e135 
(22.13%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e17 (13.39%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e20 (12.58%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e307 (16.16%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEndovascular thrombectomy*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e297 (29.79%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e125 (20.49%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e44 (34.56%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e27 (16.98%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e493 (26.04%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStandard medical management\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e565 (56.67%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e350 (57.38%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e66 (51.97%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e112 (70.44%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1093 (57.74%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTOAST classification of AIS (n, %)\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLarge artery atherosclerosis\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e760 (76.23%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e494 (80.98%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e41 (32.80%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e99 (62.26%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1394 (73.72%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCardioembolism\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e98 (9.83%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e40 (6.56%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e28 (22.40%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e21 (13.21%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e187 (9.89%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSmall vessel occlusion\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e95 (9.53%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e37 (6.07%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6 (4.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e37 (23.27%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e175 
(9.25%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOther causes\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e26 (2.61%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e15 (2.46%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e45 (36.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1 (0.63%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e87 (4.60%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCryptogenic\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e18 (1.81%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e24 (3.93%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5 (4.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1 (0.63%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e48 (2.54%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"6\"\u003eThis table summarizes the characteristics of patients across study groups. Group A and Group B represent retrospective datasets from Center A and Center B, respectively; Group C includes case reports extracted from PubMed; and Group D represents prospective cases enrolled at Center A. The \u0026ldquo;Overall\u0026rdquo; column presents aggregated data across all groups. 
Variables include demographic and clinical characteristics collected at baseline.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"6\"\u003e* Including bridging therapy (thrombolysis\u0026thinsp;+\u0026thinsp;endovascular thrombectomy).\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"6\"\u003e\u003csup\u003e#\u003c/sup\u003e Gender was not reported in two cases.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"6\"\u003e\u003cem\u003eAIS, Acute Ischemic Stroke; TOAST, Trial of Org 10172 in Acute Stroke Treatment; TIA, Transient Ischemic Attack\u003c/em\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\n\u003ch3\u003eBaseline disparities and framework-induced improvements\u003c/h3\u003e\n\u003cp\u003ePerformance varied markedly by model scale. Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea,b and \u003cb\u003eSupplementary Figure S2\u003c/b\u003e illustrate the interactive interfaces for the standalone LLM and the hybrid-reasoner augmented LLM. Standalone larger-scale LLMs consistently outperformed smaller ones in both treatment recommendation and TOAST classification (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e). For example, GPT-OSS-120B achieved 0.737 accuracy for treatment recommendation, compared with 0.546 for Baichuan-M1-14B and 0.433 for GPT-OSS-20B (all adjusted P\u0026thinsp;\u0026lt;\u0026thinsp;0.0001). Similar gaps were observed for TOAST classification, with DeepSeek-R1 surpassing all smaller models. 
These results establish model scale as a key determinant of baseline performance.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAugmentation with the hybrid-reasoner framework substantially improved outcomes across all models (Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec-h \u003cb\u003eand\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea; Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e; \u003cb\u003eSupplementary Figure S3\u003c/b\u003e), with an average improvement of 18.9% compared with standalone LLMs. In Group A, Baichuan-M1-14B accuracy increased by 25.8% (F1\u0026thinsp;+\u0026thinsp;15.1%), while DeepSeek-R1-671B improved more modestly (+\u0026thinsp;23.3%/+12.7%). For TOAST classification, GPT-OSS-20B accuracy rose by 39.0% (F1\u0026thinsp;+\u0026thinsp;4.9%), compared with only 2.5% (F1 \u0026minus;\u0026thinsp;2.1%) for GPT-OSS-120B. Notably, GPT-OSS-20B exhibited unstable outputs with fluctuating gains. 
Collectively, these findings show that while scale remains critical for baseline accuracy, the hybrid-reasoner framework reduces scale-related disparities, enabling smaller models to approach the clinical utility of their larger counterparts.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAccuracy and F1 scores of LLMs in treatment recommendation and TOAST classification.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"9\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModels\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eStandalone LLM\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHybrid-reasoner framework augmented 
LLM\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eP value\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 Score\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eStandalone LLM\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003eHybrid-reasoner framework augmented LLM\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c9\"\u003e\u003cp\u003eP value\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eBaichuan-M1-14B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.570\u003c/p\u003e\u003cp\u003e(0.547, 0.591)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.695\u003c/p\u003e\u003cp\u003e(0.675, 0.715)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.726\u003c/p\u003e\u003cp\u003e(0.707, 0.743)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.820\u003c/p\u003e\u003cp\u003e(0.806, 0.834)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.623\u003c/p\u003e\u003cp\u003e(0.602, 0.645)\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.657\u003c/p\u003e\u003cp\u003e(0.637, 0.677)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.365\u003c/p\u003e\u003cp\u003e(0.334, 0.396)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.294\u003c/p\u003e\u003cp\u003e(0.273, 0.362)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eGPT-OSS-20B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.464\u003c/p\u003e\u003cp\u003e(0.441, 0.486)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.541\u003c/p\u003e\u003cp\u003e(0.519, 0.56)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.634\u003c/p\u003e\u003cp\u003e(0.612, 0.654)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.702\u003c/p\u003e\u003cp\u003e(0.684, 0.718)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.530\u003c/p\u003e\u003cp\u003e(0.507, 0.553)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.623\u003c/p\u003e\u003cp\u003e(0.602, 0.645)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.261\u003c/p\u003e\u003cp\u003e(0.24, 0.282)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.265\u003c/p\u003e\u003cp\u003e(0.244, 0.329)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eQwen2.5-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.593\u003c/p\u003e\u003cp\u003e(0.574, 0.615)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.693\u003c/p\u003e\u003cp\u003e(0.674, 0.713)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.745\u003c/p\u003e\u003cp\u003e(0.729, 0.762)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.819\u003c/p\u003e\u003cp\u003e(0.805, 0.832)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.645\u003c/p\u003e\u003cp\u003e(0.622, 0.666)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.706\u003c/p\u003e\u003cp\u003e(0.686, 0.726)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.316\u003c/p\u003e\u003cp\u003e(0.288, 0.343)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.358\u003c/p\u003e\u003cp\u003e(0.328, 0.386)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eDeepSeek-R1-Distill-Qwen-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.599\u003c/p\u003e\u003cp\u003e(0.577, 0.62)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.734\u003c/p\u003e\u003cp\u003e(0.716, 0.754)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.749\u003c/p\u003e\u003cp\u003e(0.732, 
0.765)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.847\u003c/p\u003e\u003cp\u003e(0.834, 0.86)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.643\u003c/p\u003e\u003cp\u003e(0.622, 0.663)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.689\u003c/p\u003e\u003cp\u003e(0.669, 0.708)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.689\u003c/p\u003e\u003cp\u003e(0.669, 0.708)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.360\u003c/p\u003e\u003cp\u003e(0.339, 0.38)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eGPT-OSS-120B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.724\u003c/p\u003e\u003cp\u003e(0.706, 0.744)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.825\u003c/p\u003e\u003cp\u003e(0.808, 0.84)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003eF1 score of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.84\u003c/p\u003e\u003cp\u003e(0.828, 0.853)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.904\u003c/p\u003e\u003cp\u003e(0.894, 0.913)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.690\u003c/p\u003e\u003cp\u003e(0.668, 0.71)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.711\u003c/p\u003e\u003cp\u003e(0.69, 0.732)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.444\u003c/p\u003e\u003cp\u003e(0.412, 0.475)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.436\u003c/p\u003e\u003cp\u003e(0.407, 0.462)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eDeepSeek-R1-671B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.685\u003c/p\u003e\u003cp\u003e(0.667, 0.704)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.830\u003c/p\u003e\u003cp\u003e(0.813, 
0.847)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.813\u003c/p\u003e\u003cp\u003e(0.8, 0.827)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.907\u003c/p\u003e\u003cp\u003e(0.897, 0.917)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.680\u003c/p\u003e\u003cp\u003e(0.659, 0.701)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.758\u003c/p\u003e\u003cp\u003e(0.738, 0.777)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 score of TOAST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.460\u003c/p\u003e\u003cp\u003e(0.429, 0.492)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e0.494\u003c/p\u003e\u003cp\u003e(0.465, 0.522)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"9\"\u003eThe table reports the LLMs\u0026rsquo; accuracy and F1 scores for standalone and framework-augmented LLMs, with corresponding P-values for paired comparisons. 
Values are presented as point estimates, with 95% confidence intervals shown in parentheses.\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\n\u003ch3\u003eFramework-driven gains and cross-group validation\u003c/h3\u003e\n\u003cp\u003eGeneralized linear mixed model analysis in Group A confirmed that the hybrid-reasoner framework was independently associated with higher accuracy in AIS treatment recommendations (odds ratio\u0026thinsp;=\u0026thinsp;1.72, P\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and increased selection of appropriate reperfusion strategies, thereby improving guideline concordance (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb).\u003c/p\u003e\u003cp\u003eValidation in Groups B, C, and D demonstrated consistent performance gains across all groups (Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec-h \u003cb\u003eand\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea; \u003cb\u003eSupplementary Tables S2\u0026ndash;S4\u003c/b\u003e). The strongest results were observed in Group D, where framework-augmented DeepSeek-R1-671B achieved the highest accuracy in both treatment recommendation and TOAST classification. In Group C, GPT-4o gained accuracy in treatment recommendation (+\u0026thinsp;14.9%, 0.750 vs. 0.653) but showed a decline in TOAST classification (\u0026minus;\u0026thinsp;6.7%, 0.486 vs. 0.521), reflecting the predominance of \u0026ldquo;other causes\u0026rdquo; and cryptogenic subtypes. Collectively, these findings confirm that the framework provides reliable gains across heterogeneous cohorts.\u003c/p\u003e\n\u003ch3\u003eQualifying LLM Reasoning: Ensuring Safe Clinical Deployment\u003c/h3\u003e\n\u003cp\u003eTo look beyond accuracy, we visualized step-by-step reasoning traces for typical success and failure cases (\u003cb\u003eFig. 
4a,b\u003c/b\u003e). Because accuracy alone provides an incomplete view of clinical applicability, we also assessed output safety: the framework yielded higher safety scores (4.36 vs. 4.02) and lower hallucination (3.1% vs. 4.7%) and omission rates (10.3% vs. 16.6%) than the standalone model (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec-e; \u003cb\u003eSupplementary Table S5\u003c/b\u003e). These findings demonstrate that structured reasoning improves reliability and mitigates the risk of unsafe content entering clinical workflows.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eLLM Support Benefits Less Experienced Physicians\u003c/h2\u003e\u003cp\u003eIntegration of LLM support substantially enhanced physician performance, with the most pronounced gains observed in less experienced doctors (0.667 to 0.833 in junior specialists; 0.600 to 0.846 in non-specialists). Physicians also rated the assistance positively (mean score 3.849/5). Improvements in senior and specialist physicians were more modest, reflecting a ceiling effect at higher baseline performance levels (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study marks a critical step toward the clinical translation of LLMs in AIS care. We present the first multicenter, multisource, cross-scenario evaluation spanning retrospective and prospective real-world clinical cases as well as literature-derived cases, showing that LLMs augmented with our hybrid-reasoner framework improve guideline-concordant therapy selection in emergency AIS care and help reduce disparities in treatment delivery. 
Framework-augmented LLMs produced more instruction-adherent, lower-error reasoning outputs and, in human\u0026ndash;AI interaction experiments, narrowed expertise gaps, with junior and non-specialist physicians benefiting most. Collectively, these findings highlight the potential of LLM-integrated systems to promote guideline-based practice, bridge expertise gaps, support stroke centers and telemedicine networks, and extend equitable decision support to low-resource settings. They also mark a shift from performance optimization to real-world application, supporting the practical and equitable clinical adoption of LLMs.\u003c/p\u003e\u003cp\u003eGlobal inequities in stroke care remain profound, with low- and middle-income countries constrained by limited resources and specialist expertise, while high-income countries face overcrowded emergency services and workforce pressures. Existing evaluations of LLMs have been largely confined to simplified benchmark tasks that fail to capture the real-world complexity of disease management\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. To address this gap, we applied LLMs to real-world clinical and literature-derived cases, thereby simulating authentic clinical scenarios and reflecting heterogeneous contexts. Performance declined in complex settings, partly due to limitations in handling long inputs\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. 
To address this challenge, our hybrid-reasoner framework incorporated a purpose-built summarization agent that extracted salient features from clinical narratives, improving accuracy across diverse scenarios. Augmented LLMs achieved accuracies ranging from 0.541 to 0.830 across model scales, a level comparable to that reported for Q\u0026amp;A-style or simulated patient scenarios\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e and consistently higher than that of standalone LLMs.\u003c/p\u003e\u003cp\u003eThis study marks an initial step toward the clinical translation of LLMs in AIS, aiming to reduce inequitable AIS care delivery. Prior work suggests that chain-of-thought (CoT) prompting and fixed-answer formats can partially mitigate these deficits\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Our hybrid-reasoner framework introduced a workflow-oriented, guideline-concordant structure that improved accuracy and maintained consistency with clinical recommendations for feasible clinical deployment. Unlike approaches that depend on ever-larger proprietary models\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e, our framework demonstrates that carefully designed, lightweight optimization can deliver substantial gains, lowering barriers to adoption and aligning with the goal of equitable access.\u003c/p\u003e\u003cp\u003eOur evaluation offers one of the most comprehensive simulations of clinical practice to date, establishing a foundation for the deployment of LLMs in AIS workflows. 
While the promise is substantial, deployment at scale carries risks of unintended harmful consequences\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. By integrating multicenter, multi-tier hospitals, retrospective and prospective real-world cases, literature-derived high-difficulty cases, and human\u0026ndash;AI interactions across physicians of different levels and specialties, our evaluation captured the heterogeneity and complexity of AIS care. Notably, LLMs most benefited less-experienced physicians, narrowing expertise gaps across experience, specialty, and geography, and aligning with prior reports of near expert-level performance\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e,\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. In alignment with the World Stroke Organization\u0026rsquo;s Global Stroke Declaration, which underscores that quality stroke care should be universal\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e,\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e, our results advance the case for AI-enabled strategies to promote equity in global stroke systems.\u003c/p\u003e\u003cp\u003eThis study systematically assessed safety, a critical prerequisite for clinical deployment. Because LLMs predict the next token without verifying evidence, they remain prone to hallucinations that can erode trust and generate harmful or misleading recommendations\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e,\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. 
In our evaluation, hallucinations and omissions were not eliminated but occurred at relatively low frequencies, with the hybrid-reasoner DeepSeek-R1-671B achieving a hallucination rate of 10.9%, an omission rate of 14.7%, and an overall clinical safety score of 4.36/5. These findings indicate that while safety concerns remain, the error rates are within a range that may be acceptable for decision-support use, supporting the feasibility of cautious clinical integration\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. Hybrid-reasoner augmented LLMs further demonstrated stronger instruction adherence, lower hallucination and omission rates, and more reliable structured outputs, thereby addressing a key barrier to real-world deployment.\u003c/p\u003e\u003cp\u003eThis study has several limitations. Most importantly, the evaluation did not involve real-time deployment in emergency-room workflows. Although diverse clinical scenarios were simulated, the absence of bedside testing limits assessment of usability, clinician trust, and patient outcomes in live practice. In addition, findings are restricted to AIS patients treated in accordance with guideline-based care, and generalizability to other emergency conditions requires further validation. Finally, rapid iteration of commercial LLMs and restricted access to proprietary systems may affect reproducibility and long-term stability, underscoring the need for open, continuously benchmarked platforms.\u003c/p\u003e\u003cp\u003eThis work represents a milestone in advancing the clinical translation of LLMs for AIS, providing empirical evidence for their safe, lightweight, and feasible deployment. Beyond stroke, it establishes a reproducible paradigm for real-world evaluation of medical foundation models. 
Future efforts should focus on real-world prospective validation, expansion to other acute care domains, and development of open-source, lightweight optimizations to lower adoption barriers. By laying the groundwork for sustainable clinical AI, our study contributes to the long-term goal of reducing global disparities in stroke care and building next-generation healthcare infrastructures.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eEthics statement\u003c/h2\u003e\u003cp\u003e This study was conducted in accordance with relevant ethical guidelines and regulations. Approval was obtained from the Institutional Review Board of the Medical Faculty of Ethics Committee of Shanghai Sixth People\u0026rsquo;s Hospital (approval no. 2024-KY-203), and the study was registered in the Chinese Clinical Trial Registry (ChiCTR2400092800, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.chictr.org.cn/\u003c/span\u003e\u003cspan address=\"http://www.chictr.org.cn/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) on November 22, 2024. Informed consent was obtained from all participants. This study was conducted in accordance with the Declaration of Helsinki.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eData collection\u003c/h2\u003e\u003cp\u003e We retrospectively collected clinical cases from two tertiary care centers: 1,228 cases from Center A (Group A, January 2018\u0026ndash;January 2025, tertiary grade A hospital) and 938 cases from Center B (Group B, May 2018\u0026ndash;March 2025, tertiary grade B hospital). All cases were de-identified encounters of patients diagnosed with acute cerebrovascular disease. In addition, 327 stroke case reports were retrieved from PubMed between January 2024 and January 2025 (Group C). 
For prospective validation, 213 patients were consecutively enrolled at Center A between February and May 2025 (Group D).\u003c/p\u003e\u003cp\u003eThe inclusion criteria were: (1) patients aged\u0026thinsp;\u0026ge;\u0026thinsp;18 years who were clinically suspected of having acute cerebrovascular disease; and (2) evaluation and management undertaken in participating hospitals, with neuroimaging performed on admission. The exclusion criteria were: (1) refusal or discontinuation of treatment (e.g., due to financial constraints, perceived risks, or transfer to another facility), or inability to provide informed consent for standardized, guideline-concordant therapy; (2) incomplete clinical information (e.g., missing chief complaint or auxiliary examinations); (3) urgent conditions requiring intervention more immediately than stroke; (4) non-acute cerebrovascular admissions in which stroke was identified only during hospitalization; and (5) patients in the chronic phase of cerebrovascular disease. Guideline adherence of all patient treatments was assessed based on expert evaluation. Detailed inclusion and exclusion criteria for each group are provided in \u003cb\u003eSupplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003eExperimental setups\u003c/h2\u003e\u003cdiv id=\"Sec14\" class=\"Section3\"\u003e\u003ch2\u003ePatient Cases\u003c/h2\u003e\u003cp\u003eTo ensure patient privacy, all personally identifiable information was removed. 
Each case was formatted as a single paragraph containing all or a subset of the following elements: patient age and sex, chief complaint, current symptoms, medical history (including illnesses and medications), relevant family history, physical examination findings, laboratory test results, and imaging reports.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003eEvaluation of Standalone LLMs\u003c/h2\u003e\u003cp\u003eTo evaluate the capacity of LLMs to generate treatment recommendations for AIS and to classify stroke subtypes according to the TOAST system, each model was asked to provide a treatment recommendation and the corresponding TOAST classification for every case. Seven models were tested: Baichuan-M1-14B\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e, GPT-OSS-20B\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e, Qwen2.5-32B\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e, DeepSeek-R1-Distill-Qwen-32B\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e, GPT-OSS-120B\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e, DeepSeek-R1-671B\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e, and GPT-4o\u003csup\u003e37\u003c/sup\u003e (\u003cb\u003eSupplementary Table S6\u003c/b\u003e). Retrospective, real-world clinical cases from Group A were used for this assessment. Outputs were produced in free-text format without predefined options, reflecting the probabilistic nature of clinical reasoning. All LLMs were evaluated in single-turn interactions using their default parameter configurations, without additional manual tuning. 
Model inference was performed with the official vLLM framework, providing an optimized environment for efficient large-scale deployment. All models were executed on a cluster of eight NVIDIA H20-141GB GPUs using the official Docker release. For version control, vLLM \u003cb\u003ev0.8.4\u003c/b\u003e was applied to all models except GPT-OSS-20B and GPT-OSS-120B, which were deployed under \u003cb\u003ev0.10.1\u003c/b\u003e, thereby ensuring reproducibility and transparency across experiments.\u003c/p\u003e\u003cp\u003eTo ensure consistent and reproducible evaluation, an automated grader agent was employed to quantify accuracy across all LLMs. The grader agent, instantiated using DeepSeek-R1, operated in two sequential steps. First, it extracted and categorized each output into one of three formats: (1) single treatment recommendation/diagnosis, (2) multiple decisions, or (3) no decision. Second, extracted responses were compared against ground-truth references, with treatment mapped to four predefined categories: \u003cem\u003eThrombolysis, Mechanical Thrombectomy, Standard Medical Therapy\u003c/em\u003e, and \u003cem\u003eNon-Acute Ischemic Stroke or Non-Stroke Conditions\u003c/em\u003e. This standardized pipeline minimized subjectivity and ensured consistency across experiments.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003eHybrid-reasoner Framework\u003c/h2\u003e\u003cp\u003eIn addition to standalone evaluations, we implemented a hybrid-reasoner framework to examine whether structured reasoning and constrained outputs could enhance model performance in AIS-specific tasks. 
The framework integrates three components: (1) a workflow-oriented summary agent that extracts disease-relevant evidence from lengthy clinical narratives, (2) a guideline-concordant reasoning-path CoT module that enforces structured diagnostic steps, and (3) a clinically inspired multiple-choice constraint mechanism that standardizes outputs within evidence-based decision boundaries.\u003c/p\u003e\u003cp\u003e\u003cb\u003eSummarization agent.\u003c/b\u003e To mitigate performance degradation caused by lengthy case inputs and to ensure the extraction of salient details, a workflow-oriented summary agent was implemented. Using DeepSeek-R1 with few-shot prompting, this agent generated structured case summaries for downstream reasoning. Example prompts are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea,b, and further details are provided in \u003cb\u003eSupplementary Figure S2.\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eReasoning-path CoT.\u003c/b\u003e The guideline-concordant reasoning-path CoT module, modeled on a physician\u0026rsquo;s reasoning, was designed as a concise sequence of four critical diagnostic steps, mirroring clinical decision trees. It was developed collaboratively with neurologists, interventional radiologists, and emergency physicians by restructuring existing clinical guidelines.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMultiple-choice constraint.\u003c/b\u003e For clinically inspired final decision-making, both standalone and framework-augmented models were evaluated using four predefined categories as answer options: \u003cem\u003eThrombolysis, Mechanical Thrombectomy, Standard Medical Therapy\u003c/em\u003e, and \u003cem\u003eNon-Acute Ischemic Stroke or Non-Stroke Conditions\u003c/em\u003e. 
This constraint not only improved consistency but also aligned model outputs with clinically interpretable categories.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003eComprehensive Multicenter, Multisource, Cross-scenario Evaluation\u003c/h2\u003e\u003cdiv id=\"Sec18\" class=\"Section3\"\u003e\u003ch2\u003eValidation using PubMed case reports and external cohorts\u003c/h2\u003e\u003cp\u003eTo clinically validate the framework, we analyzed real-world cases from Groups B and D using six open-source LLMs. To further test generalizability across model families, we also evaluated six open-source LLMs and an additional closed-source LLM in Group C (PubMed case reports). The PubMed case-report dataset is publicly accessible. The objective was to assess the generalizability of the hybrid-reasoner framework.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003eProspective clinical validation with multi-level and multi-specialty physicians\u003c/h2\u003e\u003cp\u003eTo evaluate the clinical impact of LLM assistance, we conducted a prospective study at Center A. Twelve physicians with varying levels of AIS expertise were recruited from four regions (Hunan, Guangdong, Jilin, and Shanghai). Participants included 2 junior oncologists and 3 interventional neurologists (\u0026lt;\u0026thinsp;3 years of experience); 2 senior interventional neurologists and 3 general practitioners (~\u0026thinsp;5 years of experience); and 2 expert interventional neurologists (\u0026gt;\u0026thinsp;10 years of experience).\u003c/p\u003e\u003cp\u003eThe study was conducted prospectively between February and May 2025 (approval no. 2024-KY-203). Consecutive patients who met the inclusion criteria and none of the exclusion criteria were randomly assigned at the time of enrollment to either LLM-assisted analysis (With AI) or standard review without LLM assistance (Without AI). 
In the AI-assisted arm, the best-performing model generated treatment recommendations, TOAST classifications, and a structured reasoning process, which were combined with each patient\u0026rsquo;s admission history. These case packages, with or without AI reasoner information, were then sequentially and randomly allocated to physicians of different seniority levels (junior, senior, and expert) and specialties (stroke specialists and non-specialists) for independent interpretation. Physician decisions were subsequently compared with the treatments patients ultimately received.\u003c/p\u003e\u003cp\u003eThe study design thus enabled a paired comparison of physician performance with and without AI assistance, allowing stratified analysis across different levels of clinical training and specialty background. Sample size estimation was based on the proportion of AIS patients and average weekly visits at the study site.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\u003ch2\u003eOutcome Assessment\u003c/h2\u003e\u003cp\u003e Ground truth was defined as the treatment recommendation and corresponding TOAST classification documented in the clinical record, with independent verification for guideline concordance. The \u003cb\u003eprimary outcome\u003c/b\u003e was the overall therapeutic and TOAST diagnostic accuracy of LLMs. 
\u003cb\u003eSecondary outcomes\u003c/b\u003e encompassed safety-related assessments, including instruction adherence, structured clinical safety evaluation (scored on a 1\u0026ndash;5 Likert scale, anchored as follows: 1\u0026thinsp;=\u0026thinsp;harmful/ineffective, 2\u0026thinsp;=\u0026thinsp;minimal value, 3\u0026thinsp;=\u0026thinsp;limited value [occasionally useful], 4\u0026thinsp;=\u0026thinsp;useful [improves decision-making or efficiency], 5\u0026thinsp;=\u0026thinsp;highly useful/transformative [consistently improves decision quality and/or workflow efficiency]), and the incidence of omissions and hallucinations.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec21\" class=\"Section2\"\u003e\u003ch2\u003eStatistical Analysis\u003c/h2\u003e\u003cp\u003eAll analyses were performed in Python (v3.10). Statistical significance was set at two-sided P\u0026thinsp;\u0026lt;\u0026thinsp;0.05. Binary and continuous variables were summarized descriptively. Accuracy differences between standalone and framework-augmented LLMs were tested with McNemar\u0026rsquo;s test, and F1 score differences were estimated by bootstrap resampling (1,000 iterations) with 95% confidence intervals. Paired ordinal outcomes were compared using the Wilcoxon signed-rank test. To account for repeated measures, a generalized linear mixed-effects model with a binomial distribution and logit link was fitted to estimate the independent effect of framework augmentation on decision accuracy in the Group A dataset\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e. 
Event rates\u0026mdash;including omission and hallucination frequencies\u0026mdash;and differences between AI-assisted and non-assisted arms were analyzed using χ\u0026sup2; or Fisher\u0026rsquo;s exact tests, as appropriate.\u003c/p\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003ch2\u003eCompeting interests\u003c/h2\u003e\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003ch2\u003eCorrespondence and requests for materials\u003c/h2\u003e\u003cp\u003eCorrespondence and requests for materials should be addressed to Yuehua Li (email:
[email protected])\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e\u003cp\u003eThis work was supported by the National Natural Science Foundation of China (No. 8225024), Key R\u0026amp;D sub project of the Ministry of Science and Technology (No. 2023YFF1204804), Shanghai Pudong New Area Science and Technology Commission Project (No. PKJ2023-Y53), Shanghai Jiaotong University, Medicine and engineering interdisciplinary program (No. YG2024LC08), and Shanghai key discipline of medical imaging (No. 2017ZZ02005).\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eBicong Yan: study conception and design, study implementation, data collection, data screening, manuscript drafting and revision. Ruipeng Zhang: technical support, code development, program execution, protocol optimization, data analysis, manuscript revision. Yanfeng Fan: program execution, data collection, data screening and verification, data evaluation. Ying Li: data collection, data verification, data evaluation. Li Chen: data screening, data evaluation, protocol optimization and protocol assessment. Xinyu Song: statistical guidance, data evaluation. Yixiao Tang, Yifan Tu, Zhongzheng Cao: data collection and clinical evaluation. Li Shen: statistical guidance. Mengfei Wang: data screening, technical support for coding. Zhuo Li: data screening, technical support for coding. Yijia Xiong: data collection, data evaluation. Yuehua Li: project supervision, critical guidance, and funding acquisition (corresponding author).\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eThis study was conducted over an extended period and required substantial human and material resources. We thank all patients and their families for their participation, as well as the open-source community for making LLMs publicly available. 
We are grateful to cooperators for assistance with data collection and coding, and to the biostatistics experts for their valuable guidance in statistical methodology. We also acknowledge the strong support in clinical validation from Tao Wang, Hongmei Song, Daqian Zhang, Yingying Lu, Tonglei Fang, Xingxing Sun, Lu Fei, Yixiao Tang, Yifan Tu, Zhongzheng Cao, Fasheng Peng, Mengfan Yan, and Yuxiang Zhou.\u003c/p\u003e\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe raw data supporting the findings of this study are available from the corresponding author upon reasonable request. For the PubMed case cohort, the original data are not directly shared in this work; instead, we provide the references to the corresponding open-access publications in our released code repository, from which the source data can be obtained.\u003c/p\u003e\u003cdiv id=\"Sec23\" class=\"Section3\"\u003e\u003ch2\u003eCode availability\u003c/h2\u003e\u003cp\u003eAll code for this study is publicly available. The source code for model deployment, inference scripts, and trained model weights will be released at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://anonymous.4open.science/r/HR-LLM-Stroke\u003c/span\u003e\u003cspan address=\"https://anonymous.4open.science/r/HR-LLM-Stroke\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003cp\u003eThe following LLMs were evaluated: Baichuan-M1-14B, GPT-OSS-20B, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, GPT-OSS-120B, DeepSeek-R1-671B, and GPT-4o. 
All prompts used in this work are also included in \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://anonymous.4open.science/r/HR-LLM-Stroke\u003c/span\u003e\u003cspan address=\"https://anonymous.4open.science/r/HR-LLM-Stroke\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eVollset, S. E. \u003cem\u003eet al.\u003c/em\u003e Burden of disease scenarios for 204 countries and territories, 2022\u0026ndash;2050: a forecasting analysis for the Global Burden of Disease Study 2021. \u003cem\u003eThe Lancet\u003c/em\u003e 403, 2204\u0026ndash;2256 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFeigin, V. L. \u003cem\u003eet al.\u003c/em\u003e World Stroke Organization: Global Stroke Fact Sheet 2025. \u003cem\u003eInternational Journal of Stroke\u003c/em\u003e 20, 132\u0026ndash;144 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShen, Y.-C., Sarkar, N. \u0026amp; Hsia, R. Y. Structural Inequities for Historically Underserved Communities in the Adoption of Stroke Certification in the United States. \u003cem\u003eJAMA Neurol\u003c/em\u003e 79, 777 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePandian, J. D. \u003cem\u003eet al.\u003c/em\u003e Stroke systems of care in low-income and middle-income countries: challenges and opportunities. \u003cem\u003eThe Lancet\u003c/em\u003e 396, 1443\u0026ndash;1451 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAvasarala, J. \u0026amp; Wesley, K. Optimization of acute stroke care in the emergency department: a call for better utilization of healthcare resources amid growing shortage of neurologists in the United States. \u003cem\u003eCNS Spectr.\u003c/em\u003e 23, 248\u0026ndash;250 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, H. 
\u003cem\u003eet al.\u003c/em\u003e Burden of cardiovascular disease among the Western Pacific region and its association with human resources for health, 1990\u0026ndash;2021: a systematic analysis of the Global Burden of Disease Study 2021. \u003cem\u003eThe Lancet Regional Health - Western Pacific\u003c/em\u003e 51, 101195 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNasreldein, A. \u003cem\u003eet al.\u003c/em\u003e Pre- and in-hospital delays in the use of thrombolytic therapy for patients with acute ischemic stroke in rural and urban Egypt. \u003cem\u003eFront. Neurol.\u003c/em\u003e 13, 1070523 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFeigin, V. L. \u003cem\u003eet al.\u003c/em\u003e Global burden of stroke and risk factors in 188 countries, during 1990\u0026ndash;2013: a systematic analysis for the Global Burden of Disease Study 2013. \u003cem\u003eThe Lancet Neurology\u003c/em\u003e 15, 913\u0026ndash;924 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWorld Stroke Organization. \u003cem\u003eGlobal Declaration on Stroke: Commitments for Facing Stroke. (2023).\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.world-stroke.org/news-and-blog/news/global-declaration-on-stroke-commitments-for-facing-stroke\u003c/span\u003e\u003cspan address=\"https://www.world-stroke.org/news-and-blog/news/global-declaration-on-stroke-commitments-for-facing-stroke\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu, X. \u003cem\u003eet al.\u003c/em\u003e A generalist medical language model for disease diagnosis assistance. \u003cem\u003eNat Med\u003c/em\u003e 31, 932\u0026ndash;942 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKanjee, Z., Crowe, B. \u0026amp; Rodman, A. 
Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. \u003cem\u003eJAMA\u003c/em\u003e 330, 78 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCabral, S. \u003cem\u003eet al.\u003c/em\u003e Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians. \u003cem\u003eJAMA Intern Med\u003c/em\u003e 184, 581 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTu, T. \u003cem\u003eet al.\u003c/em\u003e Towards conversational diagnostic artificial intelligence. \u003cem\u003eNature\u003c/em\u003e 642, 442\u0026ndash;450 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMcDuff, D. \u003cem\u003eet al.\u003c/em\u003e Towards accurate differential diagnosis with large language models. \u003cem\u003eNature\u003c/em\u003e 642, 451\u0026ndash;457 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoh, E. \u003cem\u003eet al.\u003c/em\u003e Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. \u003cem\u003eJAMA Netw Open\u003c/em\u003e 7, e2440969 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoh, E. \u003cem\u003eet al.\u003c/em\u003e Publisher Correction: GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. \u003cem\u003eNat Med\u003c/em\u003e 31, 1370 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKottlors, J. \u003cem\u003eet al.\u003c/em\u003e Large Language Models\u0026ndash;Supported Thrombectomy Decision-Making in Acute Ischemic Stroke Based on Radiology Reports: Feasibility Qualitative Study. \u003cem\u003eJ Med Internet Res\u003c/em\u003e 27, e48328 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, T., Zhang, G., Do, Q. D., Yue, X. \u0026amp; Chen, W. Long-context LLMs Struggle with Long In-context Learning. 
Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/ARXIV.2404.02060\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2404.02060\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, G. \u003cem\u003eet al.\u003c/em\u003e Leveraging long context in retrieval augmented language models for medical question answering. \u003cem\u003enpj Digit. Med.\u003c/em\u003e 8, (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, C. \u003cem\u003eet al.\u003c/em\u003e Patient Triage and Guidance in Emergency Departments Using Large Language Models: Multimetric Study. \u003cem\u003eJ Med Internet Res\u003c/em\u003e 27, e71613 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJohri, S. \u003cem\u003eet al.\u003c/em\u003e An evaluation framework for clinical use of large language models in patient interaction tasks. \u003cem\u003eNat Med\u003c/em\u003e 31, 77\u0026ndash;86 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhong, W. \u003cem\u003eet al.\u003c/em\u003e Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China\u0026rsquo;s Rare Disease Catalog: Comparative Study. \u003cem\u003eJ Med Internet Res\u003c/em\u003e 27, e69929 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, J. \u003cem\u003eet al.\u003c/em\u003e Integrated image-based deep learning and language models for primary diabetes care. \u003cem\u003eNat Med\u003c/em\u003e 30, 2886\u0026ndash;2896 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHabib, A. R., Lin, A. L. \u0026amp; Grant, R. W. The Epic Sepsis Model Falls Short\u0026mdash;The Importance of External Validation. 
\u003cem\u003eJAMA Intern Med\u003c/em\u003e 181, 1040 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWong, A. \u003cem\u003eet al.\u003c/em\u003e External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. \u003cem\u003eJAMA Intern Med\u003c/em\u003e 181, 1065 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOwens, D. \u003cem\u003eet al.\u003c/em\u003e Accuracy of Large Language Models to Identify Stroke Subtypes Within Unstructured Electronic Health Record Data. \u003cem\u003eStroke\u003c/em\u003e (2025) doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1161/strokeaha.125.051993\u003c/span\u003e\u003cspan address=\"10.1161/strokeaha.125.051993\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVan Veen, D. \u003cem\u003eet al.\u003c/em\u003e Adapted large language models can outperform medical experts in clinical text summarization. \u003cem\u003eNat Med\u003c/em\u003e 30, 1134\u0026ndash;1142 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFeigin, V. L. \u003cem\u003eet al.\u003c/em\u003e World Stroke Organization: Global Stroke Fact Sheet 2025. \u003cem\u003eInternational Journal of Stroke\u003c/em\u003e 20, 132\u0026ndash;144 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLee, J. T. \u003cem\u003eet al.\u003c/em\u003e Evaluation of performance of generative large language models for stroke care. \u003cem\u003enpj Digit. Med.\u003c/em\u003e 8, 481 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBeutel, G., Geerits, E. \u0026amp; Kielstein, J. T. Artificial hallucination: GPT on LSD? \u003cem\u003eCrit Care\u003c/em\u003e 27, (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFarquhar, S., Kossen, J., Kuhn, L. \u0026amp; Gal, Y. 
Detecting hallucinations in large language models using semantic entropy. \u003cem\u003eNature\u003c/em\u003e 630, 625\u0026ndash;630 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWilliams, C. Y. K. \u003cem\u003eet al.\u003c/em\u003e Evaluating large language models for drafting emergency department encounter summaries. \u003cem\u003ePLOS Digit Health\u003c/em\u003e 4, e0000899 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, B. \u003cem\u003eet al.\u003c/em\u003e Baichuan-M1: Pushing the medical capability of large language models. \u003cem\u003earXiv preprint arXiv:2502.12671\u003c/em\u003e (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAgarwal, S. \u003cem\u003eet al.\u003c/em\u003e gpt-oss-120b \u0026amp; gpt-oss-20b Model Card. \u003cem\u003earXiv preprint arXiv:2508.10925\u003c/em\u003e (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYang, A. \u003cem\u003eet al.\u003c/em\u003e Qwen2.5 Technical Report. \u003cem\u003earXiv preprint arXiv:2412.15115\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGuo, D. \u003cem\u003eet al.\u003c/em\u003e DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. \u003cem\u003earXiv preprint arXiv:2501.12948\u003c/em\u003e (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHurst, A. \u003cem\u003eet al.\u003c/em\u003e GPT-4o system card. \u003cem\u003earXiv preprint arXiv:2410.21276\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHedeker, D. A mixed-effects multinomial logistic regression model. 
\u003cem\u003eStatistics in Medicine\u003c/em\u003e 22, 1433\u0026ndash;1446 (2003).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"acute ischemic stroke, large language model, decision support, clinical safety","lastPublishedDoi":"10.21203/rs.3.rs-7998391/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7998391/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eAcute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with outcomes critically dependent on timely and accurate treatment. Yet prognosis is undermined by uneven expertise, resource shortages, and diagnostic delays. Large language models (LLMs), through rapid and accurate interpretation, may bridge this gap and improve care delivery. Here, we developed a hybrid-reasoner framework that augments LLMs with structured clinical reasoning to reliably support time-critical, guideline-concordant AIS emergency decision-making, particularly the need for timely and accurate care in resource-limited settings. We present the first multicenter, multisource, cross-scenario evaluation encompassing both retrospective and prospective real-world clinical cases, as well as literature-derived cases. Across model scales, framework augmentation yielded consistent and substantial gains in treatment accuracy, with average improvements of 18.9% compared with standalone LLMs. Safety evaluation showed that the augmented DeepSeek-R1-671B achieved low hallucination (10.9%), omission (14.7%), and a high overall safety score (4.36/5). 
Notably, human\u0026ndash;AI interaction experiments revealed that junior and non-specialist physicians benefited most, narrowing expertise gaps. Collectively, these findings demonstrate that hybrid-reasoner augmented LLMs enhance accuracy, safety, and guideline-concordant decision-making in AIS. This study marks the transition from technical optimization to real-world translation, laying the groundwork for lightweight, safe, and equitable integration of LLMs into stroke center networks, telemedicine, and resource-limited settings.\u003c/p\u003e","manuscriptTitle":"A hybrid-reasoner LLM framework toward real-world clinical decision- making support in acute ischemic stroke","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-11 12:40:14","doi":"10.21203/rs.3.rs-7998391/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"93149a9a-6273-4e56-856d-3db5febcf0ac","owner":[],"postedDate":"December 11th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":59346215,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":59346216,"name":"Health sciences/Health care"},{"id":59346217,"name":"Physical sciences/Mathematics and computing"},{"id":59346218,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2026-02-13T16:25:44+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-11 12:40:14","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7998391","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7998391","identity":"rs-7998391","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}