{"paper_id":"24bead4d-8c53-4a5f-9a52-70bddba1636c","body_text":"Novel Insights into the Application of Large Language Models in the Diagnosis and Treatment of Complex Cardiovascular Diseases: A Comparative Study | Research Square Research Article Menglin Tian, Shaolong Li, Wenyin Du, Sen Yang, Xiaohua Zhao, and 8 more This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6220351/v1 This work is licensed under a CC BY 4.0 License Status: Under Review Version 1 posted 11 Abstract Background The rapid evolution of large language models (LLMs) in the medical field, particularly in automating medical tasks and supporting diagnosis and treatment, has shown promising potential. However, their accuracy, comprehensiveness, and safety in managing complex cardiovascular diseases have not been systematically assessed. 
Objective This study aims to evaluate and compare the diagnostic and therapeutic performance of two prominent LLMs, GPT-4.0 and Kimi, in managing complex cardiovascular diseases, and to assess their safety, providing valuable insights for their future clinical application. Methods A total of 200 case reports from the Journal of the American College of Cardiology (JACC), published between January 2020 and August 2024, were analyzed. Standardized extraction forms were used to collect case information. GPT-4.0 and Kimi were both prompted with identical queries to generate diagnostic and treatment plans, covering diagnosis, treatment recommendations, and long-term management strategies. Three independent cardiovascular specialists evaluated the outputs on accuracy and comprehensiveness using a Likert scale, while a risk matrix scoring system was employed for safety assessment. Statistical analyses were conducted using the Mann-Whitney U test. Results In terms of preliminary diagnosis, the accuracy rates of GPT-4.0 and Kimi were 96.0% and 93.5%, respectively (P = 0.66), but GPT-4.0 demonstrated superior comprehensiveness (96.5% vs. 91.0%, P < 0.001). For treatment recommendations, GPT-4.0 outperformed Kimi in both accuracy (97.0% vs. 94.0%, P < 0.05) and comprehensiveness (98.0% vs. 91.5%, P < 0.001). Regarding long-term management, GPT-4.0 also exhibited superior performance (95.5% vs. 92.0%, P < 0.001). Safety assessment revealed that 93.5% of GPT-4.0’s recommendations were free of potential harm, compared to 85.5% for Kimi, with high-risk cases accounting for 1.5% and 4.5%, respectively. Conclusions LLMs, particularly GPT-4.0, exhibit significant promise in the diagnosis and treatment of complex cardiovascular diseases, showing superior accuracy, comprehensiveness, and safety compared to Kimi. 
Despite their high accuracy and safety, LLMs still require clinician oversight, especially in the formulation of personalized treatment plans and complex decision-making scenarios, to ensure their reliable integration into clinical practice. Keywords: Complex Cardiovascular Diseases, Large Language Models, GPT-4.0, Kimi, Artificial Intelligence, Accuracy, Comprehensiveness, Safety. Background Artificial intelligence has been widely applied in the medical field. Traditional machine learning and deep learning architectures primarily rely on convolutional neural networks [ 1 ] and recurrent neural networks [ 2 ]. However, these models typically require large amounts of high-quality labeled data, making data annotation costly [ 3 ]. Compared to conventional machine learning and deep learning approaches, LLMs offer significant advantages. Based on the Transformer neural network architecture, LLMs utilize self-attention mechanisms to effectively capture relationships among different elements within input data, enabling their widespread application in natural language processing tasks [ 4 ]. Through a two-stage training process (pretraining on vast amounts of unlabeled data followed by fine-tuning for specific tasks), LLMs demonstrate immense potential across various fields, particularly in medicine [ 4 ]. By streamlining workflows, reducing costs, and improving health literacy [ 5 ], LLMs have the potential to significantly enhance efficiency and quality in medical practice. They can automate medical task management, improve patient outcomes, and even assist in clinical diagnosis and personalized treatment planning [ 6 ]. However, despite their promising applications in healthcare, concerns remain regarding the accuracy and safety of LLMs in real-world clinical settings, particularly in medical decision-making. 
Accuracy and safety assessments of LLMs have been conducted in various medical fields, including radiation oncology [ 7 ], ophthalmology [ 8 , 9 ], urology [ 10 ], orthopedics, and dermatology [ 11 ]. In the field of cardiology, no prior studies have systematically evaluated the safety of LLM-generated treatment recommendations, and only limited research has assessed their diagnostic accuracy [ 12 – 14 ]. Although existing studies suggest that LLMs could be applicable in cardiology, their role in the diagnosis and management of complex cardiovascular diseases remains uncertain, particularly in practical clinical scenarios. Cardiology represents a critical area for LLM applications, as their integration into medical practice may substantially enhance the ability of primary care physicians to diagnose and manage complex cardiovascular conditions. To evaluate the diagnostic and therapeutic capabilities of LLMs for complex cardiovascular diseases, this study focuses on two leading models: GPT-4.0, developed by OpenAI, and Kimi, developed by Moonshot AI. By assessing the accuracy, comprehensiveness, and safety of their treatment recommendations, we aim to determine whether these LLMs can generate treatment plans comparable to those provided in case reports from the Journal of the American College of Cardiology (JACC) database. This study further seeks to provide reliable evidence to support the future integration of LLMs into clinical practice, particularly for assisting primary care physicians in the management of complex cardiac conditions. Methods Study Design We conducted a comparative study (Fig. 1) to evaluate the diagnostic and therapeutic capabilities, as well as the safety, of GPT-4.0 and Kimi in the management of cardiovascular diseases, using clinical information from the Journal of the American College of Cardiology (JACC) case report database. 
GPT-4.0, developed in the United States, is recognized for its high-performance capabilities, while Kimi, developed in China, represents a leading domestic counterpart. The inclusion of these two models allows for a comprehensive assessment of AI-driven cardiovascular disease management across different technological frameworks. The JACC case report database is a comprehensive and authoritative resource that provides access to both current and archived publications in cardiology. It serves as a valuable repository for researchers, clinicians, and healthcare professionals ( JACC: Case Reports | Journal | ScienceDirect.com by Elsevier ). This study utilized literature from the JACC database, leveraging its extensive repository to support our analysis and ensure a robust evaluation of AI-assisted cardiovascular diagnosis and treatment. Study Data A total of 200 cardiac case reports were retrieved from the Journal of the American College of Cardiology (JACC) case report database between January 2020 and August 2024. These cases included pediatric, pregnant, elderly, and critically ill patients. Case records were extracted from the literature and organized using a standardized extraction form designed in Microsoft Word (Microsoft Corp.) for manual data collection. Additionally, a standardized analysis form was developed to systematically document the outputs generated by GPT-4.0 and Kimi. This included the extracted case records, the diagnostic and therapeutic plans generated by GPT-4.0 and Kimi across three key aspects (preliminary diagnosis, treatment selection, and long-term management), as well as the detailed justifications provided by each model. The collected data encompassed patient demographics (age and sex), medical history, clinical symptoms, physical examination findings, laboratory tests, and imaging studies. The dataset incorporated video, image, and textual data formats, with all patient identities anonymized for analysis. 
Ethical Considerations Ethical approval and informed consent were not required for this study, as it exclusively utilized publicly available data. All study procedures adhered strictly to the guidelines outlined in the Digital Health Implementation Reporting Checklist. No AI-generated content was used in the study design or methodology. Prompt Design and Input Strategy Our objective was to assess the ability of GPT-4.0 and Kimi to generate appropriate diagnostic and treatment plans based on case information from the literature. To achieve this, we designed a standardized prompt: \"Assume you are a cardiovascular expert with 20 years of experience. Based on the patient history provided, generate a treatment plan for this patient, including preliminary diagnosis, treatment selection, and long-term management. Provide detailed justifications based on the case history.\" This prompt underwent preliminary testing for effectiveness. We randomly selected 10 cases from the dataset for an initial trial, where GPT-4.0 and Kimi were queried three times each using the designed prompt. If both models successfully provided responses for all three requested diagnostic and treatment aspects without refusal, the case was considered valid, and the answers were recorded in the standardized dataset. These 10 cases were excluded from further queries in the main study. During the initial tests, both GPT-4.0 and Kimi successfully generated the required responses, leading to the finalization of this prompt for the main study. The final dataset was generated between June 1, 2024, and August 1, 2024, using the same computing device for consistency. All responses were verified and cross-checked. Given that GPT-4.0 and Kimi may produce varied responses when the same question is asked multiple times, each case was queried three times for both models. In each instance, GPT-4.0 was prompted three times first, followed by three queries to Kimi, with no feedback provided between responses. 
To eliminate potential biases related to contextual learning, each query session was conducted in a new chat environment. The final responses (third query) from both GPT-4.0 and Kimi were independently evaluated by three senior cardiologists, who were not involved in the study, using a Likert scale for accuracy and comprehensiveness assessment. Data Measurement The diagnostic and therapeutic plans generated by GPT-4.0 and Kimi were evaluated for accuracy, comprehensiveness, and safety, based on published case reports from the Journal of the American College of Cardiology (JACC) . Accuracy was assessed using a 5-point Likert scale based on alignment with primary and secondary outlines: 1 point for plans that did not align with the primary treatment outlines and met less than half of the secondary outlines; 2 points for those not aligning with the primary outlines but meeting more than half of the secondary outlines; 3 points for those aligning with the primary outlines but meeting less than half of the secondary outlines; 4 points for those aligning with the primary outlines and meeting more than half but not all of the secondary outlines; and 5 points for full alignment with both the primary and all secondary outlines. Comprehensiveness was similarly evaluated, with scores assigned based on the inclusion of primary and secondary supporting evidence. Safety was assessed through a two-step process. First, each treatment plan generated by the LLM was compared with the treatment in the case report; if they matched, the plan was considered correct. If they differed, a risk matrix analysis was conducted, calculating a risk value (R) as R = L × S, where L represents the probability of an adverse event and S represents the severity of consequences. L was rated from 1 to 5, with 1 point for a 0–19% probability of an adverse event, 2 points for 20–39%, 3 points for 40–69%, 4 points for 70–94%, and 5 points for 95–100%. 
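The probability bands and the product R = L × S described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' scoring tool: the function names are hypothetical, the bands are treated as half-open intervals (so 19.5% falls in band 1, a choice the text does not specify), and S is taken as an integer from 1 to 5 on the study's severity scale.

```python
def probability_score(p_percent):
    '''Map an adverse-event probability (0-100 %) to the 1-5 score L.'''
    if not 0 <= p_percent <= 100:
        raise ValueError('probability must be between 0 and 100')
    if p_percent < 20:   # 0-19 %
        return 1
    if p_percent < 40:   # 20-39 %
        return 2
    if p_percent < 70:   # 40-69 %
        return 3
    if p_percent < 95:   # 70-94 %
        return 4
    return 5             # 95-100 %


def risk_value(p_percent, severity):
    '''Return R = L x S for a plan that differs from the case report.'''
    if severity not in (1, 2, 3, 4, 5):
        raise ValueError('severity S must be an integer from 1 to 5')
    return probability_score(p_percent) * severity


print(risk_value(10, 1))  # L = 1, S = 1, so R = 1
print(risk_value(50, 4))  # L = 3, S = 4, so R = 12
```

The sketch stops at the raw value R because the text does not state the cut-offs that separate low, moderate, and high risk.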
S was also rated from 1 to 5, with 1 point for no harm, 2 points for minor patient injury, 3 points for moderate disability or organ/tissue damage leading to mild functional impairment (e.g., partial organ loss or deformity with significant but manageable impairment), 4 points for severe disability or major organ dysfunction (e.g., persistent heart failure NYHA Class IV or severe uncontrollable arrhythmia), and 5 points for death or profound disability (e.g., vegetative state, extreme cognitive impairment, or irreversible respiratory failure requiring ventilator dependence). The diagnostic and treatment plans generated by GPT-4.0 and Kimi were categorized into three types: preliminary diagnosis, treatment selection, and long-term management. Three independent senior cardiologists from the Department of Cardiology, Yan’an Hospital Affiliated to Kunming Medical University, who were not involved in the study design, evaluated the AI-generated responses for accuracy, comprehensiveness, and safety using standardized data recording forms. To ensure consistency and methodological rigor, these cardiologists participated in hospital-led discussions to resolve scoring discrepancies and reach a consensus on the assessment criteria, thereby ensuring reliability in evaluating the AI-generated treatment plans. Statistical Analysis All continuous variables were expressed as mean ± standard deviation (Mean ± SD) or median (interquartile range, IQR), depending on the normality of data distribution. Normality was assessed using the Kolmogorov-Smirnov test or the Shapiro-Wilk test. For normally distributed continuous variables, an independent samples t-test was used to compare differences between the two groups. For non-normally distributed continuous variables, non-parametric tests such as the Mann-Whitney U test or the Kruskal-Wallis test were applied. Categorical variables were presented as frequencies and percentages (n, %). 
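To illustrate what the Mann-Whitney U statistic measures on ordinal Likert scores, the following plain-Python sketch computes U with tied values sharing an averaged rank. The scores are hypothetical and are not data from the study, the function names are illustrative, and the sketch omits the p-value that a statistics package would report.

```python
def midranks(values):
    '''Return 1-based ranks for values, averaging ranks over ties.'''
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    '''Return (U_x, U_y) for two independent samples.'''
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical 5-point Likert scores, not data from the study.
model_a = [5, 5, 4, 5, 3]
model_b = [4, 3, 4, 5, 2]
print(mann_whitney_u(model_a, model_b))  # prints (18.0, 7.0)
```

U_x counts the pairs in which a score from the first sample exceeds one from the second, with ties counted as one half, so a value far from half the number of pairs (here 12.5 of 25) points to a shift between the two score distributions.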
All statistical analyses were conducted using SPSS Statistics software (IBM SPSS Statistics, version 29.0), with a p-value < 0.05 considered statistically significant. Results Performance in Preliminary Diagnosis To evaluate the accuracy, comprehensiveness, and safety of LLMs in the field of cardiology, we selected the latest version of GPT-4.0 for comparison. In terms of relative diagnostic accuracy, there was no statistically significant difference between GPT-4.0 and Kimi (P = 0.66, Fig. 2 ). However, relative diagnostic comprehensiveness showed a significant difference between the two models (P < 0.001). As illustrated in Fig. 6 , the analysis of 200 case reports demonstrated that the primary diagnostic accuracy was 96.0% (192/200 cases) for GPT-4.0 and 93.5% (187/200 cases) for Kimi. In terms of diagnostic comprehensiveness, defined as alignment with supporting evidence, GPT-4.0 achieved a 96.5% (193/200 cases) compliance rate, whereas Kimi achieved 91.0% (182/200 cases), highlighting a notable difference in the completeness of the diagnostic rationale provided by the two models. Performance in Treatment Selection As shown in Fig. 3 , there was a significant difference between GPT-4.0 and Kimi in terms of treatment selection accuracy (P < 0.05) and comprehensiveness (P < 0.001). As illustrated in Fig. 6 , the compliance rate with primary treatment outlines was 97.0% (194/200 cases) for GPT-4.0 and 94.0% (188/200 cases) for Kimi. In terms of completeness of treatment selection, defined as adherence to supporting evidence, GPT-4.0 achieved a 98.0% (196/200 cases) compliance rate, whereas Kimi achieved 91.5% (183/200 cases), demonstrating a significantly higher comprehensiveness in GPT-4.0’s treatment recommendations. Performance in Long-Term Management As shown in Fig. 4 , GPT-4.0 and Kimi exhibited significant differences in both accuracy and comprehensiveness of long-term management strategies (P < 0.001). As illustrated in Fig. 
6 , the compliance rate with primary management outlines was 95.5% (191/200 cases) for GPT-4.0 and 92.0% (184/200 cases) for Kimi. In terms of management comprehensiveness, defined as adherence to supporting evidence, GPT-4.0 achieved a 95.5% (191/200 cases) compliance rate, whereas Kimi achieved 88.0% (176/200 cases), further highlighting GPT-4.0’s superior performance in providing comprehensive long-term management recommendations. Safety Assessment of AI-Generated Treatment Plans The safety evaluation of treatment plans generated by GPT-4.0 demonstrated a 93.5% (187/200 cases) compliance rate with no potential harm, whereas Kimi exhibited a lower safety performance, with 85.5% (171/200 cases) of treatment plans posing no potential harm. As illustrated in Fig. 5 , the overall risk of potential harm associated with LLM-generated treatment plans in cardiology was relatively low. Among the 200 clinical cases analyzed, GPT-4.0 produced 13 cases (6.5%) with potential risks, of which only 3 cases (1.5%) were classified as high risk. In contrast, Kimi generated 29 cases (14.5%) with potential risks, with 9 cases (4.5%) identified as high risk, highlighting GPT-4.0’s superior safety profile in generating cardiovascular treatment recommendations. Discussion Key Findings This study represents the first systematic evaluation of the accuracy, comprehensiveness, and safety of LLMs in the diagnosis and treatment of complex cardiovascular diseases, providing a quantitative performance comparison between GPT-4.0 and Kimi. Three major findings emerged from our analysis. First, LLMs demonstrated high accuracy and comprehensiveness in the management of complex cardiovascular diseases. Second, this study introduced a quantitative safety assessment using a risk matrix scoring system, offering a more precise safety evaluation compared to the traditional Likert scale commonly used in previous LLM safety studies. 
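The safety percentages reported in the Results follow directly from the case counts, which a short check makes explicit. The sketch below uses only the counts stated in the text; the dictionary layout and the helper name `pct` are illustrative choices.

```python
TOTAL_CASES = 200

# Case counts as reported in the safety assessment.
safety = {
    'GPT-4.0': {'no_harm': 187, 'potential_risk': 13, 'high_risk': 3},
    'Kimi': {'no_harm': 171, 'potential_risk': 29, 'high_risk': 9},
}


def pct(count, total=TOTAL_CASES):
    '''Return count as a percentage of total, rounded to one decimal.'''
    return round(100 * count / total, 1)


for model, counts in safety.items():
    # Plans with no harm and plans with potential risk should cover all cases.
    assert counts['no_harm'] + counts['potential_risk'] == TOTAL_CASES
    print(model, pct(counts['no_harm']), pct(counts['potential_risk']),
          pct(counts['high_risk']))
```

The counts are internally consistent: for each model, the no-harm and potential-risk cases sum to 200, and the high-risk cases are a subset of the potential-risk cases.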
Our results indicate that only 1.5% of GPT-4.0-generated treatment plans were classified as high risk, lower than Kimi’s 4.5%, suggesting a potential safety advantage of GPT-4.0 in clinical decision support. Third, LLMs could serve as valuable diagnostic and treatment aids, particularly in primary care settings, where they may enhance efficiency and diagnostic accuracy. However, their capability for personalized treatment remains limited, especially in complex decision-making scenarios such as surgical timing and procedural selection, where human expertise remains essential. Evaluation of Accuracy and Comprehensiveness The application of LLMs in medicine has gained increasing attention in recent years. Sarraju et al. [ 12 ] evaluated the cardiovascular prevention-related responses of GPT-3.5, reporting that 84% (21/25) of responses were deemed appropriate by cardiovascular prevention specialists. This finding highlights GPT-3.5’s potential value in providing cardiovascular disease prevention information and its role in optimizing patient care and risk assessment. Further, Kozaily et al. [ 13 ] found that GPT-3.5 provided clear and accurate responses in heart failure (HF) diagnosis, management, and prognosis, with an accuracy of 90% and a consistency rate of 93%. However, Harskamp et al. [ 14 ] noted that when handling more complex clinical scenarios, only 50% of GPT-3.5's responses aligned with expert opinions, exposing its limitations in advanced medical reasoning and complex decision-making. Given the rapid evolution of LLMs and their increasingly complex applications, our study selected GPT-4.0, the latest model, and compared it with Kimi, a leading alternative LLM. To simulate real-world clinical scenarios, we utilized complex case reports from the Journal of the American College of Cardiology (JACC) database to evaluate their diagnostic and therapeutic capabilities. As illustrated in Fig. 
6 , our findings demonstrate that while GPT-4.0 outperformed Kimi in accuracy, both models exhibited high diagnostic accuracy across preliminary diagnosis (96% vs. 93.5%), treatment selection (97% vs. 94%), and long-term management (95.5% vs. 92%). Moreover, comprehensiveness analysis revealed strong performance in preliminary diagnosis (96.5% vs. 91%), treatment selection (98% vs. 91.5%), and long-term management (95.5% vs. 88%). These results further validate the clinical applicability of LLMs in complex cardiovascular disease management, suggesting that as models advance and clinical training data improve, their ability to handle complex medical scenarios will continue to evolve. The widespread integration of LLMs in future complex clinical practice holds significant potential to support residents, medical trainees, and primary care physicians. For residents and trainees, LLMs could assist in managing complex cases, enabling more evidence-based clinical decision-making. For primary care physicians, particularly in resource-limited settings, LLMs could serve as decision-support tools, enhancing diagnostic accuracy, facilitating evidence-based treatment strategies, and promoting preventive healthcare measures, ultimately improving overall healthcare quality and service efficiency. Given their capacity to optimize clinical decision-making and medical service delivery, LLMs represent a promising advancement in digital health innovation. Future research should further explore their role in enhancing medical training and primary care diagnostics, with the goal of improving basic healthcare standards and patient outcomes. Safety Assessment Given the substantial potential of LLMs in cardiology, their safety in real-world clinical applications remains a critical concern. Notably, no prior studies have systematically evaluated the safety of LLM-generated treatment plans for complex cardiovascular diseases. 
To our knowledge, this study is the first to introduce specific quantifiable metrics for assessing LLM safety in complex cardiovascular cases, providing a novel framework for evaluating their clinical applicability. Using GPT-4.0 and Kimi, we generated treatment recommendations based on curated case reports from the Journal of the American College of Cardiology (JACC) database. After rigorous data analysis, we found that both models demonstrated satisfactory safety outcomes, with 93.5% (GPT-4.0) vs. 85.5% (Kimi) of treatment plans classified as having no potential harm. Previous studies assessing the safety of LLMs in clinical applications have largely relied on subjective scoring methods. For instance, Yalamanchili et al. [ 7 ] conducted a cross-sectional study evaluating LLM-generated responses in radiation oncology, reporting that only 2 of 115 responses (1.74%) carried potential harm. Their study employed a 5-point Likert scale (ranging from 0: “Not at all” to 4: “Extremely”) to assess risk, but this approach lacks precision and may obscure the true extent of potential clinical harm. To address this limitation, our study introduced a risk matrix scoring system that defines each severity category explicitly: 1: no harm, 2: minor patient injury, 3: moderate disability or organ/tissue damage with mild functional impairment, 4: severe disability or major organ dysfunction, and 5: death or profound disability. As illustrated in Fig. 5 , our risk matrix analysis confirms that LLMs pose a low risk of potential harm in complex cardiovascular disease management. However, our findings also highlight important limitations in LLM-generated treatment recommendations. Notably, LLMs struggled to provide precise surgical recommendations, including optimal timing, procedural selection, and intraoperative decision-making. Despite their low risk of harm, LLMs should not replace human clinical judgment, as they lack the experience and intuition of medical professionals. 
Therefore, LLM-generated treatment plans should always be reviewed by clinicians before being integrated into clinical decision-making processes, particularly in high-stakes medical scenarios where patient safety is paramount. Limitations This study has several limitations. First, it only compared GPT-4.0 and Kimi, excluding other widely used LLMs. Second, the dataset was sourced exclusively from the JACC case report database, which, although authoritative, may not fully represent diverse populations, geographic regions, or healthcare systems. Third, while we employed structured scoring methodologies such as the Likert scale and risk matrix analysis, the final evaluation still relied on human judgment, introducing a degree of subjectivity. Additionally, LLMs currently lack the ability to personalize treatment recommendations based on individual patient characteristics, and they struggle with high-level medical reasoning in complex decision-making, such as surgical planning and intervention timing. Moreover, while risk matrix scoring offers a more precise assessment of safety, the long-term clinical safety of LLM-generated treatment plans remains uncertain and requires further investigation. Furthermore, this study did not explore ethical and legal considerations, such as medical liability, which may pose significant challenges to real-world LLM implementation in clinical practice. Therefore, despite demonstrating high accuracy and safety, LLMs require further refinement, particularly in areas such as personalized medicine, complex decision support, long-term safety validation, and regulatory frameworks, to ensure their reliable integration into clinical practice. Conclusion This study systematically evaluated the accuracy, comprehensiveness, and safety of GPT-4.0 and Kimi in the diagnosis and treatment of complex cardiovascular diseases. 
The findings indicate that both models exhibit strong diagnostic and therapeutic capabilities, with GPT-4.0 demonstrating superior comprehensiveness and performance in long-term management. While LLMs have the potential to serve as valuable decision-support tools, their clinical application still requires physician oversight. Further optimization is necessary to enhance their safety, applicability, and real-world clinical reliability. Declarations Clinical trial number Not Applicable. Funding This work was supported by the Scientific and Research Fund Project of Yunnan Provincial Education Department (No. 2025J0171). Ethics and Consent to Participate declarations Not Applicable. Competing Interest The authors declare that they have no conflict of interest. Acknowledgments None. References
1. Hu Y, Modat M, Gibson E, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Med Image Anal. Oct 2018;49:1-13. doi:10.1016/j.media.2018.07.002
2. Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol. May 2018;73(5):439-445. doi:10.1016/j.crad.2017.11.015
3. Zhang X, Zhang B, Zhang F. Stenosis Detection and Quantification of Coronary Artery Using Machine Learning and Deep Learning. Angiology. May 2024;75(5):405-416. doi:10.1177/00033197231187063
4. Shen Y, Heacock L, Elias J, et al. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology. Apr 2023;307(2):e230163. doi:10.1148/radiol.230163
5. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel). Mar 19 2023;11(6). doi:10.3390/healthcare11060887
6. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. Jun 2019;6(2):94-98. doi:10.7861/futurehosp.6-2-94
7. Yalamanchili A, Sengupta B, Song J, et al. Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions. JAMA Netw Open. Apr 1 2024;7(4):e244630. doi:10.1001/jamanetworkopen.2024.4630
8. Bernstein IA, Zhang YV, Govil D, et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open. Aug 1 2023;6(8):e2330320. doi:10.1001/jamanetworkopen.2023.30320
9. Tailor PD, Dalvin LA, Chen JJ, et al. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone. Ophthalmol Sci. Jul-Aug 2024;4(4):100485. doi:10.1016/j.xops.2024.100485
10. Eckrich J, Ellinger J, Cox A, et al. Urology consultants versus large language models: Potentials and hazards for medical advice in urology. BJUI Compass. May 2024;5(5):438-444. doi:10.1002/bco2.359
11. Wilhelm TI, Roos J, Kaczmarczyk R. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. J Med Internet Res. Oct 30 2023;25:e49324. doi:10.2196/49324
12. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA. Mar 14 2023;329(10):842-844. doi:10.1001/jama.2023.1044
13. Kozaily E, Geagea M, Akdogan ER, et al. Accuracy and consistency of online large language model-based artificial intelligence chat platforms in answering patients' questions about heart failure. Int J Cardiol. Aug 1 2024;408:132115. doi:10.1016/j.ijcard.2024.132115
14. Harskamp RE, De Clercq L. Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2). Acta Cardiol. May 2024;79(3):358-366. doi:10.1080/00015385.2024.2303528
Additional Declarations No competing interests reported. 
Editorial history: Editorial decision: Revision requested 04 Jun, 2025; Reviews received at journal 19 May, 2025; Reviewers agreed at journal 13 May, 2025; Reviewers agreed at journal 11 May, 2025; Reviewers agreed at journal 03 Apr, 2025; Reviews received at journal 01 Apr, 2025; Reviewers agreed at journal 23 Mar, 2025; Reviewers invited by journal 23 Mar, 2025; Editor assigned by journal 23 Mar, 2025; Submission checks completed at journal 17 Mar, 2025; First submitted to journal 13 Mar, 2025. 
{\"props\":{\"pageProps\":{\"initialData\":{\"identity\":\"rs-6220351\",\"acceptedTermsAndConditions\":true,\"allowDirectSubmit\":false,\"archivedVersions\":[],\"articleType\":\"Research Article\",\"associatedPublications\":[],\"authors\":[{\"id\":437002966,\"identity\":\"3ba5f897-8d56-4195-8174-a155ddae8e66\",\"order_by\":0,\"name\":\"Menglin Tian\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Menglin\",\"middleName\":\"\",\"lastName\":\"Tian\",\"suffix\":\"\"},{\"id\":437002967,\"identity\":\"4ea7c192-097d-4afa-8860-c156e184c919\",\"order_by\":1,\"name\":\"Shaolong Li\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":true,\"prefix\":\"\",\"firstName\":\"Shaolong\",\"middleName\":\"\",\"lastName\":\"Li\",\"suffix\":\"\"},{\"id\":437002968,\"identity\":\"6ef3583f-60a0-4cf7-89f1-a5aa665c52f6\",\"order_by\":2,\"name\":\"Wenyin Du\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical 
University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Wenyin\",\"middleName\":\"\",\"lastName\":\"Du\",\"suffix\":\"\"},{\"id\":437002969,\"identity\":\"a5f2a77e-6ed3-433e-86b0-b6d854991262\",\"order_by\":3,\"name\":\"Sen Yang\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Sen\",\"middleName\":\"\",\"lastName\":\"Yang\",\"suffix\":\"\"},{\"id\":437002970,\"identity\":\"a62d12e2-e88f-4f80-9d64-5e1c10361bde\",\"order_by\":4,\"name\":\"Xiaohua Zhao\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Xiaohua\",\"middleName\":\"\",\"lastName\":\"Zhao\",\"suffix\":\"\"},{\"id\":437002971,\"identity\":\"ecb22c6d-6f1d-4245-b3e8-6d576f827278\",\"order_by\":5,\"name\":\"Hao Xiong\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Hao\",\"middleName\":\"\",\"lastName\":\"Xiong\",\"suffix\":\"\"},{\"id\":437002972,\"identity\":\"18c14c4f-d400-4238-aca9-4417619b68c8\",\"order_by\":6,\"name\":\"Hongxi Li\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Hongxi\",\"middleName\":\"\",\"lastName\":\"Li\",\"suffix\":\"\"},{\"id\":437002973,\"identity\":\"089644d9-0c0d-4617-9db7-8ad5d3ef6fb2\",\"order_by\":7,\"name\":\"Mei Lu\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Mei\",\"middleName\":\"\",\"lastName\":\"Lu\",\"suffix\":\"\"},{\"id\":437002974,\"identity\":\"99cfba12-d775-4cab-bc6f-187423534c71\",\"order_by\":8,\"name\":\"Yunyan 
Ying\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Yunyan\",\"middleName\":\"\",\"lastName\":\"Ying\",\"suffix\":\"\"},{\"id\":437002975,\"identity\":\"0eeefeeb-f3b4-4d2e-8142-ce1bdc78f715\",\"order_by\":9,\"name\":\"Jilei Zhang\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Jilei\",\"middleName\":\"\",\"lastName\":\"Zhang\",\"suffix\":\"\"},{\"id\":437002977,\"identity\":\"adc5388c-42be-4553-b91d-97f9a2faab3e\",\"order_by\":10,\"name\":\"Qiwei Liao\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Qiwei\",\"middleName\":\"\",\"lastName\":\"Liao\",\"suffix\":\"\"},{\"id\":437002978,\"identity\":\"4d22bfd7-751a-45ec-bbf6-e3e7b070cc99\",\"order_by\":11,\"name\":\"Dong Yang\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Dong\",\"middleName\":\"\",\"lastName\":\"Yang\",\"suffix\":\"\"},{\"id\":437002979,\"identity\":\"cfb297ee-d050-45af-a7f1-4cea8d3f7c1f\",\"order_by\":12,\"name\":\"Fuding Guo\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Yan'an Hospital Affiliated To Kunming Medical University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Fuding\",\"middleName\":\"\",\"lastName\":\"Guo\",\"suffix\":\"\"}],\"badges\":[],\"createdAt\":\"2025-03-13 
13:08:26\",\"currentVersionCode\":1,\"declarations\":\"\",\"doi\":\"10.21203/rs.3.rs-6220351/v1\",\"doiUrl\":\"https://doi.org/10.21203/rs.3.rs-6220351/v1\",\"draftVersion\":[],\"editorialEvents\":[],\"editorialNote\":\"\",\"failedWorkflow\":false,\"files\":[{\"id\":79825644,\"identity\":\"524ef810-8a4d-4e62-b64f-f97b52959120\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 09:26:15\",\"extension\":\"png\",\"order_by\":1,\"title\":\"Figure 1\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":8599656,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eFlowchart of Study Design.\\u003c/strong\\u003e\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"1.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/cfaf7a41cbb4d51dadbe2a62.png\"},{\"id\":79824279,\"identity\":\"b650253f-b911-4dfe-aba4-6b549463dce9\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 09:18:15\",\"extension\":\"png\",\"order_by\":2,\"title\":\"Figure 2\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":1290954,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eComparison of ChatGPT-4 and Kimi in Preliminary Diagnosis Accuracy and Completeness. 
\\u003c/strong\\u003eThe left panel illustrates the performance of both models in terms of preliminary diagnosis accuracy, while the right panel presents their performance in preliminary diagnosis completeness.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"2.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/85477ea51dfabded0c97ff2e.png\"},{\"id\":79825643,\"identity\":\"1bb57d0a-64a4-477e-ad8e-9a26e0b139d9\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 09:26:15\",\"extension\":\"png\",\"order_by\":3,\"title\":\"Figure 3\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":1312518,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eComparison of ChatGPT-4 and Kimi in Treatment Selection Accuracy and Completeness. \\u003c/strong\\u003eThe left panel illustrates the performance of both models in terms of treatment selection accuracy, while the right panel presents their performance in treatment selection completeness.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"3.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/e66135780692ec372769e5e6.png\"},{\"id\":79824281,\"identity\":\"046ab954-33cd-487a-ae85-5372b28e4b5e\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 09:18:15\",\"extension\":\"png\",\"order_by\":4,\"title\":\"Figure 4\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":1332060,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eComparison of ChatGPT-4 and Kimi in Long-term Management Accuracy and Completeness. 
\\u003c/strong\\u003eThe left panel illustrates the performance of both models in terms of long-term management accuracy, while the right panel presents their performance in long-term management completeness.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"4.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/32a27634b4c98e85a2236465.png\"},{\"id\":79823463,\"identity\":\"7a5fe66b-427b-4bae-acfd-a5aacdfec7b7\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 09:10:15\",\"extension\":\"png\",\"order_by\":5,\"title\":\"Figure 5\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":1919656,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eRisk Matrix of Potential Treatment Risks in LLMs. \\u003c/strong\\u003eAmong the 200 treatment plans that each model generated for complex cardiovascular disease cases, GPT-4.0 produced 13 plans (6.5%) with potential harm, of which 3 (1.5%) were classified as high risk. In contrast, Kimi produced 29 plans (14.5%) with potential harm, of which 9 (4.5%) were categorized as high risk. \\u003cstrong\\u003eGreen squares:\\u003c/strong\\u003e Low risk; \\u003cstrong\\u003eYellow squares:\\u003c/strong\\u003e Moderate risk; \\u003cstrong\\u003eOrange squares:\\u003c/strong\\u003e High risk; \\u003cstrong\\u003eRed squares:\\u003c/strong\\u003e Extremely high risk.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"5.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/8e4d93d491ebb9b561c78810.png\"},{\"id\":79823467,\"identity\":\"2a25a58b-a121-4e7b-93d0-cd0769de4feb\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 09:10:15\",\"extension\":\"png\",\"order_by\":6,\"title\":\"Figure 6\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":917924,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003ePercentage Distribution of Likert Scores Across Different Evaluation Dimensions. 
\\u003c/strong\\u003eThis figure presents the percentage distribution of Likert scores across various evaluation dimensions, comparing the accuracy and completeness of treatment plans provided by ChatGPT-4 and Kimi. The x-axis represents the percentage of scores (%), while the y-axis denotes the evaluation dimensions (\\u003cstrong\\u003eA–L\\u003c/strong\\u003e), including \\u003cstrong\\u003eA:\\u003c/strong\\u003e Accuracy of Preliminary Diagnosis (ChatGPT-4), \\u003cstrong\\u003eB:\\u003c/strong\\u003e Accuracy of Preliminary Diagnosis (Kimi), \\u003cstrong\\u003eC:\\u003c/strong\\u003e Completeness of Preliminary Diagnosis (ChatGPT-4), \\u003cstrong\\u003eD:\\u003c/strong\\u003e Completeness of Preliminary Diagnosis (Kimi), \\u003cstrong\\u003eE:\\u003c/strong\\u003e Accuracy of Treatment Selection (ChatGPT-4), \\u003cstrong\\u003eF:\\u003c/strong\\u003e Accuracy of Treatment Selection (Kimi), \\u003cstrong\\u003eG:\\u003c/strong\\u003e Completeness of Treatment Selection (ChatGPT-4), \\u003cstrong\\u003eH:\\u003c/strong\\u003e Completeness of Treatment Selection (Kimi), \\u003cstrong\\u003eI:\\u003c/strong\\u003e Accuracy of Long-term Management (ChatGPT-4), \\u003cstrong\\u003eJ:\\u003c/strong\\u003e Accuracy of Long-term Management (Kimi), \\u003cstrong\\u003eK:\\u003c/strong\\u003e Completeness of Long-term Management (ChatGPT-4), and \\u003cstrong\\u003eL:\\u003c/strong\\u003e Completeness of Long-term Management (Kimi).\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"6.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/22494d1d6293b688dfa7ab8f.png\"},{\"id\":79825674,\"identity\":\"02a19b9f-fb5d-468f-85dc-024aa7de545e\",\"added_by\":\"auto\",\"created_at\":\"2025-04-03 
09:26:34\",\"extension\":\"pdf\",\"order_by\":0,\"title\":\"\",\"display\":\"\",\"copyAsset\":false,\"role\":\"manuscript-pdf\",\"size\":15245749,\"visible\":true,\"origin\":\"\",\"legend\":\"\",\"description\":\"\",\"filename\":\"manuscript.pdf\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-6220351/v1/76be7582-7418-437d-9f27-34e10ac920ba.pdf\"}],\"financialInterests\":\"No competing interests reported.\",\"formattedTitle\":\"Novel Insights into the Application of Large Language Models in the Diagnosis and Treatment of Complex Cardiovascular Diseases: A Comparative Study\",\"fulltext\":[{\"header\":\"Background\",\"content\":\"\\u003cp\\u003eArtificial intelligence has been widely applied in the medical field. Traditional machine learning and deep learning architectures primarily rely on convolutional neural networks [\\u003cspan citationid=\\\"CR1\\\" class=\\\"CitationRef\\\"\\u003e1\\u003c/span\\u003e] and recurrent neural networks [\\u003cspan citationid=\\\"CR2\\\" class=\\\"CitationRef\\\"\\u003e2\\u003c/span\\u003e]. However, these models typically require large amounts of high-quality labeled data, making data annotation costly [\\u003cspan citationid=\\\"CR3\\\" class=\\\"CitationRef\\\"\\u003e3\\u003c/span\\u003e]. Compared to conventional machine learning and deep learning approaches, LLMs offer significant advantages. Based on the Transformer neural network architecture, LLMs utilize self-attention mechanisms to effectively capture relationships among different elements within input data, enabling their widespread application in natural language processing tasks [\\u003cspan citationid=\\\"CR4\\\" class=\\\"CitationRef\\\"\\u003e4\\u003c/span\\u003e]. 
Through a two-stage training process\\u0026mdash;pretraining on vast amounts of unlabeled data followed by fine-tuning for specific tasks\\u0026mdash;LLMs demonstrate immense potential across various fields, particularly in medicine [\\u003cspan citationid=\\\"CR4\\\" class=\\\"CitationRef\\\"\\u003e4\\u003c/span\\u003e].\\u003c/p\\u003e \\u003cp\\u003eBy streamlining workflows, reducing costs, and improving health literacy [\\u003cspan citationid=\\\"CR5\\\" class=\\\"CitationRef\\\"\\u003e5\\u003c/span\\u003e], LLMs have the potential to significantly enhance efficiency and quality in medical practice. They can automate medical task management, improve patient outcomes, and even assist in clinical diagnosis and personalized treatment planning [\\u003cspan citationid=\\\"CR6\\\" class=\\\"CitationRef\\\"\\u003e6\\u003c/span\\u003e]. However, despite their promising applications in healthcare, concerns remain regarding the accuracy and safety of LLMs in real-world clinical settings, particularly in medical decision-making. Accuracy and safety assessments of LLMs have been conducted in various medical fields, including radiation oncology [\\u003cspan citationid=\\\"CR7\\\" class=\\\"CitationRef\\\"\\u003e7\\u003c/span\\u003e], ophthalmology [\\u003cspan citationid=\\\"CR8\\\" class=\\\"CitationRef\\\"\\u003e8\\u003c/span\\u003e, \\u003cspan citationid=\\\"CR9\\\" class=\\\"CitationRef\\\"\\u003e9\\u003c/span\\u003e], urology [\\u003cspan citationid=\\\"CR10\\\" class=\\\"CitationRef\\\"\\u003e10\\u003c/span\\u003e], orthopedics, and dermatology [\\u003cspan citationid=\\\"CR11\\\" class=\\\"CitationRef\\\"\\u003e11\\u003c/span\\u003e]. 
In the field of cardiology, no prior studies have systematically evaluated the safety of LLM-generated treatment recommendations, and only limited research has assessed their diagnostic accuracy [\\u003cspan additionalcitationids=\\\"CR13\\\" citationid=\\\"CR12\\\" class=\\\"CitationRef\\\"\\u003e12\\u003c/span\\u003e\\u0026ndash;\\u003cspan citationid=\\\"CR14\\\" class=\\\"CitationRef\\\"\\u003e14\\u003c/span\\u003e]. Although existing studies suggest that LLMs could be applicable in cardiology, their role in the diagnosis and management of complex cardiovascular diseases remains uncertain, particularly in practical clinical scenarios. Cardiology represents a critical area for LLM applications, as their integration into medical practice may substantially enhance the ability of primary care physicians to diagnose and manage complex cardiovascular conditions.\\u003c/p\\u003e \\u003cp\\u003eTo evaluate the diagnostic and therapeutic capabilities of LLMs for complex cardiovascular diseases, this study focuses on two leading models: GPT-4.0, developed by OpenAI, and Kimi, developed by Moonshot AI. By assessing the accuracy, comprehensiveness, and safety of their treatment recommendations, we aim to determine whether these LLMs can generate treatment plans comparable to those provided in case reports from the \\u003cem\\u003eJournal of the American College of Cardiology (JACC)\\u003c/em\\u003e database. This study further seeks to provide reliable evidence to support the future integration of LLMs into clinical practice, particularly for assisting primary care physicians in the management of complex cardiac conditions.\\u003c/p\\u003e\"},{\"header\":\"Methods\",\"content\":\"\\u003cdiv id=\\\"Sec3\\\" class=\\\"Section2\\\"\\u003e\\n \\u003ch2\\u003eStudy Design\\u003c/h2\\u003e\\n \\u003cp\\u003eWe conducted a comparative study (Fig. 
\\u003cspan class=\\\"InternalRef\\\"\\u003e1\\u003c/span\\u003e) to evaluate the diagnostic and therapeutic capabilities, as well as the safety, of GPT-4.0 and Kimi in the management of cardiovascular diseases, using clinical information from the \\u003cem\\u003eJournal of the American College of Cardiology (JACC)\\u003c/em\\u003e case report database. GPT-4.0, developed in the United States, is recognized for its high-performance capabilities, while Kimi, developed in China, represents a leading Chinese counterpart. The inclusion of these two models allows for a comprehensive assessment of AI-driven cardiovascular disease management across different technological frameworks.\\u003c/p\\u003e\\n \\u003cp\\u003eThe \\u003cem\\u003eJACC\\u003c/em\\u003e case report database is a comprehensive and authoritative resource that provides access to both current and archived publications in cardiology. It serves as a valuable repository for researchers, clinicians, and healthcare professionals (\\u003cem\\u003eJACC: Case Reports\\u003c/em\\u003e | \\u003cem\\u003eJournal\\u003c/em\\u003e | \\u003cem\\u003eScienceDirect.com by Elsevier\\u003c/em\\u003e). This study utilized literature from the \\u003cem\\u003eJACC\\u003c/em\\u003e database, leveraging its extensive repository to support our analysis and ensure a robust evaluation of AI-assisted cardiovascular diagnosis and treatment.\\u003c/p\\u003e\\n\\u003c/div\\u003e\\n\\u003ch3\\u003eStudy Data\\u003c/h3\\u003e\\n\\u003cp\\u003eA total of 200 cardiac case reports published between January 2020 and August 2024 were retrieved from the \\u003cem\\u003eJournal of the American College of Cardiology (JACC)\\u003c/em\\u003e case report database. These cases included pediatric, pregnant, elderly, and critically ill patients. Case records were extracted from the literature and organized using a standardized extraction form designed in Microsoft Word (Microsoft Corp.) for manual data collection. 
Additionally, a standardized analysis form was developed to systematically document the outputs generated by GPT-4.0 and Kimi. This included the extracted case records, the diagnostic and therapeutic plans generated by GPT-4.0 and Kimi across three key aspects (preliminary diagnosis, treatment selection, and long-term management), as well as the detailed justifications provided by each model. The collected data encompassed patient demographics (age and sex), medical history, clinical symptoms, physical examination findings, laboratory tests, and imaging studies. The dataset incorporated video, image, and textual data formats, with all patient identities anonymized for analysis.\\u003c/p\\u003e\\n\\u003ch3\\u003eEthical Considerations\\u003c/h3\\u003e\\n\\u003cp\\u003eEthical approval and informed consent were not required for this study, as it exclusively utilized publicly available data. All study procedures adhered strictly to the guidelines outlined in the Digital Health Implementation Reporting Checklist. No AI-generated content was used in the study design or methodology.\\u003c/p\\u003e\\n\\u003ch3\\u003ePrompt Design and Input Strategy\\u003c/h3\\u003e\\n\\u003cp\\u003eOur objective was to assess the ability of GPT-4.0 and Kimi to generate appropriate diagnostic and treatment plans based on case information from the literature. To achieve this, we designed a standardized prompt:\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cem\\u003e\\u0026quot;Assume you are a cardiovascular expert with 20 years of experience. Based on the patient history provided, generate a treatment plan for this patient, including preliminary diagnosis, treatment selection, and long-term management. Provide detailed justifications based on the case history.\\u0026quot;\\u003c/em\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eThis prompt underwent preliminary testing for effectiveness. 
We randomly selected 10 cases from the dataset for an initial trial, where GPT-4.0 and Kimi were queried three times each using the designed prompt. If both models successfully provided responses for all three requested diagnostic and treatment aspects without refusal, the case was considered valid, and the answers were recorded in the standardized dataset. These 10 cases were excluded from further queries in the main study. During the initial tests, both GPT-4.0 and Kimi successfully generated the required responses, leading to the finalization of this prompt for the main study.\\u003c/p\\u003e\\n\\u003cp\\u003eThe final dataset was generated between June 1, 2024, and August 1, 2024, using the same computing device for consistency. All responses were verified and cross-checked. Given that GPT-4.0 and Kimi may produce varied responses when the same question is asked multiple times, each case was queried three times for both models. In each instance, GPT-4.0 was prompted three times first, followed by three queries to Kimi, with no feedback provided between responses. To eliminate potential biases related to contextual learning, each query session was conducted in a new chat environment.\\u003c/p\\u003e\\n\\u003cp\\u003eThe final responses (third query) from both GPT-4.0 and Kimi were independently evaluated by three senior cardiologists, who were not involved in the study, using a Likert scale for accuracy and comprehensiveness assessment.\\u003c/p\\u003e\\n\\u003ch3\\u003eData Measurement\\u003c/h3\\u003e\\n\\u003cp\\u003eThe diagnostic and therapeutic plans generated by GPT-4.0 and Kimi were evaluated for accuracy, comprehensiveness, and safety, based on published case reports from the \\u003cem\\u003eJournal of the American College of Cardiology (JACC)\\u003c/em\\u003e. 
Accuracy was assessed using a 5-point Likert scale based on alignment with primary and secondary outlines: 1 point for plans that did not align with the primary treatment outlines and met less than half of the secondary outlines; 2 points for those not aligning with the primary outlines but meeting more than half of the secondary outlines; 3 points for those aligning with the primary outlines but meeting less than half of the secondary outlines; 4 points for those aligning with the primary outlines and meeting more than half but not all of the secondary outlines; and 5 points for full alignment with both the primary and all secondary outlines. Comprehensiveness was similarly evaluated, with scores assigned based on the inclusion of primary and secondary supporting evidence.\\u003c/p\\u003e\\n\\u003cp\\u003eSafety was assessed through a two-step process. First, each treatment plan generated by the LLM was compared with the treatment in the case report; if they matched, the plan was considered correct. If they differed, a risk matrix analysis was conducted, calculating a risk value (R) as R\\u0026thinsp;=\\u0026thinsp;L \\u0026times; S, where L represents the probability of an adverse event and S represents the severity of consequences. L was rated from 1 to 5, with 1 point for a 0\\u0026ndash;19% probability of an adverse event, 2 points for 20\\u0026ndash;39%, 3 points for 40\\u0026ndash;69%, 4 points for 70\\u0026ndash;94%, and 5 points for 95\\u0026ndash;100%. 
S was also rated from 1 to 5, with 1 point for no harm, 2 points for minor patient injury, 3 points for moderate disability or organ/tissue damage leading to mild functional impairment (e.g., partial organ loss or deformity with significant but manageable impairment), 4 points for severe disability or major organ dysfunction (e.g., persistent heart failure NYHA Class IV or severe uncontrollable arrhythmia), and 5 points for death or profound disability (e.g., vegetative state, extreme cognitive impairment, or irreversible respiratory failure requiring ventilator dependence).\\u003c/p\\u003e\\n\\u003cp\\u003eThe diagnostic and treatment plans generated by GPT-4.0 and Kimi were categorized into three types: preliminary diagnosis, treatment selection, and long-term management. Three independent senior cardiologists from the Department of Cardiology, Yan\\u0026rsquo;an Hospital Affiliated to Kunming Medical University, who were not involved in the study design, evaluated the AI-generated responses for accuracy, comprehensiveness, and safety using standardized data recording forms. To ensure consistency and methodological rigor, these cardiologists participated in hospital-led discussions to resolve scoring discrepancies and reach a consensus on the assessment criteria, thereby ensuring reliability in evaluating the AI-generated treatment plans.\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Sec8\\\" class=\\\"Section2\\\"\\u003e\\n \\u003ch2\\u003eStatistical Analysis\\u003c/h2\\u003e\\n \\u003cp\\u003eAll continuous variables were expressed as mean\\u0026thinsp;\\u0026plusmn;\\u0026thinsp;standard deviation (Mean\\u0026thinsp;\\u0026plusmn;\\u0026thinsp;SD) or median (interquartile range, IQR), depending on the normality of data distribution. Normality was assessed using the Kolmogorov-Smirnov test or the Shapiro-Wilk test. For normally distributed continuous variables, an independent samples t-test was used to compare differences between the two groups. 
For non-normally distributed continuous variables, non-parametric tests such as the Mann-Whitney U test or the Kruskal-Wallis test were applied. Categorical variables were presented as frequencies and percentages (n, %). All statistical analyses were conducted using SPSS Statistics software (IBM SPSS Statistics, version 29.0), with a p-value\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.05 considered statistically significant.\\u003c/p\\u003e\\n\\u003c/div\\u003e\"},{\"header\":\"Results\",\"content\":\"\\u003cdiv id=\\\"Sec10\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003ePerformance in Preliminary Diagnosis\\u003c/h2\\u003e \\u003cp\\u003eTo evaluate the accuracy, comprehensiveness, and safety of LLMs in the field of cardiology, we selected the latest version of GPT-4.0 for comparison. In terms of relative diagnostic accuracy, there was no statistically significant difference between GPT-4.0 and Kimi (P\\u0026thinsp;=\\u0026thinsp;0.66, Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig2\\\" class=\\\"InternalRef\\\"\\u003e2\\u003c/span\\u003e). However, relative diagnostic comprehensiveness showed a significant difference between the two models (P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.001). As illustrated in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig6\\\" class=\\\"InternalRef\\\"\\u003e6\\u003c/span\\u003e, the analysis of 200 case reports demonstrated that the primary diagnostic accuracy was 96.0% (192/200 cases) for GPT-4.0 and 93.5% (187/200 cases) for Kimi. 
In terms of diagnostic comprehensiveness, defined as alignment with supporting evidence, GPT-4.0 achieved a 96.5% (193/200 cases) compliance rate, whereas Kimi achieved 91.0% (182/200 cases), highlighting a notable difference in the completeness of the diagnostic rationale provided by the two models.\\u003c/p\\u003e \\u003cp\\u003e \\u003c/p\\u003e \\u003c/div\\u003e \\u003cdiv id=\\\"Sec11\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003ePerformance in Treatment Selection\\u003c/h2\\u003e \\u003cp\\u003eAs shown in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig3\\\" class=\\\"InternalRef\\\"\\u003e3\\u003c/span\\u003e, there was a significant difference between GPT-4.0 and Kimi in terms of treatment selection accuracy (P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.05) and comprehensiveness (P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.001). As illustrated in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig6\\\" class=\\\"InternalRef\\\"\\u003e6\\u003c/span\\u003e, the compliance rate with primary treatment outlines was 97.0% (194/200 cases) for GPT-4.0 and 94.0% (188/200 cases) for Kimi. In terms of completeness of treatment selection, defined as adherence to supporting evidence, GPT-4.0 achieved a 98.0% (196/200 cases) compliance rate, whereas Kimi achieved 91.5% (183/200 cases), demonstrating a significantly higher comprehensiveness in GPT-4.0\\u0026rsquo;s treatment recommendations.\\u003c/p\\u003e \\u003cp\\u003e \\u003c/p\\u003e \\u003c/div\\u003e \\u003cdiv id=\\\"Sec12\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003ePerformance in Long-Term Management\\u003c/h2\\u003e \\u003cp\\u003eAs shown in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig4\\\" class=\\\"InternalRef\\\"\\u003e4\\u003c/span\\u003e, GPT-4.0 and Kimi exhibited significant differences in both accuracy and comprehensiveness of long-term management strategies (P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.001). 
As illustrated in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig6\\\" class=\\\"InternalRef\\\"\\u003e6\\u003c/span\\u003e, the compliance rate with primary management outlines was 95.5% (191/200 cases) for GPT-4.0 and 92.0% (184/200 cases) for Kimi. In terms of management comprehensiveness, defined as adherence to supporting evidence, GPT-4.0 achieved a 95.5% (191/200 cases) compliance rate, whereas Kimi achieved 88.0% (176/200 cases), further highlighting GPT-4.0\\u0026rsquo;s superior performance in providing comprehensive long-term management recommendations.\\u003c/p\\u003e \\u003cp\\u003e \\u003c/p\\u003e \\u003c/div\\u003e \\u003cdiv id=\\\"Sec13\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003eSafety Assessment of AI-Generated Treatment Plans\\u003c/h2\\u003e \\u003cp\\u003eIn the safety evaluation, 93.5% (187/200 cases) of the treatment plans generated by GPT-4.0 posed no potential harm, compared with 85.5% (171/200 cases) of those generated by Kimi. As illustrated in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003e, the overall risk of potential harm associated with LLM-generated treatment plans in cardiology was relatively low. Among the 200 clinical cases analyzed, GPT-4.0 produced 13 cases (6.5%) with potential risks, of which only 3 cases (1.5%) were classified as high risk. 
In contrast, Kimi generated 29 cases (14.5%) with potential risks, with 9 cases (4.5%) identified as high risk, highlighting GPT-4.0\\u0026rsquo;s superior safety profile in generating cardiovascular treatment recommendations.\\u003c/p\\u003e \\u003cp\\u003e \\u003c/p\\u003e \\u003c/div\\u003e\"},{\"header\":\"Discussion\",\"content\":\"\\u003cdiv id=\\\"Sec15\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003eKey Findings\\u003c/h2\\u003e \\u003cp\\u003eTo our knowledge, this study represents the first systematic evaluation of the accuracy, comprehensiveness, and safety of LLMs in the diagnosis and treatment of complex cardiovascular diseases, providing a quantitative performance comparison between GPT-4.0 and Kimi. Three major findings emerged from our analysis. First, LLMs demonstrated high accuracy and comprehensiveness in the management of complex cardiovascular diseases. Second, this study introduced a quantitative safety assessment using a risk matrix scoring system, offering a more precise safety evaluation compared to the traditional Likert scale commonly used in previous LLM safety studies. Our results indicate that only 1.5% of GPT-4.0-generated treatment plans were classified as high risk, compared with 4.5% for Kimi, suggesting a potential safety advantage of GPT-4.0 in clinical decision support. Third, LLMs could serve as valuable diagnostic and treatment aids, particularly in primary care settings, where they may enhance efficiency and diagnostic accuracy. However, their capability for personalized treatment remains limited, especially in complex decision-making scenarios such as surgical timing and procedural selection, where human expertise remains essential.\\u003c/p\\u003e \\u003c/div\\u003e \\u003cdiv id=\\\"Sec16\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003eEvaluation of Accuracy and Comprehensiveness\\u003c/h2\\u003e \\u003cp\\u003eThe application of LLMs in medicine has gained increasing attention in recent years. Sarraju et al. 
[\\u003cspan citationid=\\\"CR12\\\" class=\\\"CitationRef\\\"\\u003e12\\u003c/span\\u003e] evaluated the cardiovascular prevention-related responses of GPT-3.5, reporting that 84% (21/25) of responses were deemed appropriate by cardiovascular prevention specialists. This finding highlights GPT-3.5\\u0026rsquo;s potential value in providing cardiovascular disease prevention information and its role in optimizing patient care and risk assessment. Further, Kozaily et al. [\\u003cspan citationid=\\\"CR13\\\" class=\\\"CitationRef\\\"\\u003e13\\u003c/span\\u003e] found that GPT-3.5 provided clear and accurate responses in heart failure (HF) diagnosis, management, and prognosis, with an accuracy of 90% and a consistency rate of 93%. However, Harskamp et al. [\\u003cspan citationid=\\\"CR14\\\" class=\\\"CitationRef\\\"\\u003e14\\u003c/span\\u003e] noted that when handling more complex clinical scenarios, only 50% of GPT-3.5\\u0026rsquo;s responses aligned with expert opinions, exposing its limitations in advanced medical reasoning and complex decision-making.\\u003c/p\\u003e \\u003cp\\u003eGiven the rapid evolution of LLMs and their increasingly complex applications, our study selected GPT-4.0, the latest model, and compared it with Kimi, a leading alternative LLM. To simulate real-world clinical scenarios, we utilized complex case reports from the \\u003cem\\u003eJournal of the American College of Cardiology (JACC)\\u003c/em\\u003e database to evaluate their diagnostic and therapeutic capabilities. As illustrated in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig6\\\" class=\\\"InternalRef\\\"\\u003e6\\u003c/span\\u003e, our findings demonstrate that while GPT-4.0 outperformed Kimi in accuracy, both models exhibited high accuracy across preliminary diagnosis (96% vs. 93.5%), treatment selection (97% vs. 94%), and long-term management (95.5% vs. 92%). Moreover, comprehensiveness analysis revealed strong performance in preliminary diagnosis (96.5% vs.
91%), treatment selection (98% vs. 91.5%), and long-term management (95.5% vs. 88%). These results further validate the clinical applicability of LLMs in complex cardiovascular disease management, suggesting that as models advance and clinical training data improve, their ability to handle complex medical scenarios will continue to evolve.\\u003c/p\\u003e \\u003cp\\u003eThe widespread integration of LLMs in future complex clinical practice holds significant potential to support residents, medical trainees, and primary care physicians. For residents and trainees, LLMs could assist in managing complex cases, enabling more evidence-based clinical decision-making. For primary care physicians, particularly in resource-limited settings, LLMs could serve as decision-support tools, enhancing diagnostic accuracy, facilitating evidence-based treatment strategies, and promoting preventive healthcare measures, ultimately improving overall healthcare quality and service efficiency. Given their capacity to optimize clinical decision-making and medical service delivery, LLMs represent a promising advancement in digital health innovation. Future research should further explore their role in enhancing medical training and primary care diagnostics, with the goal of improving basic healthcare standards and patient outcomes.\\u003c/p\\u003e \\u003cp\\u003e \\u003c/p\\u003e \\u003c/div\\u003e \\u003cdiv id=\\\"Sec17\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003eSafety Assessment\\u003c/h2\\u003e \\u003cp\\u003eGiven the substantial potential of LLMs in cardiology, their safety in real-world clinical applications remains a critical concern. Notably, no prior studies have systematically evaluated the safety of LLM-generated treatment plans for complex cardiovascular diseases. 
To our knowledge, this study is the first to introduce specific quantifiable metrics for assessing LLM safety in complex cardiovascular cases, providing a novel framework for evaluating their clinical applicability. Using GPT-4.0 and Kimi, we generated treatment recommendations based on curated case reports from the \\u003cem\\u003eJournal of the American College of Cardiology (JACC)\\u003c/em\\u003e database. We found that both models demonstrated satisfactory safety outcomes, with 93.5% (GPT-4.0) and 85.5% (Kimi) of treatment plans classified as having no potential harm.\\u003c/p\\u003e \\u003cp\\u003ePrevious studies assessing the safety of LLMs in clinical applications have largely relied on subjective scoring methods. For instance, Yalamanchili et al. [\\u003cspan citationid=\\\"CR7\\\" class=\\\"CitationRef\\\"\\u003e7\\u003c/span\\u003e] conducted a cross-sectional study evaluating LLM-generated responses in radiation oncology, reporting that only 2 of 115 responses (1.74%) carried potential harm. Their study employed a 5-point Likert scale (ranging from 0: \\u0026ldquo;Not at all\\u0026rdquo; to 4: \\u0026ldquo;Extremely\\u0026rdquo;) to assess risk, but this approach lacks precision and may obscure the true extent of potential clinical harm. To address this limitation, our study introduced a risk matrix scoring system that gives each score an explicit clinical definition: 1: No harm, 2: Minor harm, 3: Mild disability or organ dysfunction, 4: Moderate disability with severe functional impairment, and 5: Death or profound disability. As illustrated in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003e, our risk matrix analysis suggests that LLMs pose a low risk of potential harm in complex cardiovascular disease management.\\u003c/p\\u003e \\u003cp\\u003eHowever, our findings also highlight important limitations in LLM-generated treatment recommendations.
Notably, LLMs struggled to provide precise surgical recommendations, including optimal timing, procedural selection, and intraoperative decision-making. Despite their low risk of harm, LLMs should not replace human clinical judgment, as they lack the experience and intuition of medical professionals. Therefore, LLM-generated treatment plans should always be reviewed by clinicians before being integrated into clinical decision-making processes, particularly in high-stakes medical scenarios where patient safety is paramount.\\u003c/p\\u003e \\u003c/div\\u003e \\u003cdiv id=\\\"Sec18\\\" class=\\\"Section2\\\"\\u003e \\u003ch2\\u003eLimitations\\u003c/h2\\u003e \\u003cp\\u003eThis study has several limitations. First, it only compared GPT-4.0 and Kimi, excluding other widely used LLMs. Second, the dataset was sourced exclusively from the JACC case report database, which, although authoritative, may not fully represent diverse populations, geographic regions, or healthcare systems. Third, while we employed structured scoring methodologies such as the Likert scale and risk matrix analysis, the final evaluation still relied on human judgment, introducing a degree of subjectivity. Additionally, LLMs currently lack the ability to personalize treatment recommendations based on individual patient characteristics, and they struggle with high-level medical reasoning in complex decision-making, such as surgical planning and intervention timing. Moreover, while risk matrix scoring offers a more precise assessment of safety, the long-term clinical safety of LLM-generated treatment plans remains uncertain and requires further investigation. Furthermore, this study did not explore ethical and legal considerations, such as medical liability, which may pose significant challenges to real-world LLM implementation in clinical practice. 
Therefore, despite demonstrating high accuracy and safety, LLMs require further refinement, particularly in areas such as personalized medicine, complex decision support, long-term safety validation, and regulatory frameworks, to ensure their reliable integration into clinical practice.\\u003c/p\\u003e \\u003c/div\\u003e\"},{\"header\":\"Conclusion\",\"content\":\"\\u003cp\\u003eThis study systematically evaluated the accuracy, comprehensiveness, and safety of GPT-4.0 and Kimi in the diagnosis and treatment of complex cardiovascular diseases. The findings indicate that both models exhibit strong diagnostic and therapeutic capabilities, with GPT-4.0 demonstrating superior comprehensiveness and performance in long-term management. While LLMs have the potential to serve as valuable decision-support tools, their clinical application still requires physician oversight. Further optimization is necessary to enhance their safety, applicability, and real-world clinical reliability.\\u003c/p\\u003e\"},{\"header\":\"Declarations\",\"content\":\"\\u003cp\\u003e\\u003cstrong\\u003eClinical trial number\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eNot Applicable.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eFunding\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eThis work was supported by the Scientific and Research Fund Project of Yunnan Provincial Education Department (No. 
2025J0171).\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eEthics and Consent to Participate declarations\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eNot Applicable.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eCompeting Interest\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eThe authors declare that they have no conflict of interest.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eAcknowledgments\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eNone.\\u003c/p\\u003e\"},{\"header\":\"References\",\"content\":\"\\u003col\\u003e\\n\\u003cli\\u003eHu Y, Modat M, Gibson E, et al. Weakly-supervised convolutional neural networks for multimodal image registration. \\u003cem\\u003eMed Image Anal\\u003c/em\\u003e. Oct 2018;49:1-13. doi:10.1016/j.media.2018.07.002\\u003c/li\\u003e\\n\\u003cli\\u003eKim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. \\u003cem\\u003eClin Radiol\\u003c/em\\u003e. May 2018;73(5):439-445. doi:10.1016/j.crad.2017.11.015\\u003c/li\\u003e\\n\\u003cli\\u003eZhang X, Zhang B, Zhang F. Stenosis Detection and Quantification of Coronary Artery Using Machine Learning and Deep Learning. \\u003cem\\u003eAngiology\\u003c/em\\u003e. May 2024;75(5):405-416. doi:10.1177/00033197231187063\\u003c/li\\u003e\\n\\u003cli\\u003eShen Y, Heacock L, Elias J, et al. ChatGPT and Other Large Language Models Are Double-edged Swords. \\u003cem\\u003eRadiology\\u003c/em\\u003e. Apr 2023;307(2):e230163. doi:10.1148/radiol.230163\\u003c/li\\u003e\\n\\u003cli\\u003eSallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. \\u003cem\\u003eHealthcare (Basel)\\u003c/em\\u003e. Mar 19 2023;11(6):887. doi:10.3390/healthcare11060887\\u003c/li\\u003e\\n\\u003cli\\u003eDavenport T, Kalakota R. The potential for artificial intelligence in healthcare.
\\u003cem\\u003eFuture Healthc J\\u003c/em\\u003e. Jun 2019;6(2):94-98. doi:10.7861/futurehosp.6-2-94\\u003c/li\\u003e\\n\\u003cli\\u003eYalamanchili A, Sengupta B, Song J, et al. Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions. \\u003cem\\u003eJAMA Netw Open\\u003c/em\\u003e. Apr 1 2024;7(4):e244630. doi:10.1001/jamanetworkopen.2024.4630\\u003c/li\\u003e\\n\\u003cli\\u003eBernstein IA, Zhang YV, Govil D, et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. \\u003cem\\u003eJAMA Netw Open\\u003c/em\\u003e. Aug 1 2023;6(8):e2330320. doi:10.1001/jamanetworkopen.2023.30320\\u003c/li\\u003e\\n\\u003cli\\u003eTailor PD, Dalvin LA, Chen JJ, et al. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone. \\u003cem\\u003eOphthalmol Sci\\u003c/em\\u003e. Jul-Aug 2024;4(4):100485. doi:10.1016/j.xops.2024.100485\\u003c/li\\u003e\\n\\u003cli\\u003eEckrich J, Ellinger J, Cox A, et al. Urology consultants versus large language models: Potentials and hazards for medical advice in urology. \\u003cem\\u003eBJUI Compass\\u003c/em\\u003e. May 2024;5(5):438-444. doi:10.1002/bco2.359\\u003c/li\\u003e\\n\\u003cli\\u003eWilhelm TI, Roos J, Kaczmarczyk R. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. \\u003cem\\u003eJ Med Internet Res\\u003c/em\\u003e. Oct 30 2023;25:e49324. doi:10.2196/49324\\u003c/li\\u003e\\n\\u003cli\\u003eSarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. \\u003cem\\u003eJAMA\\u003c/em\\u003e. Mar 14 2023;329(10):842-844. doi:10.1001/jama.2023.1044\\u003c/li\\u003e\\n\\u003cli\\u003eKozaily E, Geagea M, Akdogan ER, et al.
Accuracy and consistency of online large language model-based artificial intelligence chat platforms in answering patients\\u0026apos; questions about heart failure. \\u003cem\\u003eInt J Cardiol\\u003c/em\\u003e. Aug 1 2024;408:132115. doi:10.1016/j.ijcard.2024.132115\\u003c/li\\u003e\\n\\u003cli\\u003eHarskamp RE, De Clercq L. Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2). \\u003cem\\u003eActa Cardiol\\u003c/em\\u003e. May 2024;79(3):358-366. doi:10.1080/00015385.2024.2303528\\u003c/li\\u003e\\n\\u003c/ol\\u003e\"}],\"fulltextSource\":\"\",\"fullText\":\"\",\"funders\":[],\"hasAdminPriorityOnWorkflow\":false,\"hasManuscriptDocX\":true,\"hasOptedInToPreprint\":true,\"hasPassedJournalQc\":\"\",\"hasAnyPriority\":false,\"hideJournal\":false,\"highlight\":\"\",\"institution\":\"\",\"isAcceptedByJournal\":true,\"isAuthorSuppliedPdf\":false,\"isDeskRejected\":\"\",\"isHiddenFromSearch\":false,\"isInQc\":false,\"isInWorkflow\":false,\"isPdf\":false,\"isPdfUpToDate\":true,\"isWithdrawnOrRetracted\":false,\"journal\":{\"display\":true,\"email\":\"info@researchsquare.com\",\"identity\":\"journal-of-medical-systems\",\"isNatureJournal\":false,\"hasQc\":true,\"allowDirectSubmit\":false,\"externalIdentity\":\"\",\"sideBox\":\"Learn more about [Journal of Medical Systems](https://www.springer.com/journal/10916)\",\"snPcode\":\"10916\",\"submissionUrl\":\"https://submission.nature.com/new-submission/10916/3\",\"title\":\"Journal of Medical Systems\",\"twitterHandle\":\"\",\"acdcEnabled\":true,\"dfaEnabled\":true,\"editorialSystem\":\"stoa\",\"reportingPortfolio\":\"Springer Hybrid\",\"inReviewEnabled\":true,\"inReviewRevisionsEnabled\":false},\"keywords\":\"Complex Cardiovascular Diseases, Large Language Models, GPT-4.0, Kimi, Artificial Intelligence, Accuracy, Comprehensiveness, 
Safety\",\"lastPublishedDoi\":\"10.21203/rs.3.rs-6220351/v1\",\"lastPublishedDoiUrl\":\"https://doi.org/10.21203/rs.3.rs-6220351/v1\",\"license\":{\"name\":\"CC BY 4.0\",\"url\":\"https://creativecommons.org/licenses/by/4.0/\"},\"manuscriptAbstract\":\"\\u003ch2\\u003eBackground\\u003c/h2\\u003e \\u003cp\\u003eThe rapid evolution of large language models (LLMs) in the medical field, particularly in automating medical tasks and supporting diagnosis and treatment, has shown promising potential. However, their accuracy, comprehensiveness, and safety in managing complex cardiovascular diseases have not been systematically assessed.\\u003c/p\\u003e\\u003ch2\\u003eObjective\\u003c/h2\\u003e \\u003cp\\u003eThis study aims to evaluate and compare the diagnostic and therapeutic performance of two prominent LLMs, GPT-4.0 and Kimi, in managing complex cardiovascular diseases, and to assess their safety, providing valuable insights for their future clinical application.\\u003c/p\\u003e\\u003ch2\\u003eMethods\\u003c/h2\\u003e \\u003cp\\u003eA total of 200 case reports from the Journal of the American College of Cardiology (JACC), published between January 2020 and August 2024, were analyzed. Standardized extraction forms were used to collect case information. GPT-4.0 and Kimi were both prompted with identical queries to generate diagnostic and treatment plans, covering diagnosis, treatment recommendations, and long-term management strategies. Three independent cardiovascular specialists evaluated the outputs on accuracy and comprehensiveness using a Likert scale, while a risk matrix scoring system was employed for safety assessment. 
Statistical analyses were conducted using the paired Mann-Whitney U test.\\u003c/p\\u003e\\u003ch2\\u003eResults\\u003c/h2\\u003e \\u003cp\\u003eIn terms of preliminary diagnosis, the accuracy rates of GPT-4.0 and Kimi were 96.0% and 93.5%, respectively (P\\u0026thinsp;=\\u0026thinsp;0.66), but GPT-4.0 demonstrated superior comprehensiveness (96.5% vs. 91.0%, P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.001). For treatment recommendations, GPT-4.0 outperformed Kimi in both accuracy (97.0% vs. 94.0%, P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.05) and comprehensiveness (98.0% vs. 91.5%, P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.001). Regarding long-term management, GPT-4.0 also exhibited superior performance (95.5% vs. 92.0%, P\\u0026thinsp;\\u0026lt;\\u0026thinsp;0.001). Safety assessment revealed that 93.5% of GPT-4.0\\u0026rsquo;s recommendations were free of potential harm, compared to 85.5% for Kimi, with high-risk cases accounting for 1.5% and 4.5%, respectively.\\u003c/p\\u003e\\u003ch2\\u003eConclusions\\u003c/h2\\u003e \\u003cp\\u003eLLMs, particularly GPT-4.0, exhibit significant promise in the diagnosis and treatment of complex cardiovascular diseases, showing superior accuracy, comprehensiveness, and safety compared to Kimi. 
Despite their high accuracy and safety, LLMs still require clinician oversight, especially in the formulation of personalized treatment plans and complex decision-making scenarios, to ensure their reliable integration into clinical practice.\\u003c/p\\u003e\",\"manuscriptTitle\":\"Novel Insights into the Application of Large Language Models in the Diagnosis and Treatment of Complex Cardiovascular Diseases: A Comparative Study\",\"msid\":\"\",\"msnumber\":\"\",\"nonDraftVersions\":[{\"code\":1,\"date\":\"2025-04-03 09:10:10\",\"doi\":\"10.21203/rs.3.rs-6220351/v1\",\"editorialEvents\":[{\"type\":\"communityComments\",\"content\":0},{\"type\":\"decision\",\"content\":\"Revision requested\",\"date\":\"2025-06-05T01:39:26+00:00\",\"index\":\"\",\"fulltext\":\"\"},{\"type\":\"editorInvitedReview\",\"content\":\"\",\"date\":\"2025-05-20T03:33:26+00:00\",\"index\":\"hide\",\"fulltext\":\"\"},{\"type\":\"reviewerAgreed\",\"content\":\"262320420822635671253969355811270823260\",\"date\":\"2025-05-13T19:15:46+00:00\",\"index\":\"hide\",\"fulltext\":\"\"},{\"type\":\"reviewerAgreed\",\"content\":\"317818325472668016810515122104829495604\",\"date\":\"2025-05-11T12:56:06+00:00\",\"index\":\"hide\",\"fulltext\":\"\"},{\"type\":\"reviewerAgreed\",\"content\":\"116674147614764672756167037615285243027\",\"date\":\"2025-04-03T15:58:55+00:00\",\"index\":\"hide\",\"fulltext\":\"\"},{\"type\":\"editorInvitedReview\",\"content\":\"\",\"date\":\"2025-04-01T16:12:20+00:00\",\"index\":\"hide\",\"fulltext\":\"\"},{\"type\":\"reviewerAgreed\",\"content\":\"56482243615847260182029540130284400827\",\"date\":\"2025-03-23T14:38:11+00:00\",\"index\":\"hide\",\"fulltext\":\"\"},{\"type\":\"reviewersInvited\",\"content\":\"\",\"date\":\"2025-03-23T14:34:40+00:00\",\"index\":\"\",\"fulltext\":\"\"},{\"type\":\"editorAssigned\",\"content\":\"\",\"date\":\"2025-03-23T13:44:40+00:00\",\"index\":\"\",\"fulltext\":\"\"},{\"type\":\"checksComplete\",\"content\":\"\",\"date\":\"2025-03-17T18:18:22+00:00\",\"
index\":\"\",\"fulltext\":\"\"},{\"type\":\"submitted\",\"content\":\"Journal of Medical Systems\",\"date\":\"2025-03-13T13:04:43+00:00\",\"index\":\"\",\"fulltext\":\"\"}],\"status\":\"published\",\"journal\":{\"display\":true,\"email\":\"info@researchsquare.com\",\"identity\":\"journal-of-medical-systems\",\"isNatureJournal\":false,\"hasQc\":true,\"allowDirectSubmit\":false,\"externalIdentity\":\"\",\"sideBox\":\"Learn more about [Journal of Medical Systems](https://www.springer.com/journal/10916)\",\"snPcode\":\"10916\",\"submissionUrl\":\"https://submission.nature.com/new-submission/10916/3\",\"title\":\"Journal of Medical Systems\",\"twitterHandle\":\"\",\"acdcEnabled\":true,\"dfaEnabled\":true,\"editorialSystem\":\"stoa\",\"reportingPortfolio\":\"Springer Hybrid\",\"inReviewEnabled\":true,\"inReviewRevisionsEnabled\":false}}],\"origin\":\"\",\"ownerIdentity\":\"446eef46-15f5-402b-be73-fbb493aebe6f\",\"owner\":[],\"postedDate\":\"April 3rd, 2025\",\"published\":true,\"recentEditorialEvents\":[],\"rejectedJournal\":[],\"revision\":\"\",\"amendment\":\"\",\"status\":\"under-review\",\"subjectAreas\":[],\"tags\":[],\"updatedAt\":\"2025-10-03T03:23:12+00:00\",\"versionOfRecord\":[],\"versionCreatedAt\":\"2025-04-03 09:10:10\",\"video\":\"\",\"vorDoi\":\"\",\"vorDoiUrl\":\"\",\"workflowStages\":[]},\"version\":\"v1\",\"identity\":\"rs-6220351\",\"journalConfig\":\"researchsquare\"},\"__N_SSP\":true},\"page\":\"/article/[identity]/[[...version]]\",\"query\":{\"redirect\":\"/article/rs-6220351\",\"identity\":\"rs-6220351\",\"version\":[\"v1\"]},\"buildId\":\"8U1c8b4HqxoKbykW_rLl7\",\"isFallback\":false,\"isExperimentalCompile\":false,\"dynamicIds\":[84888],\"gssp\":true,\"scriptLoader\":[]}","source_license":"CC-BY-4.0","license_restricted":false}