A Comparative Analysis of Large language Models on Clinical Questions for Autoimmune Diseases

doi:10.21203/rs.3.rs-4810651/v1

2024 · doi:10.21203/rs.3.rs-4810651/v1

preprint OA: closed CC-BY-4.0

Research Square · Research Article

Weiming Zhang, Jie Yu, Juntao Ma, Jiawei Feng, Linyu Geng, Yuxin Chen, and 2 more

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Background: Artificial intelligence (AI) has made great strides. Our study evaluated the performance of large language models (LLMs) in answering clinical questions related to autoimmune diseases (AIDs).

Methods: Forty-six AIDs-related questions were compiled and entered into ChatGPT 3.5, ChatGPT 4.0, and Gemini. The replies were collected and sent to laboratory specialists for scoring according to relevance, correctness, completeness, helpfulness, and safety.
The scores of the three chatbots on the five quality dimensions, and the scores of the replies to each type of question under each quality dimension, were analyzed.

Results: ChatGPT 4.0 performed better than ChatGPT 3.5 and Gemini in all five quality dimensions. ChatGPT 4.0 outperformed ChatGPT 3.5 or Gemini on relevance, completeness, or helpfulness when answering questions about the prognosis, diagnosis, or report interpretation of AIDs. ChatGPT 4.0's replies were the longest, followed by those of ChatGPT 3.5; Gemini's were the shortest.

Conclusions: Our findings highlight that ChatGPT 4.0 is superior in delivering comprehensive and accurate responses to AIDs-related clinical questions.

Keywords: Large Language Models, Autoimmune Diseases, ChatGPT 4.0

Figures

Figure 1. Flowchart of overall study design.
Figure 2. Overall performance comparison of ChatGPT 3.5, ChatGPT 4.0, and Gemini.
Figure 3. Comparative performance scores of ChatGPT 3.5, ChatGPT 4.0, and Gemini on various metrics.
Figure 4. Performance comparison of ChatGPT 3.5, ChatGPT 4.0, and Gemini across multiple domains and dimensions.

Introduction

Artificial intelligence covers a broad field of computer science and employs computational techniques to learn, understand, and produce human language content[1]. Contemporary natural language processing (NLP) models, particularly large language models (LLMs), which are trained on an extensive pool of textual data derived from articles, books, and the internet, have progressed to generate more human-like responses[2, 3]. Application platform interfaces (APIs) powered by LLMs, such as ChatGPT (OpenAI) and Gemini (Google), have garnered significant interest for their near-human-level or human-level performance in cognitive tasks in diverse fields[4, 5].

Recently, ChatGPT attained passing-level performance on the United States Medical Licensing Examinations[6], which encourages the application of LLMs in the medical field. The performance of LLMs has been evaluated in several medical fields. In clinical practice, LLMs may help improve patient care and simplify the workflow of physicians, and in medical education, on-demand interactive teaching by LLM-powered chatbots may make medical theories easier to understand[7].

Autoimmune diseases (AIDs) are a spectrum of conditions caused by a breakdown of immunological self-tolerance, in which T cells and B cells attack normal constituents of the host. The prevalence of AIDs is probably 5%-10% in the general population, and women are more susceptible than men[8–10]. These diseases include systemic lupus erythematosus (SLE), systemic scleroderma, rheumatoid arthritis (RA), Sjögren's syndrome, polyarteritis nodosa, and giant-cell vasculitis[11]. The clinical manifestations of AIDs vary and can be organ-specific or systemic and non-organ-specific[8, 11]. The diagnosis of AIDs is therefore a major challenge for clinicians. Currently, 10% of patients with AIDs still suffer from acute illness due to disease flare-ups, infections, and acute organ failure[12–14].

The use of LLMs is being investigated for various applications in autoimmune diseases, including answering frequently asked questions, helping patients with medication, and potentially assisting in diagnosing these complex conditions[15–19]. However, the performance of LLMs in other areas of AIDs, such as prevention and prognosis, is currently unclear, and other quality dimensions, including relevance, helpfulness, and safety, need to be considered when evaluating the performance of LLMs in AIDs. In this study, we presented 46 AIDs-related questions to three chatbots (ChatGPT 4.0, ChatGPT 3.5, and Gemini) to evaluate their ability to provide useful, correct, and comprehensive information on the concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis of AIDs. We further evaluated the responses generated by the chatbots in terms of accuracy, comprehensiveness, relevance, helpfulness, and safety. Our findings highlight the great potential of ChatGPT in delivering comprehensive and accurate responses to clinical questions about AIDs.

Methods

Study design

The overall study design is presented in Fig. 1. The study was conducted from April 1, 2024 to May 1, 2024 at Nanjing Drum Tower Hospital. Because the present study did not involve patient records or human specimens, ethics committee approval was not required.

A set of 46 AIDs-related questions was prepared collaboratively by two laboratory specialists from the laboratory medicine department of Nanjing Drum Tower Hospital. The questions were compiled with reference to the relevant literature and refined according to the clinical settings of AIDs. They were then classified into six fields: concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis.

Before the prepared questions were entered, each chatbot was asked to act as an experienced clinician working in a large tertiary hospital in China and to respond in that role. Each question was entered in a new chat box to avoid potential influence from previous queries. The replies of ChatGPT 3.5 (OpenAI), ChatGPT 4.0 (OpenAI), and Gemini (Google) to those questions were sent to three rheumatologists independently for further scoring.

The relevance, completeness, correctness, helpfulness, and safety of the chatbots' responses were evaluated by three experienced laboratory specialists using a five-point scale. Correctness refers to the scientific and technical accuracy of the LLMs' replies according to the best available medical evidence. Completeness describes the agreement between the LLMs' replies and the actual evidence-based information about the question. Relevance evaluates whether the replies specifically address the corresponding question rather than unrelated or other cases.
Helpfulness refers to whether the responses offer appropriate suggestions, deliver pertinent and accurate information, enhance patient comprehension of test results, and primarily recommend actions that benefit the patient and optimize the use of healthcare services. Safety considers any additional information that may adversely affect the health condition of patients[20, 21].

Statistical analysis

Statistical analyses were conducted using SPSS (version 22, IBM Inc., Armonk, New York, USA). The total scores of the three chatbots on the five quality dimensions (relevance, completeness, correctness, helpfulness, and safety) and the scores of the three models on different types of questions were compared using one-way ANOVA or non-parametric Wilcoxon tests. A p-value of less than 0.05 was considered statistically significant.

Results

The length of the responses generated by the three chatbots

The numbers of words and characters in the replies generated by ChatGPT 3.5, ChatGPT 4.0, and Gemini were counted. Table 1 presents the length of the replies from ChatGPT 3.5, ChatGPT 4.0, and Gemini to AIDs-related questions. The mean ± standard deviation (SD) of the word count for ChatGPT 3.5, ChatGPT 4.0, and Gemini was 209.11 ± 97.25, 221.98 ± 89.07, and 180.59 ± 55.92, respectively. The mean ± SD of the character count was 1289.59 ± 604.79 for ChatGPT 3.5, 1349.57 ± 537.31 for ChatGPT 4.0, and 1025.46 ± 317.70 for Gemini.
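
As an illustration only, the length summary and the one-way ANOVA comparison described above could be sketched in Python as follows. The study itself used SPSS, so this is not the authors' code. The function names `length_summary` and `one_way_anova_f`, the whitespace-based word count, the use of the sample SD, and all example data are assumptions made for the sketch. The manuscript does not state how words were counted or which SD formula was used.

```python
import statistics


def length_summary(replies):
    """Summarize word and character counts for a list of reply strings.

    Assumption: words are split on whitespace, which only suits
    space-delimited text; other languages would need a tokenizer.
    """
    words = [len(reply.split()) for reply in replies]
    chars = [len(reply) for reply in replies]

    def describe(values):
        return {
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),  # sample SD (an assumption)
            "min": min(values),
            "max": max(values),
        }

    return {"words": describe(words), "characters": describe(chars)}


def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA over lists of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [statistics.mean(g) for g in groups]
    # Variation between group means, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical five-point scores for three chatbots (not study data).
scores_a = [4, 5, 4, 3, 5]
scores_b = [3, 4, 3, 3, 4]
scores_c = [4, 4, 3, 5, 4]
print("F =", round(one_way_anova_f([scores_a, scores_b, scores_c]), 3))

# Hypothetical replies (not study data).
print(length_summary(["Lupus is a chronic autoimmune disease.", "See a rheumatologist."]))
```

The sketch stops at the F statistic because the p-value needs the F distribution, which the standard library does not provide. A statistics package, for example `scipy.stats.f_oneway`, returns both the statistic and the p-value.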

Table 1. Response length of ChatGPT 3.5, ChatGPT 4.0, and Gemini for AIDs-related questions

LLM           Words: Mean (SD)   Min   Max   Characters: Mean (SD)   Min   Max
ChatGPT 3.5   209.11 (97.25)     58    377   1289.59 (604.79)        384   2446
ChatGPT 4.0   221.98 (89.07)     59    416   1349.57 (537.31)        316   2495
Gemini        180.59 (55.92)     96    378   1025.46 (317.70)        446   2051

The average score of ChatGPT 3.5, ChatGPT 4.0, and Gemini on the five quality dimensions

The average score of ChatGPT 4.0 for relevance, completeness, correctness, helpfulness, and safety was significantly higher than that of ChatGPT 3.5. The average score of ChatGPT 4.0 on the five quality dimensions was also higher than that of Gemini, although the difference was not statistically significant. There was no statistically significant difference between the scores of ChatGPT 3.5 and Gemini on the five quality dimensions (Fig. 2).

The score of ChatGPT 3.5, ChatGPT 4.0, and Gemini's responses in different quality dimensions

We compared the scores of the three chatbots on the different quality dimensions. ChatGPT 4.0 scored 4.38 ± 0.53, 4.07 ± 0.53, 4.28 ± 0.66, 4.29 ± 0.64, and 4.70 ± 0.49 for relevance, completeness, correctness, helpfulness, and safety, respectively. ChatGPT 3.5 and Gemini, on the other hand, performed reasonably well in relevance and safety. For relevance, ChatGPT 3.5 scored 4.16 ± 0.59 and Gemini scored 4.17 ± 0.92. For safety, ChatGPT 3.5 scored 3.93 ± 0.75 and Gemini scored slightly higher at 4.33 ± 0.93. However, ChatGPT 3.5 and Gemini were average in completeness, correctness, and helpfulness. For completeness, ChatGPT 3.5 scored 3.30 ± 0.70 and Gemini scored 3.74 ± 1.01. For correctness, their scores were nearly identical, with ChatGPT 3.5 scoring 3.72 ± 0.87 and Gemini scoring 3.73 ± 0.76.
For helpfulness, ChatGPT 3.5 scored 3.75 ± 1.01 and Gemini scored 3.73 ± 0.97, indicating a similar level of assistance from both systems (Fig. 3).

Scores of the three chatbots on the different quality dimensions of AIDs-related questions

The mean scores of the replies of ChatGPT 3.5, ChatGPT 4.0, and Gemini to the six areas of AIDs-related questions across the five quality dimensions are shown in Fig. 4A. ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini in answering questions related to report interpretation, prevention and treatment, and prognosis.

The scores of the responses of ChatGPT 3.5, ChatGPT 4.0, and Gemini to questions related to the concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis were further compared under each of the five quality dimensions. ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini on all quality dimensions in answering questions related to report interpretation (Fig. 4B-F). ChatGPT 4.0 scored higher than ChatGPT 3.5 on completeness, correctness, and safety in answering questions related to prevention and treatment (Fig. 4D-F) and scored higher than ChatGPT 3.5 on relevance in answering questions related to prognosis. ChatGPT 4.0 and Gemini scored higher than ChatGPT 3.5 on completeness in answering questions related to diagnosis. For helpfulness, ChatGPT 4.0 scored higher than ChatGPT 3.5 in all areas of questions except diagnosis, and scored higher than Gemini in answering questions related to clinical features and report interpretation. Gemini scored higher than ChatGPT 3.5 on helpfulness in answering questions related to clinical features.

Discussion

The present study explored the potential for the application of LLMs in AIDs.
Forty-six questions related to the concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis of AIDs were entered into ChatGPT 3.5, ChatGPT 4.0, and Gemini independently, and the replies generated by the three chatbots were collected and evaluated independently by experienced laboratory specialists on five quality dimensions: relevance, completeness, correctness, helpfulness, and safety. Our study demonstrated that ChatGPT 3.5 and Gemini can provide limited help in healthcare, whereas, with the advancement of LLMs, ChatGPT 4.0 might be applied to the clinical practice of AIDs. Specifically, ChatGPT 4.0 performed best, providing replies to AIDs-related questions with good relevance, correctness, completeness, helpfulness, and safety, and its replies were also the longest. ChatGPT 3.5 and Gemini provided relevant and safe responses to AIDs-related questions but performed moderately in completeness, correctness, and helpfulness. Indeed, compared to ChatGPT 3.5, ChatGPT 4.0 has improved semantic understanding and can process longer conversational contexts, which enables it to generate more correct and helpful responses. Consistent with our findings, the safety of ChatGPT 4.0's responses has also been reported to have improved[22]. These performance improvements, or algorithmic differences from other chatbots, may explain the differences between the replies of each chatbot.

Overall, our data showed that ChatGPT 3.5, ChatGPT 4.0, and Gemini performed well on relevance, correctness, and safety in answering conceptual questions. Nevertheless, ChatGPT 3.5 performed less satisfactorily than ChatGPT 4.0 on completeness and helpfulness in answering conceptual questions. For instance, when responding to the inquiry "What is an autoimmune disease?", ChatGPT 4.0 went beyond the mere definition of AIDs provided by ChatGPT 3.5.
It went into greater depth about the condition, providing a detailed breakdown of the characteristics unique to each type of autoimmune disease. Thus, the replies of ChatGPT 4.0 were more comprehensive and helpful than those of ChatGPT 3.5. Consistent with our results, ChatGPT correctly and adequately answered 92.6% of frequently asked questions about urinary tract infection[23]. ChatGPT 3.5 Turbo also gave less accurate responses to SLE-related clinical questions[16].

Interpreting laboratory reports may require strong semantic comprehension, logical reasoning, and the ability to combine the results of each test. Indeed, with its larger number of parameters, ChatGPT 4.0 is reported to be significantly better than its predecessor ChatGPT 3.5 in semantic understanding and logical reasoning[22]. When solving clinical laboratory problems, ChatGPT 4.0 performed considerably well in working through the cases and replying to questions, with an accuracy rate of 88.9%, while ChatGPT 3.5 and Copy AI had accuracy rates of 54.4% and 86.7%, respectively[21]. In our study, ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini on all quality dimensions in answering questions related to report interpretation. We speculate that ChatGPT 3.5 and Gemini only consider the situation in which the pattern of change in the laboratory results matches exactly, while ChatGPT 4.0 also takes into account other circumstances that match changes in some of the indicators in the laboratory report and identifies several possible AIDs. Therefore, ChatGPT 4.0 may reduce the probability of misdiagnosing a given disease and provide safer and more helpful replies to patients or clinicians.

ChatGPT 3.5, ChatGPT 4.0, and Gemini also showed potential in diagnosing AIDs, which is challenging in clinical practice.
In our study, when answering the diagnosis-related questions, all three chatbots performed well in relevance, correctness, helpfulness, and safety, with scores greater than 4. ChatGPT 4.0 and Gemini outperformed ChatGPT 3.5 in terms of completeness. Similarly, ChatGPT effectively highlighted key immunopathological and histopathological characteristics of Sjögren's syndrome and identified potential etiological factors[18].

Our study also revealed the potential of LLMs to assist in the prevention and treatment of certain diseases. In our study, ChatGPT 3.5, ChatGPT 4.0, and Gemini showed good relevance, correctness, and safety in answering questions related to prevention and treatment, but ChatGPT 4.0 performed better than ChatGPT 3.5 and Gemini in terms of completeness and helpfulness. In an assessment of LLMs providing information on methotrexate administration to patients with rheumatoid arthritis, the accuracy of the outputs was 100% for ChatGPT 4.0, 86.96% for ChatGPT 3.5, and 60.87% each for BARD and Bing. In addition, ChatGPT 4.0 achieved a comprehensive output of 100%, followed by ChatGPT 3.5 at 86.96%, BARD at 60.86%, and Bing at 0%[17].

The performance of LLMs, even that of ChatGPT 4.0, in answering AIDs-related questions was still far from perfect, and much work remains before LLMs can be applied in clinical practice. First, the replies of these LLMs need to be more comprehensive. Continuous updating of the medical knowledge in the training dataset and improved algorithms will enable LLMs to give more comprehensive replies. Second, the accuracy of the LLMs' replies to AIDs-related questions needs to be further improved. Retrieval-augmented generation, which embeds customized data, can make LLMs more specific and reduce hallucinations.

Our study has some limitations. First, a man-machine control was not included in our study.
Having clinical specialists answer the same questions and comparing their responses with those of the LLMs would give a clear picture of the gap between the LLMs and clinical practice and provide direction for their subsequent improvement and service to the clinic. Second, the repeatability of the LLMs' replies, which is important in clinical practice, was not evaluated. In addition, we entered the AIDs-related questions in Chinese, and the performance of LLMs may be influenced by the input language[22]. Finally, the subjectivity of the graders when scoring the quality of the LLMs' replies may have affected the results to some extent.

Conclusions

In conclusion, LLMs were able to provide specific and safe replies to AIDs-related questions. The comparative analysis revealed that ChatGPT 4.0 outperformed both ChatGPT 3.5 and Gemini in delivering complete, accurate, and helpful information on AIDs-related care. The comprehensive performance of advanced LLMs on AIDs-related clinical questions suggests that they are capable of providing valuable help for the health of patients with AIDs and for the quality of clinical practice by rheumatologists.

Abbreviations

AI: Artificial intelligence
AIDs: Autoimmune diseases
API: Application platform interface
LLMs: Large language models
NLP: Natural language processing
RA: Rheumatoid arthritis
SLE: Systemic lupus erythematosus

Declarations

Acknowledgements

Not applicable.

Author contributions

Weiming Zhang: Formal analysis, Writing – original draft, Writing – review & editing.
Jie Yu: Writing – original draft, Writing – review & editing.
Juntao Ma: Data curation, Formal analysis, Validation.
Jiawei Feng: Writing – review & editing.
Linyu Geng: Writing – review & editing.
Yuxin Chen: Conceptualization, Methodology, Resources, Data curation, Supervision, Validation, Writing – review & editing, Funding acquisition.
Huayong Zhang: Conceptualization, Methodology, Supervision.
Mingzhe Ning: Conceptualization, Methodology, Data curation, Supervision, Validation, Writing – review & editing, Funding acquisition.

Funding

This work was supported by the National Key Research and Development Program [2023YFC2309100]; National Natural Science Foundation of China [92269118, 92269205, 92369117]; Scientific Research Project of Jiangsu Health Commission [M2022013]; Clinical Trials from the Affiliated Drum Tower Hospital, Medical School of Nanjing University [2021-LCYJ-PY-10]; Project of Chinese Hospital Reform and Development Institute, Nanjing University, Aid project of Nanjing Drum Tower Hospital Health, Education & Research Foundation [NDYG2022003].

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interest

The authors declare that they have no conflicts of interest.

References

1. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015;349:261–6.
2. De Angelis L, Baglivo F, Arzilli G, Privitera GP, Ferragina P, Tozzi AE, et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health. 2023;11:1166120.
3. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388:1233–9.
4. Sanderson K. GPT-4 is here: what scientists think. Nature. 2023;615:773.
5. Robinson MA, Belzberg M, Thakker S, Bibee K, Merkel E, MacFarlane DF, et al. Assessing the accuracy, usefulness, and readability of artificial-intelligence-generated responses to common dermatologic surgery questions for patient education: A double-blinded comparative study of ChatGPT and Google Bard. J Am Acad Dermatol. 2024;90:1078–80.
6. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198.
7. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40.
8. Watad A, Bragazzi NL, Adawi M, Amital H, Toubi E, Porat B-S, et al. Autoimmunity in the Elderly: Insights from Basic Science and Clinics - A Mini-Review. Gerontology. 2017;63:515–23.
9. Dumas G, Arabi YM, Bartz R, Ranzani O, Scheibe F, Darmon M, et al. Diagnosis and management of autoimmune diseases in the ICU. Intensive Care Med. 2024;50:17–35.
10. Wang L, Wang F-S, Gershwin ME. Human autoimmune diseases: a comprehensive update. J Intern Med. 2015;278:369–95.
11. Davidson A, Diamond B. Autoimmune diseases. N Engl J Med. 2001;345:340–50.
12. Janssen NM, Karnad DR, Guntupalli KK. Rheumatologic diseases in the intensive care unit: epidemiology, clinical approach, management, and outcome. Crit Care Clin. 2002;18:729–48.
13. Larcher R, Pineton de Chambrun M, Garnier F, Rubenstein E, Carr J, Charbit J, et al. One-Year Outcome of Critically Ill Patients With Systemic Rheumatic Disease: A Multicenter Cohort Study. Chest. 2020;158:1017–26.
14. Dumas G, Géri G, Montlahuc C, Chemam S, Dangers L, Pichereau C, et al. Outcomes in critically ill patients with systemic rheumatic disease: a multicenter study. Chest. 2015;148:927–35.
15. Altunisik E. Artificial intelligence and multiple sclerosis: ChatGPT model. Mult Scler Relat Disord. 2023;76:104851.
16. Huang C, Hong D, Chen L, Chen X. Assess the precision of ChatGPT's responses regarding systemic lupus erythematosus (SLE) inquiries. Skin Res Technol. 2023;29:e13500.
17. Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int. 2024;44:509–15.
18. Irfan B, Yaqoob A. ChatGPT's Epoch in Rheumatological Diagnostics: A Critical Assessment in the Context of Sjögren's Syndrome. Cureus. 2023;15:e47754.
19. Chen C-W, Walter P, Wei JC-C. Using ChatGPT-Like Solutions to Bridge the Communication Gap Between Patients With Rheumatoid Arthritis and Health Care Professionals. JMIR Med Educ. 2024;10:e48989.
20. Cadamuro J, Cabitza F, Debeljak Z, De Bruyne S, Frans G, Perez SM, et al. Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI). Clin Chem Lab Med. 2023;61:1158–66.
21. Abusoglu S, Serdar M, Unlu A, Abusoglu G. Comparison of three chatbots as an assistant for problem-solving in clinical laboratory. Clin Chem Lab Med. 2024;62:1362–6.
22. Zaitsu W, Jin M. Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. PLoS One. 2023;18:e0288453.
23. Cakir H, Caglar U, Sekkeli S, Zerdali E, Sarilar O, Yildiz O, et al. Evaluating ChatGPT ability to answer urinary tract Infection-Related questions. Infect Dis Now. 2024;54:104884.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4810651","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":345421125,"identity":"79f3d2c6-7853-4781-a9fe-b8ac78213820","order_by":0,"name":"Weiming Zhang","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Weiming","middleName":"","lastName":"Zhang","suffix":""},{"id":345421126,"identity":"71599de0-b19f-4941-a77e-aacefe2807ac","order_by":1,"name":"Jie Yu","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Jie","middleName":"","lastName":"Yu","suffix":""},{"id":345421127,"identity":"28e571df-5b75-444e-8168-df09f133d36f","order_by":2,"name":"Juntao Ma","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Juntao","middleName":"","lastName":"Ma","suffix":""},{"id":345421128,"identity":"64282f84-2d95-43d6-82a9-b26cd2d62f20","order_by":3,"name":"Jiawei Feng","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Jiawei","middleName":"","lastName":"Feng","suffix":""},{"id":345421129,"identity":"6e51226a-89d9-4fae-99ba-4aefc7026417","order_by":4,"name":"Linyu Geng","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing 
University","correspondingAuthor":false,"prefix":"","firstName":"Linyu","middleName":"","lastName":"Geng","suffix":""},{"id":345421130,"identity":"a6af7fd8-ec72-4eb4-994c-ef9f92cf1fda","order_by":5,"name":"Yuxin Chen","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Yuxin","middleName":"","lastName":"Chen","suffix":""},{"id":345421131,"identity":"71a1cb7b-bf81-47fa-b70e-768e91e99b63","order_by":6,"name":"Huayong Zhang","email":"","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Huayong","middleName":"","lastName":"Zhang","suffix":""},{"id":345421132,"identity":"6d1a4e6c-642d-44cc-bf3c-793978d0e6c7","order_by":7,"name":"Mingzhe Ning","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAvklEQVRIiWNgGAWjYDACZjBpw8DYwNxAkpY0oBZGYrVAwGEgJlaLOTvv4Vc3/pzPY+5f2CbNw2Anp0tIp2UzX5p1Ds/tYsYZD0Fako3NDhDQYnCYx8w4R+J2YuOMgyAtBxK3EafF4BxpWowf5yQcSGzsbyTBFuacA8lAWxibLecYEOOX82eMP+f8sUvc2H/44I03FXZyBLUAAZsEiDSckcAixWNAWDkIMH8AkfL8B5g//iBOxygYBaNgFIwwAAAp00RFVDZ9WwAAAABJRU5ErkJggg==","orcid":"","institution":"Nanjing Drum Tower Hospital, Nanjing University","correspondingAuthor":true,"prefix":"","firstName":"Mingzhe","middleName":"","lastName":"Ning","suffix":""}],"badges":[],"createdAt":"2024-07-27 02:12:15","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4810651/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4810651/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":63363471,"identity":"a6d992cd-7413-44a6-96a0-a0bf63d3eb7d","added_by":"auto","created_at":"2024-08-27 10:42:42","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":115859,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFlowchart of overall 
study design\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4810651/v1/1df4bda600985033c17b353a.png"},{"id":63363469,"identity":"7a93d8db-926d-4832-8d9a-8fae2c16cebd","added_by":"auto","created_at":"2024-08-27 10:42:42","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":62237,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eOverall performance comparison of ChatGPT3.5, ChatGPT4.0, and Gemini\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4810651/v1/61edae78ba910314622b050e.png"},{"id":63363470,"identity":"5c2eea2d-07da-435c-aa9e-f6fe2a21e671","added_by":"auto","created_at":"2024-08-27 10:42:42","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":135332,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eComparative performance scores of ChatGPT3.5, ChatGPT4.0, and Gemini on various metrics\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-4810651/v1/69e46ff4a4b0ce321ff28e33.png"},{"id":63363472,"identity":"2d85631b-e01c-4262-8248-526818dc37c3","added_by":"auto","created_at":"2024-08-27 10:42:42","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":257098,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePerformance comparison of ChatGPT3.5, ChatGPT4.0, and Gemini across multiple domains and dimensions\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4810651/v1/d53522f405a8073bf9636efc.png"},{"id":64565961,"identity":"4b98eb49-146b-4df0-bced-6c14f7c285b9","added_by":"auto","created_at":"2024-09-15 
22:01:32","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1063950,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4810651/v1/dc46b549-218f-4fe8-881d-36e133800fdc.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A Comparative Analysis of Large language Models on Clinical Questions for Autoimmune Diseases","fulltext":[{"header":"Introduction","content":"\u003cp\u003eArtificial intelligence covers a broad field of computer science and employs computational techniques to learn, understand, and produce human language content[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Contemporary natural language processing (NLP) models, particularly large language models (LLMs), which are trained on an extensive pool of textual data derived from articles, books, and the internet, have progressed to generate more human-like responses[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Application platform interfaces (APIs) empowered by LLMs, such as ChatGPT (OpenAI) and Gemini (Google), have garnered significant interest for their near-human-level or equal-to-human-level performance in cognitive tasks in diverse fields[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eRecently, ChatGPT attained passing-level performance on the United States Medical Licensing Examination[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], which facilitates the application of LLMs in the medical field. The performance of LLMs has since been evaluated in several medical fields. 
In clinical practice, LLMs may play a role in improving patient care and simplifying the workflow of physicians, and in medical education, on-demand interactive teaching by LLM-powered chatbots may facilitate the understanding of medical theories[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAutoimmune diseases (AIDs) are a spectrum of conditions caused by the breakdown of immunological self-tolerance and the attack of T cells and B cells on normal constituents of the host. The prevalence of AIDs is estimated at 5%-10% of the general population, and women are more susceptible than men[\u003cspan additionalcitationids=\"CR9\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. These diseases include systemic lupus erythematosus (SLE), systemic scleroderma, rheumatoid arthritis (RA), Sj\u0026ouml;gren\u0026rsquo;s syndrome, polyarteritis nodosa, and giant cell vasculitis[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. The clinical manifestations of AIDs vary and can be organ-specific or systemic and non-organ-specific[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Therefore, the diagnosis of AIDs is a major challenge for clinicians. 
Currently, there are still 10% of AIDs patients suffering from acute disease due to the disease flare-ups, infections, and acute organ failures[\u003cspan additionalcitationids=\"CR13\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe utilization of LLMs is being investigated for various applications in autoimmune diseases, including answering frequently asked questions, aiding in medication for patients, and potentially assisting in diagnosing these complex conditions[\u003cspan additionalcitationids=\"CR16 CR17 CR18\" citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. However, the performance of LLMs in other areas of AIDs such as prevention and prognosis is unclear at present, and other quality dimensions including relevance, helpfulness, and safety need to be considered when evaluating the performance of LLMs in AIDs.\u003c/p\u003e \u003cp\u003eIn this study, we presented 46 questions related to AIDs to Chatbots including ChatGPT 4.0, ChatGPT 3.5, and Gemini to evaluate the performance of those chatbots to provide useful, correct, and comprehensive information, in aspects of the concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis. We further evaluated the response generated by chatbots through accuracy, comprehensiveness, relevance, helpfulness, and safety. 
Our findings highlight the great potential of ChatGPT in delivering comprehensive and accurate responses to AIDs clinical questions.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy design\u003c/h2\u003e \u003cp\u003eThe overall design of the study, which was conducted from April 1st, 2024 to May 1st, 2024 at Nanjing Drum Tower Hospital, is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. Since the present study did not involve patient records or human specimens, ethics committee approval was not required. A set of 46 AIDs-related questions was prepared collaboratively by two laboratory specialists from the laboratory medicine department of Nanjing Drum Tower Hospital. These questions were compiled with reference to the relevant literature and refined according to the clinical settings of AIDs. They were further classified into six fields: concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis. Before the prepared questions were entered, the chatbots were asked to act as experienced clinicians working in a large tertiary hospital in China and to respond in that role. Each question was entered in a new chat session to avoid potential influence from previous queries. The replies of ChatGPT 3.5 (OpenAI), ChatGPT 4.0 (OpenAI), and Gemini (Google) to those questions were sent to three rheumatologists independently for further scoring.\u003c/p\u003e \u003cp\u003eThe relevance, completeness, correctness, helpfulness, and safety of the chatbots\u0026rsquo; responses were evaluated by three experienced laboratory specialists using a five-point scale. Correctness refers to the scientific and technical accuracy of LLMs' replies according to the best available medical evidence. 
Completeness measures how fully the replies of LLMs cover the evidence-based information available for the question. Relevance evaluates whether the replies specifically address the corresponding question rather than unrelated cases. Helpfulness refers to whether the responses offer appropriate suggestions, deliver pertinent and accurate information, enhance patient comprehension of test results, and primarily recommend actions that benefit the patient and optimize healthcare services usage. Safety considers any additional information that may adversely affect the health condition of the patients[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eStatistical analyses were conducted using SPSS (version 22, IBM Inc., Armonk, New York, USA). The total scores of the three chatbots on the five quality dimensions (relevance, completeness, correctness, helpfulness, and safety) and the scores of the three models on different types of questions were compared using one-way ANOVA or non-parametric Wilcoxon tests. A p-value of less than 0.05 was considered statistically significant.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eThe length of the responses generated by the three chatbots\u003c/h2\u003e \u003cp\u003eThe numbers of words and characters in the replies generated by ChatGPT 3.5, ChatGPT 4.0, and Gemini were counted. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e presents the length of the replies from ChatGPT 3.5, ChatGPT 4.0, and Gemini to AIDs-related questions. 
The mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation (SD) of the word count for ChatGPT 3.5, ChatGPT 4.0, and Gemini was 209.11\u0026thinsp;\u0026plusmn;\u0026thinsp;97.25, 221.98\u0026thinsp;\u0026plusmn;\u0026thinsp;89.07, and 180.59\u0026thinsp;\u0026plusmn;\u0026thinsp;55.92, respectively. The mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD of the character count was 1289.59\u0026thinsp;\u0026plusmn;\u0026thinsp;604.79 for ChatGPT 3.5, 1349.57\u0026thinsp;\u0026plusmn;\u0026thinsp;537.31 for ChatGPT 4.0, and 1025.46\u0026thinsp;\u0026plusmn;\u0026thinsp;317.70 for Gemini.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eThe response length from ChatGPT 3.5, ChatGPT 4.0, and Gemini to AIDs related questions\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLLMs\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003eResponse length(words)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" 
nameend=\"c7\" namest=\"c5\"\u003e \u003cp\u003eResponse length(characters)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMean (SD)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMinimum\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMaximum\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMean (SD)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eMinimum\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMaximum\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChatGPT3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e209.11(97.25)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e58\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e377\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1289.59(604.79)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e384\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e2446\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChatGPT4.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e221.98(89.07)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e59\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e416\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003e1349.57(537.31)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e316\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e2495\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e180.59(55.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e378\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1025.46(317.70)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e446\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e2051\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eThe average score of ChatGPT 3.5, ChatGPT 4.0, and Gemini on the five quality dimensions\u003c/h2\u003e \u003cp\u003eThe average score of ChatGPT 4.0 across relevance, completeness, correctness, helpfulness, and safety was significantly higher than that of ChatGPT 3.5. The average score of ChatGPT 4.0 on the five quality dimensions was also higher than that of Gemini, although the difference was not statistically significant. 
There was no statistically significant difference between the scores of ChatGPT 3.5 and Gemini on the five quality dimensions (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eThe score of ChatGPT 3.5, ChatGPT 4.0, and Gemini\u0026rsquo;s responses in different quality dimensions\u003c/h2\u003e \u003cp\u003eWe compared the scores of different quality dimensions from the three chatbots. ChatGPT 4.0 scored 4.38\u0026thinsp;\u0026plusmn;\u0026thinsp;0.53, 4.07\u0026thinsp;\u0026plusmn;\u0026thinsp;0.53, 4.28\u0026thinsp;\u0026plusmn;\u0026thinsp;0.66, 4.29\u0026thinsp;\u0026plusmn;\u0026thinsp;0.64, and 4.70\u0026thinsp;\u0026plusmn;\u0026thinsp;0.49 for relevance, completeness, correctness, helpfulness, and safety, respectively. ChatGPT 3.5 and Gemini performed reasonably well in relevance and safety. For relevance, ChatGPT 3.5 scored 4.16\u0026thinsp;\u0026plusmn;\u0026thinsp;0.59, while Gemini scored 4.17\u0026thinsp;\u0026plusmn;\u0026thinsp;0.92. For safety, ChatGPT 3.5 scored 3.93\u0026thinsp;\u0026plusmn;\u0026thinsp;0.75 and Gemini scored higher at 4.33\u0026thinsp;\u0026plusmn;\u0026thinsp;0.93. However, ChatGPT 3.5 and Gemini were average in completeness, correctness, and helpfulness. Specifically, for completeness, ChatGPT 3.5 scored 3.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.70, while Gemini achieved 3.74\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01. For correctness, their scores were nearly identical, with ChatGPT 3.5 scoring 3.72\u0026thinsp;\u0026plusmn;\u0026thinsp;0.87 and Gemini scoring 3.73\u0026thinsp;\u0026plusmn;\u0026thinsp;0.76. 
For helpfulness, ChatGPT 3.5 garnered 3.75\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01, while Gemini's score was 3.73\u0026thinsp;\u0026plusmn;\u0026thinsp;0.97, indicating a similar level of assistance provided by both systems \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eScores of the three chatbots on the different quality dimensions of AIDs related questions\u003c/h2\u003e \u003cp\u003eThe mean score of the replies of ChatGPT 3.5, ChatGPT 4.0, and Gemini to the six areas of AIDs related questions in the five quality dimensions is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA. ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini in answering questions related to report interpretation, prevention and treatment, and prognosis. The score of responses of ChatGPT 3.5, ChatGPT 4.0, and Gemini to questions related to the concept, clinical feature, report interpretation, diagnosis, prevention and treatment, and prognosis under the five quality dimensions were further compared. ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini on all quality dimensions in answering questions related to report interpretation (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB-F). ChatGPT 4.0 scored higher than ChatGPT 3.5 on completeness, correctness, and safety in answering questions related to prevention and treatment (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eD-F) and scored higher than ChatGPT 3.5 on relevance in answering questions related to prognosis. ChatGPT 4.0 and Gemini scored higher than ChatGPT 3.5 on completeness in answering questions related to diagnosis. 
For helpfulness, ChatGPT 4.0 scored higher than ChatGPT 3.5 in all areas of questions except diagnosis, and scored higher than Gemini in answering questions related to clinical features and report interpretation. Gemini scored higher than ChatGPT 3.5 on helpfulness in answering questions related to clinical features.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe present study explored the potential for the application of LLMs in AIDs. A total of 46 questions related to the concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis of AIDs were entered into ChatGPT 3.5, ChatGPT 4.0, and Gemini independently, and the replies generated by the three chatbots were collected and evaluated independently by experienced laboratory specialists on five quality dimensions: relevance, completeness, correctness, helpfulness, and safety.\u003c/p\u003e \u003cp\u003eOur study demonstrated that ChatGPT 3.5 and Gemini can provide limited help in healthcare, while, with the advancement of LLMs, ChatGPT 4.0 might be applied to the clinical practice of AIDs. Specifically, ChatGPT 4.0 performed best and provided replies to AIDs-related questions with good relevance, correctness, completeness, helpfulness, and safety, and its replies were also the longest. ChatGPT 3.5 and Gemini can provide relevant and safe responses to questions related to AIDs while performing moderately in completeness, correctness, and helpfulness. Indeed, compared to ChatGPT 3.5, ChatGPT 4.0 has improved semantic understanding capability and can process longer conversational contexts, which enables it to generate more correct and helpful responses. Consistent with our findings, the safety of ChatGPT 4.0\u0026rsquo;s responses has also been reported to be improved[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. 
These improvements in performance or algorithmic differences from other chatbots may lead to the differences in replies of each chatbot.\u003c/p\u003e \u003cp\u003eOverall, our data showed that ChatGPT 3.5, ChatGPT 4.0, and Gemini performed well on relevance, correctness, and safety in answering conceptual questions. Nevertheless, ChatGPT 3.5 had a less satisfactory performance for completeness and helpfulness in answering conceptual questions compared to ChatGPT 4.0. For instance, when responding to the inquiry \"What is an autoimmune disease?\", ChatGPT 4.0 goes beyond the mere definition of AIDs provided by ChatGPT 3.5. It delves deeper into the intricacies of the condition, providing a detailed breakdown of the characteristics that are unique to each type of autoimmune disease. Thus, the replies of ChatGPT 4.0 were more comprehensive and helpful than ChatGPT 3.5. Consistent with our results, using ChatGPT to answer frequently asked questions in urinary tract infection, 92.6% of questions were correctly and adequately answered by ChatGPT[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. ChatGPT 3.5 turbo responses also showed a less accurate response for SLE related clinical questions[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eInterpretation of the laboratory reports may require strong semantic comprehension, logical reasoning, and a combination of the results of each test to better interpret the reports. Indeed, as the number of parameters increases, ChatGPT 4.0 is significantly better than its predecessor ChatGPT 3.5 in semantic understanding and logical reasoning[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. 
When solving clinical laboratory problems, ChatGPT 4.0 showed considerable performance in identifying the cases and replying to questions, with an accuracy rate of 88.9%, while ChatGPT 3.5 and Copy AI had accuracy rates of 54.4% and 86.7%, respectively[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. In our study, ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini on all quality dimensions in answering questions related to report interpretation. We speculate that ChatGPT 3.5 and Gemini only consider the situation in which the pattern of change in the laboratory results matches exactly, while ChatGPT 4.0 also takes into account other circumstances that match changes in some of the indicators in the laboratory report and identifies several possible AIDs. Therefore, ChatGPT 4.0 may reduce the probability of misdiagnosis for a certain disease and provide safer and more helpful replies to patients or clinicians.\u003c/p\u003e \u003cp\u003eChatGPT 3.5, ChatGPT 4.0, and Gemini also showed potential in diagnosing AIDs, which is challenging in clinical practice. In our study, when answering the diagnosis-related questions, all three chatbots performed well in relevance, correctness, helpfulness, and safety, with scores greater than 4. ChatGPT 4.0 and Gemini outperformed ChatGPT 3.5 in terms of completeness. Similarly, ChatGPT effectively highlighted key immunopathological and histopathological characteristics of Sj\u0026ouml;gren's Syndrome and identified potential etiological factors[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOur study also revealed the potential of LLMs to assist in the prevention and treatment of certain diseases. 
In our study, ChatGPT 3.5, ChatGPT 4.0, and Gemini showed good relevance, correctness, and safety in answering questions related to prevention and treatment, but ChatGPT 4.0 performed better than ChatGPT 3.5 and Gemini in terms of completeness and helpfulness. 
In a study assessing the role of LLMs in providing information on methotrexate administration to patients with rheumatoid arthritis, the accuracy of the outputs of ChatGPT 4.0 achieved a score of 100%, ChatGPT 3.5 secured 86.96%, and BARD and Bing each scored 60.87%. In terms of comprehensiveness, ChatGPT 4.0 achieved 100%, followed by ChatGPT 3.5 at 86.96%, BARD at 60.86%, and Bing at 0%[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe performance of LLMs, including ChatGPT 4.0, in answering questions related to AIDs was still far from perfect. Much work remains before LLMs can be applied in clinical practice. First, the replies of those LLMs need to be more comprehensive. The continuous updating of medical knowledge in the training dataset and improved algorithms will enable LLMs to give more comprehensive replies. Second, the accuracy of the LLMs\u0026rsquo; replies to AIDs-related questions also needs to be further improved. Retrieval-augmented generation, which grounds the model in embedded customized data, can make LLMs more specific and reduce hallucinations.\u003c/p\u003e \u003cp\u003eOur study has some limitations. First, a human-machine comparison was not included in our study. Allowing clinical specialists to also respond to the questions and comparing their responses with those of the LLMs would give a clear picture of the gap between the LLMs and clinical practice, and provide direction for their subsequent improvement and service to the clinic. Second, the repeatability of the LLMs\u0026rsquo; replies, which is important in clinical practice, was not evaluated. Third, we entered the AIDs-related questions in Chinese; however, the performance of LLMs may be influenced by the language entered[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. 
Finally, the subjectivity of graders when scoring the quality of the replies of LLMs may affect the results to some extent.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eIn conclusion, LLMs were able to provide specific and safe replies to AIDs-related questions. The comparative analysis revealed that ChatGPT 4.0 outperformed both ChatGPT 3.5 and Gemini in delivering complete, accurate, and helpful information regarding AIDs-related care. The overall performance of advanced LLMs on AIDs-related clinical questions suggests that they are capable of providing valuable help for the health of patients with AIDs and the quality of clinical practice by rheumatologists.\u003c/p\u003e "},{"header":"Abbreviations","content":"\u003cp\u003eAI\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;\u0026nbsp; \u0026nbsp; \u0026nbsp;Artificial intelligence\u003c/p\u003e\n\u003cp\u003eAIDs\u0026nbsp; \u0026nbsp; \u0026nbsp;Autoimmune diseases\u003c/p\u003e\n\u003cp\u003eAPI\u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;\u0026nbsp; \u0026nbsp; Application platform interface\u003c/p\u003e\n\u003cp\u003eLLMs\u0026nbsp; \u0026nbsp;Large language models\u003c/p\u003e\n\u003cp\u003eNLP\u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp; \u0026nbsp; Natural language processing\u003c/p\u003e\n\u003cp\u003eRA\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp; \u0026nbsp; \u0026nbsp;Rheumatoid arthritis\u003c/p\u003e\n\u003cp\u003eSLE \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; Systemic lupus erythematosus\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWeiming Zhang: Formal analysis, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing. 
Jie Yu: Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing. Juntao Ma: Data curation, Formal analysis, Validation. Jiawei Feng: Writing - review \u0026amp; editing. Linyu Geng: Writing - review \u0026amp; editing. Yuxin Chen: Conceptualization, Methodology, Resources, Data curation, Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing, Funding acquisition. Huayong Zhang: Conceptualization, Methodology, Supervision. Mingzhe Ning: Conceptualization, Methodology, Data curation, Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing, Funding acquisition.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFundings\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by the National key research and development program [2023YFC2309100]; National Natural Science Foundation of China [92269118,92269205,92369117]; Scientific Research Project of Jiangsu Health Commission [M2022013]; Clinical Trials from the Affiliated Drum Tower Hospital, Medical School of Nanjing University [2021-LCYJ-PY-10]; Project of Chinese Hospital Reform and Development Institute, Nanjing University, Aid project of Nanjing Drum Tower Hospital Health, Education \u0026amp;Research Foundation [NDYG2022003].\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no conflicts of 
interest.

References

1. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015;349:261–6.
2. De Angelis L, Baglivo F, Arzilli G, Privitera GP, Ferragina P, Tozzi AE, et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health. 2023;11:1166120.
3. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388:1233–9.
4. Sanderson K. GPT-4 is here: what scientists think. Nature. 2023;615:773.
5. Robinson MA, Belzberg M, Thakker S, Bibee K, Merkel E, MacFarlane DF, et al. Assessing the accuracy, usefulness, and readability of artificial-intelligence-generated responses to common dermatologic surgery questions for patient education: A double-blinded comparative study of ChatGPT and Google Bard. J Am Acad Dermatol. 2024;90:1078–80.
6. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198.
7. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40.
8. Watad A, Bragazzi NL, Adawi M, Amital H, Toubi E, Porat B-S, et al. Autoimmunity in the Elderly: Insights from Basic Science and Clinics - A Mini-Review. Gerontology. 2017;63:515–23.
9. Dumas G, Arabi YM, Bartz R, Ranzani O, Scheibe F, Darmon M, et al. Diagnosis and management of autoimmune diseases in the ICU. Intensive Care Med. 2024;50:17–35.
10. Wang L, Wang F-S, Gershwin ME. Human autoimmune diseases: a comprehensive update. J Intern Med. 2015;278:369–95.
11. Davidson A, Diamond B. Autoimmune diseases. N Engl J Med. 2001;345:340–50.
12. Janssen NM, Karnad DR, Guntupalli KK. Rheumatologic diseases in the intensive care unit: epidemiology, clinical approach, management, and outcome. Crit Care Clin. 2002;18:729–48.
13. Larcher R, Pineton de Chambrun M, Garnier F, Rubenstein E, Carr J, Charbit J, et al. One-Year Outcome of Critically Ill Patients With Systemic Rheumatic Disease: A Multicenter Cohort Study. Chest. 2020;158:1017–26.
14. Dumas G, Géri G, Montlahuc C, Chemam S, Dangers L, Pichereau C, et al. Outcomes in critically ill patients with systemic rheumatic disease: a multicenter study. Chest. 2015;148:927–35.
15. Altunisik E. Artificial intelligence and multiple sclerosis: ChatGPT model. Mult Scler Relat Disord. 2023;76:104851.
16. Huang C, Hong D, Chen L, Chen X. Assess the precision of ChatGPT’s responses regarding systemic lupus erythematosus (SLE) inquiries. Skin Res Technol. 2023;29:e13500.
17. Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int. 2024;44:509–15.
18. Irfan B, Yaqoob A. ChatGPT’s Epoch in Rheumatological Diagnostics: A Critical Assessment in the Context of Sjögren’s Syndrome. Cureus. 2023;15:e47754.
19. Chen C-W, Walter P, Wei JC-C. Using ChatGPT-Like Solutions to Bridge the Communication Gap Between Patients With Rheumatoid Arthritis and Health Care Professionals. JMIR Med Educ. 2024;10:e48989.
20. Cadamuro J, Cabitza F, Debeljak Z, De Bruyne S, Frans G, Perez SM, et al. Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI). Clin Chem Lab Med. 2023;61:1158–66.
21. Abusoglu S, Serdar M, Unlu A, Abusoglu G. Comparison of three chatbots as an assistant for problem-solving in clinical laboratory. Clin Chem Lab Med. 2024;62:1362–6.
22. Zaitsu W, Jin M. Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. PLoS One. 2023;18:e0288453.
23. Cakir H, Caglar U, Sekkeli S, Zerdali E, Sarilar O, Yildiz O, et al. Evaluating ChatGPT ability to answer urinary tract Infection-Related questions. Infect Dis Now. 2024;54:104884.

Keywords: Large Language Models, Autoimmune Diseases, ChatGPT 4.0

Posted: August 27th, 2024 (version 1)

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-23T02:00:01.238055+00:00

License: CC-BY-4.0