Comparative Evaluation of Viral Hepatitis Question Responses: ChatGPT-4.5 Outperforms Three Established Models

2025 · doi:10.21203/rs.3.rs-6935611/v1

preprint OA: closed

Research Article

Juntao Ma, Linyan Gong, Yuchen Song, Guiyang Wang, Juan Xia, Yun Liu, and 2 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-6935611/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 26 Nov, 2025 in BMC Medical Informatics and Decision Making. Version 1.

Abstract

Background

Viral hepatitis is an important global public health problem that affects millions of people, and accurate information is needed to help the public understand the disease correctly.
This study evaluated four large language models (LLMs), Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5 and ChatGPT-4, and compared their responses to questions related to viral hepatitis to determine whether ChatGPT-4.5 performed better than the other three models in this field.

Methods

This comparative evaluation study, conducted at Nanjing Drum Tower Hospital from March to April 2025, examined 52 questions pertaining to viral hepatitis. The four LLMs were assessed on their responses to these 52 questions, which encompassed four domains: concepts, risk factors, diagnosis, and prevention and treatment. Initial evaluation used a three-point scale of good, borderline, and poor. Further evaluation criteria included relevance, comprehensiveness, accuracy, safety, and readability, with each response scored on a scale of 1 to 5.

Results

ChatGPT-4.5 achieved the highest performance, with 89.1% of its responses rated as good, significantly outperforming Claude-3.5-sonnet (71.15% good), Gemini-2.0 (62.82% good), and ChatGPT-4 (50.64% good). Statistical analysis confirmed that ChatGPT-4.5 scored the highest across all five criteria: relevance, comprehensiveness, accuracy, safety, and readability.

Conclusions

ChatGPT-4.5 demonstrates superior performance in addressing viral hepatitis queries compared to the other three models. Its high reliability makes it a valuable tool for healthcare professionals and patients by improving information accessibility.

Keywords: Viral Hepatitis, Large language model, ChatGPT, Comparative Study, Infectious disease

Introduction

Large language models (LLMs) utilize advanced artificial intelligence (AI) algorithms to generate language that closely mimics human communication. Trained on diverse datasets, these models excel in tasks ranging from translation to summarization and question answering( 1 ).
During training, these models use self-supervised learning to predict context within the text, gradually learning patterns and relationships between words and sentences. Additionally, with improvements in large-scale computing and optimization( 2 ), LLMs like Gemini-2.0 (released on February 5, 2025), Claude-3.5-sonnet (released on October 22, 2024), ChatGPT-4.5 (released on February 27, 2025), and ChatGPT-4 (released on March 14, 2023) have significantly advanced natural language processing (NLP), enabling machines to understand and generate human-like text( 3 – 6 ).

Viral hepatitis, a significant global health issue, includes a range of liver diseases caused by different viruses, primarily classified as hepatitis A, B, C, D, and E. Although much progress has been made in the prevention and treatment of viral hepatitis( 7 – 9 ), it remains a global public health concern that affects millions of people and causes thousands of deaths due to acute and chronic infection, cirrhosis, and liver cancer( 10 , 11 ). In 2020, viral hepatitis B- and C-related diseases led to 1.1 million deaths, similar to the number caused by tuberculosis (1.3 million deaths) and significantly higher than the number caused by HIV (0.68 million deaths) or malaria (0.627 million deaths)( 12 ). In May 2022, the WHO developed a plan to eliminate viral hepatitis. This 2022–2030 action plan proposes to eliminate viral hepatitis as a public health threat by 2030, with new infections of chronic hepatitis B (CHB) and chronic hepatitis C (CHC) decreasing by 90%, mortality decreasing by 65%, and the cure rate reaching 80%~90%. Measured against this plan, the current situation in China is not optimistic( 13 ).
Accurate and accessible information is crucial for healthcare professionals and the general public to understand, prevent, and manage these conditions, and effective communication and dissemination of information about viral hepatitis are critical in combating this public health threat. Our research explores how LLM technology can enhance the accuracy, accessibility, and timeliness of information related to viral hepatitis, benefiting both the public and healthcare providers. This supports WHO's 2030 targets for reducing viral hepatitis incidence and mortality by enhancing prevention, diagnosis, and treatment, as well as raising public awareness( 13 ). LLMs offer a promising solution by delivering reliable, comprehensive, and easy-to-understand information. It is worth noting, however, that LLMs are trained on human-generated documents and that, even with proper curation, biases persist.

In November 2022, OpenAI introduced an updated LLM called ChatGPT, which attracted significant attention for its accessibility, ease of use, and human-like outputs( 14 , 15 ). Since its release, numerous other LLMs and tools have been developed at an unprecedented pace( 16 , 17 ), including Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4( 18 ). These LLMs have shown potential in various applications, including medical information retrieval, patient education, and clinical decision support( 3 , 19 , 20 ), and in answering health-related queries more broadly( 16 , 17 ). For instance, ChatGPT has been shown to generate accurate and satisfactory answers for both public-based inquiries and EAU urological infection guideline-based questions( 21 ). Similarly, ChatGPT-4 has the potential to deliver precise and comprehensive responses to myopia-related queries( 22 ).
There is currently limited research comparing the performance of different LLMs on specific medical topics such as viral hepatitis. This study seeks to fill this gap by assessing the effectiveness of four prominent models (Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5 and ChatGPT-4) in providing accurate and relevant medical information. By comparing the performance of these LLMs, this research aims to highlight the strengths and limitations of each model in addressing viral hepatitis-related queries.

Methods

Study design

A comprehensive set of 52 questions related to viral hepatitis was curated, covering concepts, risk factors, diagnosis, and prevention and treatment (Fig. 1 ). These questions were designed by three infectious disease experts. Some were adapted from patient case scenarios, while the others were high-frequency issues the experts encounter in clinical practice. A portion of the questions is presented in Table 1 . The profiles of the three experts are provided in the S1 Table. Subsequently, we reviewed and revised the questions to ensure they adequately tested the depth and breadth of the LLMs' knowledge. The questions aimed to reflect real-world queries that healthcare professionals and patients might have, thereby providing a robust framework for evaluation.

Table 1. A portion of the questions designed by three infectious disease experts.

Topic: Concept
  1. What is viral hepatitis?
  7. What is chronic hepatitis B?
Topic: Risk Factors
  9. Under what conditions is one more likely to contract hepatitis A?
  14. Is hepatitis B hereditary? My mother has hepatitis B; what should I be aware of? Should I undergo further testing?
Topic: Diagnosis
  16. What laboratory tests are needed to help doctors diagnose viral hepatitis?
  17. During a routine physical examination, is it necessary to perform a full hepatitis panel?
  18. Under what circumstances should co-infection with hepatitis D be considered?
Topic: Prevention and Treatment
  26. If someone in the family has viral hepatitis, how should others prevent it?
  29. If I have viral hepatitis, what dietary precautions should I take?

The models evaluated in this study include Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5 and ChatGPT-4. Gemini-2.0, developed by Google, is noted for its versatility and application in various NLP tasks, leveraging Google's extensive resources and advanced machine learning algorithms to deliver high-quality responses. Claude-3.5-sonnet, developed by Anthropic, is designed to offer improved accuracy, comprehensiveness, and contextual understanding. ChatGPT-4.5, the new model released by OpenAI, boasts faster speeds and superior performance. ChatGPT-4, also developed by OpenAI, is known for its strong baseline performance in generating coherent and contextually relevant text. Before engaging in conversation with the LLMs, we first input a prompt: "Answer my questions as an infectious disease doctor." Each question was asked in a separate dialogue, and the entire Q&A process was conducted in English.

The evaluation focuses on their relevance (The degree of relevance between the model response and the user's question. A highly relevant response should directly address the user's question and provide relevant information and answers.), comprehensiveness (Involves the breadth and depth of the model's response. A comprehensive answer should cover all aspects of the user's question, provide detailed information, and provide background information and context when necessary.), accuracy (The accuracy of the information in the model's response. Accurate answers should be based on facts and avoid misleading users.), safety (This involves whether the model's responses are safe and do not contain harmful content. Safety is one of the standards that LLMs must strictly control to ensure that the model's output does not cause harm to society or individuals.), and readability (The clarity and comprehensibility of the model's responses.
Highly readable responses should use concise language, have clear logic, and be easy for users to understand.). The scoring scheme in this study is based on previously published studies( 22 ), and the evaluation scheme for our project was determined through joint discussion among all participants.

Responses were initially rated as good, borderline, or poor by the panel of three infectious disease experts who designed the questions. All evaluators had passed China's College English Test Band 6 (CET-6), and the use of translation software was permitted throughout the question design and scoring process. This rating was based on the overall quality of the response, considering factors such as clarity, relevance, and accuracy. After the first evaluation, there was a two-day interval to minimize the influence of prior assessments on the experts' judgments. Following the initial rating, a different panel of three infectious disease experts assessed the responses on five dimensions: relevance, comprehensiveness, accuracy, safety, and readability. Each dimension was scored on a scale of 1 to 5 by each of the three experts.

Statistical analysis

Statistical analyses were conducted using GraphPad Prism (version 10.1.2, GraphPad Software, San Diego, California USA, www.graphpad.com ) and SPSS (version 27.0, IBM Corp., Armonk, New York, USA, www.ibm.com/spss ). We used Fleiss' kappa in SPSS to conduct an inter-rater reliability (IRR) analysis on the scoring data. We then used Prism to compare the total scores of the four LLMs across five quality dimensions (relevance, comprehensiveness, accuracy, safety, and readability), as well as the scores of the four models in answering different types of questions. After testing the data for normality and homogeneity of variance, a two-way ANOVA was employed to evaluate the effects of model type and question type on the scores. Multiple comparisons were performed with the Tukey test, for which Prism automatically adjusts the p-values.
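The inter-rater reliability step described above can be illustrated with a minimal sketch of Fleiss' kappa. The study ran this analysis in SPSS; the Python function below is not the authors' code, and the function name `fleiss_kappa`, the category labels, and the toy ratings are assumptions made purely for illustration. It assumes every response is rated by the same number of raters.

```python
from collections import Counter

# Hypothetical category labels, chosen to mirror the three-point scale.
CATEGORIES = ("good", "borderline", "poor")


def fleiss_kappa(ratings, categories=CATEGORIES):
    """Fleiss' kappa for a list of items, each a list of rater labels.

    Assumes every item was rated by the same number of raters (>= 2).
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # counts[i][j]: how many raters put item i into category j.
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Observed agreement: mean over items of the pairwise agreement P_i.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(categories))
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (NOT from the study): four responses, three raters each.
toy_ratings = [
    ["good", "good", "good"],
    ["good", "good", "borderline"],
    ["borderline", "borderline", "borderline"],
    ["poor", "borderline", "good"],
]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.268
```

On real data each inner list would hold one response's ratings from the three experts. Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.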
A p-value of less than 0.05 was considered statistically significant.

Results

The performance of four models (Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4) was evaluated using a comprehensive set of 52 questions related to viral hepatitis. Developed by a panel of infectious disease specialists, the questions spanned four clinical domains: concepts, risk factors, diagnosis, and prevention and treatment. The responses were assessed using a two-step evaluation process: an initial categorization using a three-point scale, followed by a more in-depth five-point rating system that assessed relevance, comprehensiveness, accuracy, safety, and readability. This methodology allowed for a nuanced evaluation of the models' performance. The evaluations revealed varying levels of performance among the models, with each demonstrating unique strengths and areas for improvement. All question-and-answer sessions are presented in the S1 File. The inter-rater reliability (IRR) analysis of the scoring data, conducted with Fleiss' kappa in SPSS, yielded a kappa value of 0.77, suggesting a high level of agreement among the raters.

Comparison of Scores Across Four Models Using a Three-Point Scale

Responses were categorized as good, borderline, and poor, revealing significant differences among the models. ChatGPT-4.5 outperformed all other models, with 89.1% of its responses rated as good, 10.9% as borderline, and no poor responses, demonstrating its capacity to deliver high-quality answers efficiently and accurately (Fig. 2 ). Claude-3.5-sonnet followed with 71.15% good, 28.21% borderline, and 0.64% poor responses. This is markedly better than Gemini-2.0, which achieved 62.82% good, 35.9% borderline, and 0.64% poor responses. However, there remains a notable gap between Claude-3.5-sonnet and ChatGPT-4.5. ChatGPT-4, while versatile, showed less consistent performance with 50.64% good, 41.67% borderline, and 1.28% poor ratings.
These findings suggest that ChatGPT-4's capacity to deliver high-quality responses is inferior to that of the other three models.

Figure 2. The comparison of four models using a three-grade scale including good, borderline and poor. The rating was based on the overall quality of the response, considering factors such as clarity, relevance and accuracy. The scores are presented as percentages for each category, highlighting the variations in response quality and effectiveness among the models.

Evaluation of Four Models Using a Five-Point Rating System

The five-point rating system provided a detailed assessment across five dimensions: relevance, comprehensiveness, accuracy, safety, and readability. ChatGPT-4.5 consistently scored highest across all dimensions (Fig. 3 ). For relevance, it achieved an average score of 4.91 ± 0.18, compared to Claude-3.5-sonnet's 4.74 ± 0.23, Gemini-2.0's 4.67 ± 0.24, and ChatGPT-4's 4.51 ± 0.31. The statistical analysis confirmed significant differences, particularly between ChatGPT-4.5 and the other models, highlighting its superior contextual understanding.

Figure 3. The evaluation of four models across five dimensions (relevance, comprehensiveness, accuracy, safety, and readability) uses a five-point rating system from 1 to 5. Statistical significance is denoted with asterisks, where "ns" indicates not significant, "*" for P < 0.05, "**" for P < 0.01, "***" for P < 0.001, and "****" for P < 0.0001. Error bars represent the standard error.

In terms of comprehensiveness, ChatGPT-4.5 led with a score of 4.75 ± 0.42, followed by Claude-3.5-sonnet at 4.22 ± 0.60, Gemini-2.0 at 4.20 ± 0.53, and ChatGPT-4 at 3.66 ± 0.83, indicating its ability to provide thorough and detailed answers. Accuracy scores also demonstrated ChatGPT-4.5's leading position, with it scoring 4.81 ± 0.25, compared to Claude-3.5-sonnet's 4.41 ± 0.42, Gemini-2.0's 4.31 ± 0.40, and ChatGPT-4's 4.14 ± 0.67.
This indicates that ChatGPT-4.5 is more precise when handling complex queries, although other models also performed competently. For safety, ChatGPT-4.5 scored the highest at 4.87 ± 0.20, reflecting its strong adherence to providing accurate medical information. Claude-3.5-sonnet also performed well at 4.55 ± 0.34, followed by Gemini-2.0 at 4.52 ± 0.29 and ChatGPT-4 at 4.45 ± 0.33. While these scores are relatively close, ChatGPT-4.5's higher rating indicates a notable distinction in ensuring the safety and accuracy of the information provided. In terms of readability, ChatGPT-4.5 excelled with a score of 4.80 ± 0.29, compared to Claude-3.5-sonnet's 4.27 ± 0.34, Gemini-2.0's 4.21 ± 0.37, and ChatGPT-4's 4.10 ± 0.47. This highlights ChatGPT-4.5's ability to generate clear and comprehensible text, which is critical for ensuring user understanding.

Evaluation of Four Models Across Four Domains Using a Five-Point Scale

The performance across specific domains (concepts, risk factors, diagnosis, and prevention and treatment) provided further insights into the models' effectiveness. For this section, the total scores for each question across the four domains were averaged among the three judges and compared, allowing for a more comprehensive comparison. In the domain of concepts, ChatGPT-4.5 achieved an average score of 22.17 ± 0.96, compared to ChatGPT-4's 21.50 ± 1.32, Gemini-2.0's 21.21 ± 1.31, and Claude-3.5-sonnet's 20.71 ± 1.50 (Fig. 4 ). Although the differences were not statistically significant, ChatGPT-4.5 showed a slight edge in understanding and explaining concepts.

Figure 4. The performance of Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4 is evaluated across four domains: concepts, risk factors, diagnosis, and prevention and treatment. Each model was assessed using a five-point scale, with total scores for each question averaged among three judges, and a maximum possible score of 25. Statistical significance is denoted with asterisks, where "ns" indicates not significant, "*" for P < 0.05, "**" for P < 0.01, "***" for P < 0.001, and "****" for P < 0.0001. Error bars represent the standard error.

When evaluating risk factors, ChatGPT-4.5 led with an average score of 23.61 ± 1.31, followed by Claude-3.5-sonnet at 23.06 ± 1.64, Gemini-2.0 at 22.72 ± 1.14, and ChatGPT-4 at 21.83 ± 1.09. While differences in performance were not statistically significant, ChatGPT-4.5 and Claude-3.5-sonnet consistently demonstrated high performance in this domain, suggesting their effectiveness in identifying and elaborating on risk factors.

In diagnosis, ChatGPT-4.5 scored 24.73 ± 0.49, significantly higher than Claude-3.5-sonnet's 21.79 ± 1.57, Gemini-2.0's 21.15 ± 1.67, and ChatGPT-4's 21.12 ± 1.85. Statistical analysis revealed significant differences, particularly between ChatGPT-4.5 and the other models, demonstrating its superior performance in diagnostic queries. This finding underscores ChatGPT-4.5's advantage in addressing complex diagnostic questions accurately.

Regarding prevention and treatment, ChatGPT-4.5 excelled with an average score of 24.58 ± 0.78, ahead of Claude-3.5-sonnet's 22.62 ± 1.45, Gemini-2.0's 22.26 ± 1.43, and ChatGPT-4's 20.33 ± 2.52. Statistical analysis confirmed significant differences, particularly between ChatGPT-4.5 and the other models, highlighting its ability to provide detailed and precise information for the prevention and treatment of viral hepatitis.

Discussion

Our work compares LLMs in addressing issues related to viral hepatitis across five dimensions, four domains, and three rating levels. We assessed the performance of four language models and found that ChatGPT-4.5 significantly outperforms the other LLMs. Our findings demonstrate the capabilities of ChatGPT-4.5 and help fill the academic gap regarding this new model.
While numerous published articles integrate LLMs with disease treatment, research specifically applying LLMs to viral hepatitis is relatively rare, making our study a valuable contribution to this medical specialty. Our data show that, on the three-point scale, ChatGPT-4.5 received a higher proportion of "Good" ratings from clinical physicians than the other three models. Its proportion of "Borderline" ratings was markedly lower, and it received no "Poor" ratings. Our research further shows that ChatGPT-4.5 scores higher than the other LLMs in the five dimensions of relevance, comprehensiveness, accuracy, safety, and readability. These results support the robustness of ChatGPT-4.5. The study also reveals that ChatGPT-4.5's scores in diagnosis and in prevention and treatment are significantly higher than those of the other LLMs.

From a technical perspective, ChatGPT-4.5's superior performance likely stems from iterative improvements in its multimodal architecture. Compared to earlier models such as ChatGPT-4, its training data likely integrates more up-to-date medical knowledge such as clinical guidelines and evidence-based research. Notably, the four LLMs do not show significant differences in concepts and risk factors. We hypothesize that this is because such questions primarily assess data retrieval capabilities rather than requiring complex logical reasoning. Consequently, even the earliest-released model, ChatGPT-4, demonstrated satisfactory performance in addressing these inquiries. For subjective questions, both Claude-3.5-sonnet and ChatGPT-4.5 provide more intuitive answers, with ChatGPT-4.5 providing more comprehensive and well-organized responses, generally meeting the current needs for simple disease management. However, compared to clinical physicians, LLMs tend to provide a range of answers, resulting in mixed information and an inability to formulate personalized strategies.
Furthermore, LLM content must be kept up to date with the latest guidelines and research; when it is not, a certain level of information error is introduced.

LLM answers to common clinical questions about viral hepatitis can help general practitioners and healthcare institutions, especially non-specialist doctors, acquire information quickly and systematically. Additionally, LLMs can provide "informal" medical consultations for healthcare institutions. In chronic disease follow-up management, a set of common questions with systematic answers allows doctors to guide patients to check details independently during follow-up, and the set can be updated based on patient feedback. Clinically, LLMs can assist in organizing medical histories and case information. Although LLMs cannot make decisive treatment decisions for complex cases, they can provide systematic information matching, enhancing clinical efficiency.

However, our study has certain limitations. The design of the questions might contain subjectivity, the number and breadth of the question set are relatively limited, and the lack of real-world application testing during the evaluation process might affect the comprehensiveness and objectivity of the results. Although the questions were designed by infectious disease experts, they might not fully cover all real-world scenarios. The limited number of questions may not fully showcase the models' performance across different domains. Furthermore, we did not compare the Q&A results of LLMs with those of infectious disease physicians. Without this comparison, it is difficult to determine the true impact of LLMs on the quality of healthcare services. In addition, LLMs are highly time-sensitive, with new and more advanced models emerging regularly. As such, our research can only reflect the capabilities of LLMs at the time of the study. Also, we did not test other LLMs such as BERT, ERNIE, Gemma, and LaMDA.
While these models are powerful, they often require complex technical setups like local deployment, which can hinder usability. We acknowledge the limitation of not evaluating a broader range of models, but our focus was on selecting models that offer user-friendly interfaces suitable for patient use.

Looking ahead, large language models like ChatGPT-4.5 can improve clinical value in two ways. First, they can create automated connections with trusted medical databases to make sure they follow the latest clinical guidelines. Second, they can build a system where doctors and AI work together, using patient lab results to help create personalized treatment plans. These improvements would work well with ChatGPT-4.5's current strengths, such as quick data processing, 24/7 monitoring, and facial recognition. The facial recognition technology can help with both health checks and noticing small changes in facial expressions, which can improve both diagnosis and emotional support for patients. Together, these changes could help meet the WHO's 2030 goal of eliminating hepatitis by giving doctors quick, guideline-based insights and making their work more efficient while improving decision-making( 23 ).

Conclusions

This study evaluates LLMs' responses to issues related to viral hepatitis from multiple dimensions, revealing varying degrees of reliability in the answers provided by the LLMs, with ChatGPT-4.5 leading across all dimensions. ChatGPT-4.5 is particularly effective for handling complex medical queries on viral hepatitis, making it a reliable choice for accurate and comprehensive responses in viral hepatitis diagnosis.

Abbreviations

AI: artificial intelligence
LLM: large language model
CHB: chronic hepatitis B
CHC: chronic hepatitis C

Declarations

Ethics approval and consent to participate

The study protocol was approved by the Ethics Committee of Nanjing University Affiliated Drum Tower Hospital (Approval Number: 2008022).
Informed consent has been obtained from all participants in the study, and our research strictly adheres to the Declaration of Helsinki. The ethics approval scan has been added to the supplementary material S2_File. Consent for publication All authors agree to the publication. Competing interests The authors declare that there are no conflicts of interest. Funding statement This work was supported by the National key research and development program [2023YFC2309100]; National Natural Science Foundation of China [92269118,92269205,92369117]; Scientific Research Project of Jiangsu Health Commission [M2022013]; Clinical Trials from the Affiliated Drum Tower Hospital, Medical School of Nanjing University [2021-LCYJ-PY-10]; Project of Chinese Hospital Reform and Development Institute, Nanjing University, Aid project of Nanjing Drum Tower Hospital Health, Education &Research Foundation [NDYG2022003]. Author Contribution Juntao Ma: Formal analysis, Data curation, Validation, Writing – original draft, Writing – review & editing. Linyan Gong: Formal analysis, Writing – original draft, Writing – review & editing. Yuchen Song: Formal analysis, Writing – original draft, Writing – review & editing. Guiyang Wang: Data scoring. Juan Xia: Data scoring. Bei Jia: Conceptualization, Methodology, Resources, Data curation, Supervision, Validation, Writing – review & editing, Funding acquisition.Yuxin Chen: Conceptualization, Methodology, Data curation, Supervision, Validation, Writing – review & editing, Funding acquisition. Acknowledgements Not applicable. Data Availability Data is provided within the supplementary information files. References Binz M, Schulz E. Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci U S A. 2023 Feb 7;120(6):e2218523120. V S, E K, M SL, I C, Db Z, N BL, et al. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ breast cancer [Internet]. 2023 May 30 [cited 2024 Oct 14];9(1). 
Available from: https://pubmed.ncbi.nlm.nih.gov/37253791/ Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172–80. C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. doi: 10.1109/cvpr.2016.308. In. Zhang W, Feng Y, Meng F, You D, Liu Q. Bridging the gap between training and inference for neural machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics; 2019. doi: 10.18653/v1/p19-1426. Bhatia Y, Bajpayee A, Raghuvanshi D, Mittal H. Image captioning using Google’s inception-resnet-v2 and recurrent neural network. 2019 Twelfth International Conference on Contemporary Computing (IC3). IEEE; 2019. doi: 10.1109/ic3.2019.8844921. Zeng DY, Li JM, Lin S, Dong X, You J, Xing QQ, et al. Global burden of acute viral hepatitis and its association with socioeconomic development status, 1990-2019. J Hepatol. 2021 Sep;75(3):547–56. Nassal M. (2015). HBV cccDNA: viral persistence reservoir and key obstacle for a cure of chronic hepatitis B. Gut, 64(12), 1972–1984. https://doi.org/10.1136/gutjnl-2015-309809. Revill, P. A., Chisari, F. V., Block, J. M., Dandri, M., Gehring, A. J., Guo, H., Hu, J., Kramvis, A., Lampertico, P., Janssen, H. L. A., Levrero, M., Li, W., Liang, T. J., Lim, S. G., Lu, F., Penicaud, M. C., Tavis, J. E., Thimme, R., Members of the ICE-HBV Working Groups, ICE-HBV Stakeholders Group Chairs, … Zoulim, F. (2019). A global scientific strategy to cure hepatitis B. The lancet. Gastroenterology & hepatology, 4(7), 545–558. https://doi.org/10.1016/S2468-1253(19)30119-0. Ouyang, G., Pan, G., Guan, L., Wu, Y., Lu, W., Qin, C., Li, S., Xu, H., Yang, J., & Wen, Y. (2022). 
Incidence trends of acute viral hepatitis caused by four viral etiologies between 1990 and 2019 at the global, regional and national levels. Liver international : official journal of the International Association for the Study of the Liver, 42(12), 2662–2673. https://doi.org/10.1111/liv.15452IF: 6.7 Q1. Te H, Doucette K. Viral hepatitis: Guidelines by the American Society of Transplantation Infectious Disease Community of Practice. Clin Transplant 2019;33:e13514. https://doi.org/10.1111/ctr.13514. Devarbhavi H, Asrani SK, Arab JP, Nartey YA, Pose E, Kamath PS. Global burden of liver disease: 2023 update. J Hepatol. 2023 Aug;79(2):516–37. WHO. Global health sector strategies on, respectively, HIV, viral hepatitis and sexually transmitted infections for the period 2022-2030[EB/OL].( 2022-06)[2022-11-13]. https://cdn.who.int/media/docs/default-source/hq-hiv-hepatitis-and-stis-library/full-final-who-ghss-hiv-vh-sti_1-june2022.pdf?sfvrsn=7c074b36_9. Nazir A, Wang Z. A Comprehensive Survey of ChatGPT: Advancements, Applications, Prospects, and Challenges. Meta Radiol. 2023 Sep;1(2):100022. doi: 10.1016/j.metrad.2023.100022. Epub 2023 Oct 7. PMID: 37901715; PMCID: PMC10611551. OpenAI. ChatGPT: Optimizing Language Models for Dialogue.https://openai. com/blog/chatgpt/ (2022). Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb;2(2):e0000198. OpenAI. GPT-4 technical report. Preprint at arXiv https://doi.org/ 10.48550/arXiv.2303.08774 (2023). Google. Bard updates from Google I/O 2023: Images, new features. https://blog.google/technology/ai/google-bard-updates-io-2023/. Steimetz E, Minkowitz J, Gabutan EC, Ngichabe J, Attia H, Hershkop M, et al. Use of Artificial Intelligence Chatbots in Interpretation of Pathology Reports. JAMA Netw Open. 2024 May 22;7(5):e2412767. 
Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA. 2023 Mar 14;329(10):842–4.
Cakir H, Caglar U, Sekkeli S, Zerdali E, Sarilar O, Yıldız O, et al. Evaluating ChatGPT Ability to Answer Urinary Tract Infection-Related Questions. Infect Dis Now. 2024 Mar 7;104884.
Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023 Sep;95:104770.
Zhang N, Sun Z, Xie Y, Wu H, Li C. The latest version ChatGPT powered by GPT-4o: what will it bring to the medical field? Int J Surg. 2024 Jun 10;
Additional Declarations No competing interests reported.
Supplementary Files S1Table.xlsx S1 Table. Introduction of the Evaluators. S1File.pdf S1 File. Q&A Process for Each Model. S2File.pdf
Version 1 posted Editorial decision: Revision requested 12 Aug, 2025 Reviews received at journal 11 Aug, 2025 Reviews received at journal 07 Aug, 2025 Reviews received at journal 24 Jul, 2025 Reviewers agreed at journal 16 Jul, 2025 Reviewers agreed at journal 11 Jul, 2025 Reviewers agreed at journal 11 Jul, 2025 Reviewers invited by journal 11 Jul, 2025 Editor assigned by journal 11 Jul, 2025 Editor invited by journal 26 Jun, 2025 Submission checks completed at journal 25 Jun, 2025 First submitted to journal 25 Jun, 2025
{"props":{"pageProps":{"initialData":{"identity":"rs-6935611","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":485356384,"identity":"d79a8bb7-5932-4ec7-9f71-7fe69ef54999","order_by":0,"name":"Juntao Ma","email":"","orcid":"","institution":"Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Juntao","middleName":"","lastName":"Ma","suffix":""},{"id":485356385,"identity":"21d77c9b-b963-4582-b3e2-e57354a47a6b","order_by":1,"name":"Linyan Gong","email":"","orcid":"","institution":"Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Linyan","middleName":"","lastName":"Gong","suffix":""},{"id":485356386,"identity":"ea69efaf-4bcc-4f9c-92b2-4210f41d6681","order_by":2,"name":"Yuchen Song","email":"","orcid":"","institution":"School of Integrative Medicine，Nanjing University of Chinese 
Medicine","correspondingAuthor":false,"prefix":"","firstName":"Yuchen","middleName":"","lastName":"Song","suffix":""},{"id":485356387,"identity":"ed371630-8a52-4535-9c87-b37773f3bf56","order_by":3,"name":"Guiyang Wang","email":"","orcid":"","institution":"Department of Infectious Diseases, Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School","correspondingAuthor":false,"prefix":"","firstName":"Guiyang","middleName":"","lastName":"Wang","suffix":""},{"id":485356388,"identity":"39c568f5-3989-4b79-a925-8026ec3d29df","order_by":4,"name":"Juan Xia","email":"","orcid":"","institution":"Department of Infectious Diseases, Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School","correspondingAuthor":false,"prefix":"","firstName":"Juan","middleName":"","lastName":"Xia","suffix":""},{"id":485356389,"identity":"6158f95d-c991-4171-9411-f1267b4bc14b","order_by":5,"name":"Yun Liu","email":"","orcid":"","institution":"Department of Emergency, Nanjing Drum Tower Hospital","correspondingAuthor":false,"prefix":"","firstName":"Yun","middleName":"","lastName":"Liu","suffix":""},{"id":485356390,"identity":"ecd7ddfa-c601-4cd0-ac28-14a1efff4fbd","order_by":6,"name":"Bei Jia","email":"","orcid":"","institution":"Department of Infectious Diseases, Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School","correspondingAuthor":false,"prefix":"","firstName":"Bei","middleName":"","lastName":"Jia","suffix":""},{"id":485356392,"identity":"2c2a11a7-52ae-447c-8dc5-0a26ebc34af5","order_by":7,"name":"Yuxin 
Chen","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA5ElEQVRIiWNgGAWjYJCCA0CcwMDAfIBBAsYlUgtbAvFaGCBaeAwQJuADBjdyDA8XMNjl8Uv3fHtg2cYgx3cjgfFzAV4taQmHZzAkF0vOObvdQLKNwVjyRgKz9Ay8WpIPHOZhOJC44UbuNgmgFiAjgY2ZB6+WxAawlv03cp6BtNQToQVmi0QOG0hLggEhLZJnniUc5jFITpxxI81MQuKchOHMMw+bpfFp4TueY/yZp8IusX9G8jNpiTIbeb7jyQc/49OicADsPAiHWQIcmYwNeDQwMMgjSzN+wKt2FIyCUTAKRioAAF9sTILrWiNsAAAAAElFTkSuQmCC","orcid":"","institution":"Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine","correspondingAuthor":true,"prefix":"","firstName":"Yuxin","middleName":"","lastName":"Chen","suffix":""}],"badges":[],"createdAt":"2025-06-20 05:53:23","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6935611/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6935611/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s12911-025-03273-4","type":"published","date":"2025-11-26T15:58:48+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":87029857,"identity":"bb503842-1909-4a65-b925-0ab5f4ba10cd","added_by":"auto","created_at":"2025-07-18 12:39:29","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":281253,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFlow chart of the whole research process.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig.1.png","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/4e733aa17030c87207ff51ed.png"},{"id":87031031,"identity":"d12a2eeb-ec93-45c3-8a24-f43e8b6d8d5c","added_by":"auto","created_at":"2025-07-18 12:47:29","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":200077,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eComparison of Scores Across Four Models Using a Three-Point 
Scale.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig.2.tif.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/4980fb3cc620a6bdf86ed4c4.jpg"},{"id":87029861,"identity":"c87679a2-9a6a-4376-a1a8-ecc527c336dd","added_by":"auto","created_at":"2025-07-18 12:39:29","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":387570,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEvaluation of Four Models Using a Five-Point Rating System.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig.3.tif.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/ccf7d362a8087cc9c397f30f.jpg"},{"id":87031038,"identity":"e4631797-74fc-4f9f-8a78-b17f098ab97e","added_by":"auto","created_at":"2025-07-18 12:47:29","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":335230,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEvaluation of Four Models Across Four Domains Using a Five-Point Scale.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig.4.tif.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/db389f110042192e9d5f7c55.jpg"},{"id":97179477,"identity":"2e56537b-84e4-4d0a-8268-954d8b20701a","added_by":"auto","created_at":"2025-12-01 16:15:53","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1865602,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/1c3a8dae-74df-4e4e-984f-553ab3596b67.pdf"},{"id":87029858,"identity":"e1f6b74e-c0c9-4545-a026-92d3a6e8ecc0","added_by":"auto","created_at":"2025-07-18 12:39:29","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":12000,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eS1 Table. 
Introduction of the Evaluators.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"S1Table.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/0447559329f64a528b142c4e.xlsx"},{"id":87029864,"identity":"e21b3299-ac3c-4c70-a5d8-2a71421a6e05","added_by":"auto","created_at":"2025-07-18 12:39:29","extension":"pdf","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":1091840,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eS1 File. Q\u0026amp;A Process for Each Model.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"S1File.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/a94599b74e800d167c6a92d6.pdf"},{"id":87029879,"identity":"da0c3268-11ae-4837-a064-0954e8598bb9","added_by":"auto","created_at":"2025-07-18 12:39:29","extension":"pdf","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":1161215,"visible":true,"origin":"","legend":"","description":"","filename":"S2File.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6935611/v1/dbc07f10de31893375d56f2c.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Comparative Evaluation of Viral Hepatitis Question Responses: ChatGPT-4.5 Outperforms Three Established Models","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLarge language models (LLMs) utilize advanced artificial intelligence (AI) algorithms to generate language that closely mimics human communication. Trained on diverse datasets, these models excel in tasks ranging from translation to summarization and question answering(\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e). During training, the model employed self-supervised learning to predict context within the text, gradually learning patterns and relationships between words and sentences. 
Additionally, with improvements in large-scale computing and optimization(\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e), LLMs like Gemini-2.0 (released on February 5, 2025), Claude-3.5-sonnet (released on October 22, 2024), ChatGPT-4.5 (released on February 27, 2025), and ChatGPT-4 (released on March 14, 2023) have significantly advanced natural language processing (NLP), enabling machines to understand and generate human-like text(\u003cspan additionalcitationids=\"CR4 CR5\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eViral hepatitis, a significant global health issue, includes a range of liver diseases caused by different viruses, primarily classified as hepatitis A, B, C, D, and E. Although much progress has been made in the prevention and treatment of viral hepatitis(\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e), it remains a global public health concern that affects millions of people and causes a substantial number of deaths due to acute and chronic infection, cirrhosis, and liver cancer(\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). In 2020, viral hepatitis B- and C-related diseases led to 1.1\u0026nbsp;million deaths, similar to the number caused by tuberculosis (1.3\u0026nbsp;million deaths) and significantly higher than the number caused by HIV (0.68\u0026nbsp;million deaths) or malaria (0.627\u0026nbsp;million deaths)(\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). 
In May 2022, the WHO developed its \u003cb\u003e2022\u0026ndash;2030 Action Plan\u003c/b\u003e, which proposes eliminating viral hepatitis as a public health threat by 2030: reducing new infections of chronic hepatitis B (CHB) and chronic hepatitis C (CHC) by 90%, reducing mortality by 65%, and reaching a cure rate of 80%\u0026ndash;90%. Measured against these targets, the current situation in China is not encouraging(\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eAccurate and accessible information is crucial for healthcare professionals and the general public to understand, prevent, and manage these conditions, and effective communication and dissemination of information about viral hepatitis are critical in combating this public health threat. Our research explores how LLM technology can enhance the accuracy, accessibility, and timeliness of information related to viral hepatitis, benefiting both the public and healthcare providers. This supports WHO's 2030 targets for reducing viral hepatitis incidence and mortality by enhancing prevention, diagnosis, and treatment, as well as raising public awareness(\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eLLMs offer a promising solution by delivering reliable, comprehensive, and easy-to-understand information. It is worth noting, however, that LLMs are based on human-generated documents and that, even with proper curation, biases persist. In November 2022, OpenAI introduced an updated LLM called ChatGPT, which attracted significant attention for its accessibility, ease of use, and human-like outputs(\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). 
Since its release, numerous other LLMs and tools have been developed at an unprecedented pace(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e), including Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4(\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). These LLMs have shown potential in various applications, including medical information retrieval, patient education, and clinical decision support(\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). Recently, LLMs have also shown promise in providing information on multiple topics, including health-related queries(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e). For instance, ChatGPT has been shown to generate accurate and satisfactory answers for both public-based inquiries and EAU urological infection guideline-based questions(\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e). Similarly, ChatGPT-4 has the potential to deliver precise and comprehensive responses to myopia-related queries(\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThere is currently limited research comparing the performance of different LLMs on specific medical topics such as viral hepatitis. 
This study seeks to fill this gap by assessing the effectiveness of four prominent models\u0026mdash;Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5 and ChatGPT-4\u0026mdash;in providing accurate and relevant medical information. By comparing the performance of these LLMs, this research aims to highlight the strengths and limitations of each model in addressing viral hepatitis-related queries.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eStudy design\u003c/h2\u003e\u003cp\u003eA comprehensive set of 52 questions related to viral hepatitis was curated, covering various aspects such as concepts, risk factors, diagnosis, prevention and treatment (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). These questions were designed by three infectious disease experts: some were adapted from patient case scenarios, while the others were high-frequency issues the experts encounter in clinical practice. A portion of the questions is presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. The profiles of the three experts are provided in the S1 Table. Subsequently, we reviewed and revised the questions to ensure they adequately tested the depth and breadth of the LLMs\u0026rsquo; knowledge. 
The questions aimed to reflect real-world queries that healthcare professionals and patients might have, thereby providing a robust framework for evaluation.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eA portion of the questions designed by three infectious disease experts.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTopic\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eQuestions\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eConcept\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1. What is viral hepatitis?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e7. What is chronic hepatitis B?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eRisk Factors\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e9. Under what conditions is one more likely to contract hepatitis A?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e14. Is hepatitis B hereditary? My mother has hepatitis B; what should I be aware of? 
Should I undergo further testing?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eDiagnosis\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e16. What laboratory tests are needed to help doctors diagnose viral hepatitis?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e17. During a routine physical examination, is it necessary to perform a full hepatitis panel?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e18. Under what circumstances should co-infection with hepatitis D be considered?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003ePrevention and Treatment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e26. If someone in the family has viral hepatitis, how should others prevent it?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e29. If I have viral hepatitis, what dietary precautions should I take?\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe models evaluated in this study include Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5 and ChatGPT-4. Gemini-2.0, developed by Google, is known for its strong baseline performance in generating coherent and contextually relevant text. Claude-3.5-sonnet, developed by Anthropic, is noted for its accuracy, comprehensiveness, and contextual understanding. ChatGPT-4.5, the new model released by OpenAI, boasts faster speeds and superior performance. 
ChatGPT-4, an earlier model developed by OpenAI, is noted for its versatility and application in various NLP tasks. Before engaging in conversation with the LLMs, we first input a prompt: \"Answer my questions as an infectious disease doctor.\" Each question was asked in a separate dialogue, and the entire Q\u0026amp;A process was conducted in English.\u003c/p\u003e\u003cp\u003eThe evaluation focused on five dimensions of the responses: \u003cb\u003erelevance\u003c/b\u003e (the degree to which the model's response relates to the user's question; a highly relevant response should directly address the question and provide pertinent information and answers), \u003cb\u003ecomprehensiveness\u003c/b\u003e (the breadth and depth of the model's response; a comprehensive answer should cover all aspects of the user's question, provide detailed information, and supply background and context when necessary), \u003cb\u003eaccuracy\u003c/b\u003e (the correctness of the information in the model's response; accurate answers should be based on facts and avoid misleading users), \u003cb\u003esafety\u003c/b\u003e (whether the model's responses are free of harmful content; safety is a standard that LLMs must strictly control so that their output does not cause harm to society or individuals), and \u003cb\u003ereadability\u003c/b\u003e (the clarity and comprehensibility of the model's responses; highly readable responses should use concise language, have clear logic, and be easy for users to understand).\u003c/p\u003e\u003cp\u003eThe scoring scheme in this study is based on previously published studies(\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e), and the evaluation scheme for our project was determined through joint discussion among all participants. 
Responses were initially rated as good, borderline, or poor by the panel of three infectious disease experts who designed the questions. This rating was based on the overall quality of the response, considering factors such as clarity, relevance, and accuracy. All evaluators had passed China's College English Test Band 6 (CET-6), and the use of translation software was permitted throughout the question design and scoring process. After the first evaluation, a two-day interval was observed to minimize the influence of prior assessments on the experts' judgments. Following the initial rating, a different panel of three infectious disease experts assessed the responses on five dimensions: relevance, comprehensiveness, accuracy, safety, and readability. Each expert scored each dimension on a scale of 1 to 5.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003eStatistical analysis\u003c/h2\u003e\u003cp\u003eStatistical analyses were conducted using GraphPad Prism (version 10.1.2, GraphPad Software, San Diego, California USA, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.graphpad.com\u003c/span\u003e\u003cspan address=\"http://www.graphpad.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and SPSS (version 27.0, IBM Corp., Armonk, New York, USA, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.ibm.com/spss\u003c/span\u003e\u003cspan address=\"http://www.ibm.com/spss\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). We utilized Fleiss's Kappa in SPSS to conduct an inter-rater reliability (IRR) analysis on the scoring data. We then used Prism to compare the total scores of the four LLMs across five quality dimensions (relevance, comprehensiveness, accuracy, safety, and readability), as well as the scores of the four models in answering different types of questions. 
After testing the data for normality and homogeneity of variance, a two-way ANOVA was employed to evaluate the effects of model type and question type on the scores. Pairwise comparisons used Tukey's multiple comparisons test, for which Prism reports multiplicity-adjusted p-values. A p-value of less than 0.05 was considered statistically significant.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eThe performance of four models\u0026mdash;Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4\u0026mdash;was evaluated using a comprehensive set of 52 questions related to viral hepatitis. Developed by a panel of infectious disease specialists, the questions spanned four clinical domains: concepts, risk factors, prevention and treatment, and diagnosis. The responses were assessed using a two-step evaluation process: an initial categorization using a three-point scale, followed by a more in-depth five-point rating system that assessed dimensions such as relevance, comprehensiveness, accuracy, safety, and readability. This methodology allowed for a nuanced evaluation of the models' performance. The evaluations revealed varying levels of performance among the models, with each demonstrating unique strengths and areas for improvement. All question-and-answer sessions are presented in the S1 File. The inter-rater reliability (IRR) analysis of the scoring data, conducted with Fleiss's Kappa in SPSS, yielded a Kappa value of 0.77, indicating substantial agreement among the raters.\u003c/p\u003e\n\u003ch3\u003eComparison of Scores Across Four Models Using a Three-Point Scale\u003c/h3\u003e\n\u003cp\u003eResponses were categorized as good, borderline, and poor, revealing significant differences among the models. 
ChatGPT-4.5 outperformed all other models, with 89.1% of its responses rated as good, 10.9% as borderline, and no poor responses, demonstrating its capacity to deliver high-quality answers efficiently and accurately (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Claude-3.5-sonnet followed with 71.15% good, 28.21% borderline, and 0.64% poor responses, ahead of Gemini-2.0, which achieved 62.82% good, 35.9% borderline, and 0.64% poor responses. However, there remains a notable gap between Claude-3.5-sonnet and ChatGPT-4.5. ChatGPT-4, while versatile, showed less consistent performance with 50.64% good, 41.67% borderline, and 1.28% poor ratings. These findings suggest that ChatGPT-4's capacity to deliver high-quality responses is inferior to that of the other three models.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe comparison of four models using a three-point scale of good, borderline, and poor. The rating was based on the overall quality of the response, considering factors such as clarity, relevance and accuracy. The scores are presented as percentages for each category, highlighting the variations in response quality and effectiveness among the models.\u003c/p\u003e\n\u003ch3\u003eEvaluation of Four Models Using a Five-Point Rating System\u003c/h3\u003e\n\u003cp\u003eThe five-point rating system provided a detailed assessment across five dimensions: relevance, comprehensiveness, accuracy, safety, and readability. ChatGPT-4.5 consistently scored highest across all dimensions (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). For relevance, it achieved an average score of 4.91\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18, compared to Claude-3.5-sonnet's 4.74\u0026thinsp;\u0026plusmn;\u0026thinsp;0.23, Gemini-2.0's 4.67\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24, and ChatGPT-4's 4.51\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31. 
The statistical analysis confirmed significant differences, particularly between ChatGPT-4.5 and the other models, highlighting its superior contextual understanding.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe evaluation of four models across five dimensions\u0026mdash;relevance, comprehensiveness, accuracy, safety, and readability\u0026mdash;utilizes a five-point rating system from 1 to 5. Statistical significance is denoted with asterisks, where \"ns\" indicates not significant, \"*\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.05, \"**\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.01, \"***\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.001, and \"****\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.0001. Error bars represent the standard error.\u003c/p\u003e\u003cp\u003eIn terms of comprehensiveness, ChatGPT-4.5 led with a score of 4.75\u0026thinsp;\u0026plusmn;\u0026thinsp;0.42, followed by Claude-3.5-sonnet at 4.22\u0026thinsp;\u0026plusmn;\u0026thinsp;0.60, Gemini-2.0 at 4.20\u0026thinsp;\u0026plusmn;\u0026thinsp;0.53, and ChatGPT-4 at 3.66\u0026thinsp;\u0026plusmn;\u0026thinsp;0.83, indicating its ability to provide thorough and detailed answers. Accuracy scores also demonstrated ChatGPT-4.5's leading position, with it scoring 4.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.25, compared to Claude-3.5-sonnet\u0026rsquo;s 4.41\u0026thinsp;\u0026plusmn;\u0026thinsp;0.42, ChatGPT-4's 4.14\u0026thinsp;\u0026plusmn;\u0026thinsp;0.67, and Gemini-2.0's 4.31\u0026thinsp;\u0026plusmn;\u0026thinsp;0.40. This indicates that ChatGPT-4.5 is more precise when handling complex queries, although other models also performed competently.\u003c/p\u003e\u003cp\u003eFor safety, ChatGPT-4.5 scored the highest at 4.87\u0026thinsp;\u0026plusmn;\u0026thinsp;0.20, reflecting its strong adherence to providing accurate medical information. 
Claude-3.5-sonnet also performed well at 4.55\u0026thinsp;\u0026plusmn;\u0026thinsp;0.34, with Gemini-2.0 4.52\u0026thinsp;\u0026plusmn;\u0026thinsp;0.29 and ChatGPT-4 4.45\u0026thinsp;\u0026plusmn;\u0026thinsp;0.33. While these scores are relatively close, ChatGPT-4.5's higher rating indicates a notable distinction in ensuring the safety and accuracy of the information provided. In terms of readability, ChatGPT-4.5 excelled with a score of 4.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.29, compared to Claude-3.5-sonnet's 4.27\u0026thinsp;\u0026plusmn;\u0026thinsp;0.34, Gemini-2.0's 4.21\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37, and ChatGPT-4's 4.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.47. This highlights ChatGPT-4.5's ability to generate clear and comprehensible text, which is critical for ensuring user understanding.\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eEvaluation of Four Models Across Four Domains Using a Five-Point Scale\u003c/h2\u003e\u003cp\u003eThe performance across specific domains\u0026mdash;concepts, risk factors, diagnosis, and prevention and treatment\u0026mdash;provided further insights into the models' effectiveness. For this section, the total scores for each question across the four domains were averaged among the three judges and compared, allowing for a more comprehensive comparison.\u003c/p\u003e\u003cp\u003eIn the domain of concepts, ChatGPT-4.5 achieved an average score of 22.17\u0026thinsp;\u0026plusmn;\u0026thinsp;0.96, compared to Claude-3.5-sonnet's 20.71\u0026thinsp;\u0026plusmn;\u0026thinsp;1.50, ChatGPT-4's 21.50\u0026thinsp;\u0026plusmn;\u0026thinsp;1.32, and Gemini-2.0's 21.21\u0026thinsp;\u0026plusmn;\u0026thinsp;1.31 (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). 
Although no significant differences were found statistically, ChatGPT-4.5 showed a slight edge in understanding and explaining concepts.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe performance of Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4 is evaluated across four domains: concepts, risk factors, diagnosis, and prevention and treatment. Each model was assessed using a five-point scale, with total scores for each question averaged among three judges, and a maximum possible score of 25. Statistical significance is denoted with asterisks, where \"ns\" indicates not significant, \"*\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.05, \"**\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.01, \"***\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.001, and \"****\" for P\u0026thinsp;\u0026lt;\u0026thinsp;0.0001. Error bars represent the standard error.\u003c/p\u003e\u003cp\u003eWhen evaluating risk factors, ChatGPT-4.5 led with an average score of 23.61\u0026thinsp;\u0026plusmn;\u0026thinsp;1.31, followed by Claude-3.5-sonnet at 23.06\u0026thinsp;\u0026plusmn;\u0026thinsp;1.64, Gemini-2.0 at 22.72\u0026thinsp;\u0026plusmn;\u0026thinsp;1.14, and ChatGPT-4 at 21.83\u0026thinsp;\u0026plusmn;\u0026thinsp;1.09. While differences in performance were not statistically significant, ChatGPT-4.5 and Claude-3.5-sonnet consistently demonstrated high performance in this domain, suggesting their effectiveness in identifying and elaborating on risk factors.\u003c/p\u003e\u003cp\u003eIn diagnosing, ChatGPT-4.5 scored 24.73\u0026thinsp;\u0026plusmn;\u0026thinsp;0.49, significantly higher than Claude-3.5-sonnet's 21.79\u0026thinsp;\u0026plusmn;\u0026thinsp;1.57, Gemini-2.0's 21.15\u0026thinsp;\u0026plusmn;\u0026thinsp;1.67, and ChatGPT-4's 21.12\u0026thinsp;\u0026plusmn;\u0026thinsp;1.85. Statistical analysis revealed significant differences, particularly between ChatGPT-4.5 and the other models, demonstrating its superior performance in diagnostic queries. 
This finding underscores ChatGPT-4.5's advantage in addressing complex diagnostic questions accurately.\u003c/p\u003e\u003cp\u003eRegarding prevention and treatment, ChatGPT-4.5 excelled with an average score of 24.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.78, ahead of Claude-3.5-sonnet's 22.62\u0026thinsp;\u0026plusmn;\u0026thinsp;1.45, Gemini-2.0's 22.26\u0026thinsp;\u0026plusmn;\u0026thinsp;1.43, and ChatGPT-4's 20.33\u0026thinsp;\u0026plusmn;\u0026thinsp;2.52. Statistical analysis confirmed significant differences, particularly between ChatGPT-4.5 and the other models, highlighting its ability to provide detailed and precise information for the prevention and treatment of viral hepatitis.\u003c/p\u003e\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eOur work conducts a comparison of LLMs in addressing issues related to viral hepatitis across five dimensions, four domains, and three levels. We assessed the performance of four language models, ultimately finding that ChatGPT-4.5 significantly outperforms the other LLMs. Our findings demonstrate the capabilities of ChatGPT-4.5, rapidly filling the academic gap regarding this new model. While numerous published articles integrate LLMs with disease treatment, research specifically applying LLMs to viral hepatitis is relatively rare, making our study a valuable contribution to this medical specialty.\u003c/p\u003e\u003cp\u003eOur data revealed that ChatGPT-4.5 received a higher proportion of \"Good\" ratings than the other three models in the three-point comparison, as judged by clinical physicians. Its proportion of \"Borderline\" ratings is significantly lower, and it received no \"Poor\" ratings. Our research further shows that ChatGPT-4.5 scores higher than the other LLMs in the five dimensions of relevance, comprehensiveness, accuracy, safety, and readability. This result supports the robustness of ChatGPT-4.5. 
The study also reveals that ChatGPT-4.5's scores in diagnosis and in prevention and treatment are significantly higher than those of the other LLMs. From a technical perspective, ChatGPT-4.5's superior performance likely stems from iterative improvements in its multimodal architecture. Compared to earlier models like Gemini-2.0, its training data integrates more up-to-date medical knowledge such as clinical guidelines and evidence-based research.\u003c/p\u003e\u003cp\u003eNotably, the four LLMs do not show significant differences in the concepts and risk factors domains. We hypothesize that this is because such questions primarily assess data retrieval capabilities rather than requiring complex logical reasoning. Consequently, even the earlier-generation Gemini-2.0 demonstrated satisfactory performance in addressing these inquiries. For subjective questions, both Claude-3.5-sonnet and ChatGPT-4.5 provide more intuitive answers, with ChatGPT-4.5 providing more comprehensive and well-organized responses, generally meeting the current needs for simple disease management. However, compared to clinical physicians, LLMs tend to provide a range of answers, resulting in mixed information and the inability to formulate personalized strategies. Furthermore, LLM content must be kept up to date with the latest guidelines and research; otherwise it introduces a certain level of information error.\u003c/p\u003e\u003cp\u003eThe answers of LLMs to common clinical questions about viral hepatitis can help general practitioners and healthcare institutions, especially non-specialist doctors, acquire information quickly and systematically. Additionally, LLMs can provide \"informal\" medical consultations for healthcare institutions. In chronic disease follow-up management, setting common questions and systematic answers allows doctors to guide patients to check details during follow-up independently, and updates can be made based on patient feedback. 
Clinically, LLMs can assist in organizing medical histories and case information. Although LLMs cannot make decisive treatment decisions for complex cases, they can provide systematic information matching, enhancing clinical efficiency.\u003c/p\u003e\u003cp\u003eHowever, our study has certain limitations. The design of the questions might be subjective, the number and breadth of the question set are relatively limited, and the lack of real-world application testing during the evaluation process might slightly impact the comprehensiveness and objectivity of the results. Although the questions were designed by infectious disease experts, they might not fully cover all real-world scenarios. The limited number of questions may not fully showcase the models' performance across different domains. Furthermore, we did not compare the Q\u0026amp;A results between LLMs and infectious disease physicians. Without this comparison, it is difficult to determine the true impact of LLMs on the quality of healthcare services. In addition, LLMs are highly time-sensitive, with new and more advanced models emerging regularly. As such, our research can only reflect the capabilities of LLMs at the time of the study.\u003c/p\u003e\u003cp\u003eAlso, we did not test other LLMs such as BERT, ERNIE, Gemma, and LaMDA. While these models are powerful, they often require complex technical setups like local deployment, which can hinder usability. We acknowledge the limitation of not evaluating a broader range of models, but our focus was on selecting models that offer user-friendly interfaces suitable for patient use.\u003c/p\u003e\u003cp\u003eLooking ahead, large language models like ChatGPT-4.5 can improve clinical value in two ways. First, they can create automated connections with trusted medical databases to make sure they follow the latest clinical guidelines. 
Second, they can build a system where doctors and AI work together, using patient lab results to help create personalized treatment plans. These improvements would work well with ChatGPT-4.5's current strengths, such as quick data processing, 24/7 monitoring, and facial recognition. The facial recognition technology can help with both health checks and noticing small changes in facial expressions, which can improve both diagnosis and emotional support for patients. Together, these changes could help meet the WHO's 2030 goal of eliminating hepatitis by giving doctors quick, guideline-based insights and making their work more efficient while improving decision-making (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e).\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThis study evaluates LLMs' responses to issues related to viral hepatitis from multiple dimensions, revealing varying degrees of reliability in the answers provided by the LLMs, with ChatGPT-4.5 leading across all dimensions. ChatGPT-4.5 is particularly effective for handling complex medical queries on viral hepatitis, making it a reliable choice for accurate and comprehensive responses in viral hepatitis diagnosis.\u003c/p\u003e"},{"header":"Abbreviation","content":"\u003cp\u003eAI \u0026nbsp;artificial intelligence\u003c/p\u003e\n\u003cp\u003eLLM \u0026nbsp;large language model\u003c/p\u003e\n\u003cp\u003eCHB \u0026nbsp;chronic hepatitis B\u003c/p\u003e\n\u003cp\u003eCHC \u0026nbsp;chronic hepatitis C\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003cp\u003e The study protocol was approved by the Ethics Committee of Nanjing University Affiliated Drum Tower Hospital (Approval Number: 2008022). Informed consent was obtained from all participants in the study, and our research strictly adheres to the Declaration of Helsinki. 
The ethics approval scan has been added to the supplementary material S2_File.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003cp\u003e All authors agree to the publication.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003ch2\u003eCompeting interests\u003c/h2\u003e\u003cp\u003eThe authors declare that there are no conflicts of interest.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding statement\u003c/h2\u003e\u003cp\u003eThis work was supported by the National key research and development program [2023YFC2309100]; National Natural Science Foundation of China [92269118,92269205,92369117]; Scientific Research Project of Jiangsu Health Commission [M2022013]; Clinical Trials from the Affiliated Drum Tower Hospital, Medical School of Nanjing University [2021-LCYJ-PY-10]; Project of Chinese Hospital Reform and Development Institute, Nanjing University, Aid project of Nanjing Drum Tower Hospital Health, Education \u0026amp;Research Foundation [NDYG2022003].\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eJuntao Ma: Formal analysis, Data curation, Validation, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing. Linyan Gong: Formal analysis, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing. Yuchen Song: Formal analysis, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing. Guiyang Wang: Data scoring. Juan Xia: Data scoring. 
Bei Jia: Conceptualization, Methodology, Resources, Data curation, Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing, Funding acquisition. Yuxin Chen: Conceptualization, Methodology, Data curation, Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing, Funding acquisition.\u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e\u003cp\u003eNot applicable.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eData is provided within the supplementary information files.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBinz M, Schulz E. Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci U S A. 2023 Feb 7;120(6):e2218523120. \u003c/li\u003e\n\u003cli\u003eV S, E K, M SL, I C, Db Z, N BL, et al. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ breast cancer [Internet]. 2023 May 30 [cited 2024 Oct 14];9(1). Available from: https://pubmed.ncbi.nlm.nih.gov/37253791/\u003c/li\u003e\n\u003cli\u003eSinghal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172\u0026ndash;80. \u003c/li\u003e\n\u003cli\u003eC, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. doi: 10.1109/cvpr.2016.308. \u003c/li\u003e\n\u003cli\u003eZhang W, Feng Y, Meng F, You D, Liu Q. Bridging the gap between training and inference for neural machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics; 2019. doi: 10.18653/v1/p19-1426. \u003c/li\u003e\n\u003cli\u003eBhatia Y, Bajpayee A, Raghuvanshi D, Mittal H. Image captioning using Google\u0026rsquo;s inception-resnet-v2 and recurrent neural network. 
2019 Twelfth International Conference on Contemporary Computing (IC3). IEEE; 2019. doi: 10.1109/ic3.2019.8844921. \u003c/li\u003e\n\u003cli\u003eZeng DY, Li JM, Lin S, Dong X, You J, Xing QQ, et al. Global burden of acute viral hepatitis and its association with socioeconomic development status, 1990-2019. J Hepatol. 2021 Sep;75(3):547\u0026ndash;56. \u003c/li\u003e\n\u003cli\u003eNassal M. (2015). HBV cccDNA: viral persistence reservoir and key obstacle for a cure of chronic hepatitis B. Gut, 64(12), 1972\u0026ndash;1984. https://doi.org/10.1136/gutjnl-2015-309809. \u003c/li\u003e\n\u003cli\u003eRevill, P. A., Chisari, F. V., Block, J. M., Dandri, M., Gehring, A. J., Guo, H., Hu, J., Kramvis, A., Lampertico, P., Janssen, H. L. A., Levrero, M., Li, W., Liang, T. J., Lim, S. G., Lu, F., Penicaud, M. C., Tavis, J. E., Thimme, R., Members of the ICE-HBV Working Groups, ICE-HBV Stakeholders Group Chairs, \u0026hellip; Zoulim, F. (2019). A global scientific strategy to cure hepatitis B. The lancet. Gastroenterology \u0026amp; hepatology, 4(7), 545\u0026ndash;558. https://doi.org/10.1016/S2468-1253(19)30119-0. \u003c/li\u003e\n\u003cli\u003eOuyang, G., Pan, G., Guan, L., Wu, Y., Lu, W., Qin, C., Li, S., Xu, H., Yang, J., \u0026amp; Wen, Y. (2022). Incidence trends of acute viral hepatitis caused by four viral etiologies between 1990 and 2019 at the global, regional and national levels. Liver international : official journal of the International Association for the Study of the Liver, 42(12), 2662\u0026ndash;2673. https://doi.org/10.1111/liv.15452. \u003c/li\u003e\n\u003cli\u003eTe H, Doucette K. Viral hepatitis: Guidelines by the American Society of Transplantation Infectious Disease Community of Practice. Clin Transplant 2019;33:e13514. https://doi.org/10.1111/ctr.13514. \u003c/li\u003e\n\u003cli\u003eDevarbhavi H, Asrani SK, Arab JP, Nartey YA, Pose E, Kamath PS. Global burden of liver disease: 2023 update. J Hepatol. 2023 Aug;79(2):516\u0026ndash;37. 
\u003c/li\u003e\n\u003cli\u003eWHO. Global health sector strategies on, respectively, HIV, viral hepatitis and sexually transmitted infections for the period 2022-2030 [EB/OL]. (2022-06) [2022-11-13]. https://cdn.who.int/media/docs/default-source/hq-hiv-hepatitis-and-stis-library/full-final-who-ghss-hiv-vh-sti_1-june2022.pdf?sfvrsn=7c074b36_9. \u003c/li\u003e\n\u003cli\u003eNazir A, Wang Z. A Comprehensive Survey of ChatGPT: Advancements, Applications, Prospects, and Challenges. Meta Radiol. 2023 Sep;1(2):100022. doi: 10.1016/j.metrad.2023.100022. Epub 2023 Oct 7. PMID: 37901715; PMCID: PMC10611551. \u003c/li\u003e\n\u003cli\u003eOpenAI. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/ (2022). \u003c/li\u003e\n\u003cli\u003eKung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepa\u0026ntilde;o C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb;2(2):e0000198. \u003c/li\u003e\n\u003cli\u003eOpenAI. GPT-4 technical report. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.08774 (2023). \u003c/li\u003e\n\u003cli\u003eGoogle. Bard updates from Google I/O 2023: Images, new features. https://blog.google/technology/ai/google-bard-updates-io-2023/. \u003c/li\u003e\n\u003cli\u003eSteimetz E, Minkowitz J, Gabutan EC, Ngichabe J, Attia H, Hershkop M, et al. Use of Artificial Intelligence Chatbots in Interpretation of Pathology Reports. JAMA Netw Open. 2024 May 22;7(5):e2412767. \u003c/li\u003e\n\u003cli\u003eSarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA. 2023 Mar 14;329(10):842\u0026ndash;4. \u003c/li\u003e\n\u003cli\u003eCakir H, Caglar U, Sekkeli S, Zerdali E, Sarilar O, Yıldız O, et al. Evaluating ChatGPT Ability to Answer Urinary Tract Infection-Related Questions. 
Infect Dis Now. 2024 Mar 7;104884. \u003c/li\u003e\n\u003cli\u003eLim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, et al. Benchmarking large language models\u0026rsquo; performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023 Sep;95:104770. \u003c/li\u003e\n\u003cli\u003eZhang N, Sun Z, Xie Y, Wu H, Li C. The latest version ChatGPT powered by GPT-4o: what will it bring to the medical field? Int J Surg. 2024 Jun 10; \u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-informatics-and-decision-making","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"midm","sideBox":"Learn more about [BMC Medical Informatics and Decision Making](http://bmcmedinformdecismak.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/midm/default.aspx","title":"BMC Medical Informatics and Decision Making","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Viral Hepatitis, Large language model, ChatGPT, Comparative Study, Infectious disease","lastPublishedDoi":"10.21203/rs.3.rs-6935611/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6935611/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e\u003cp\u003eViral hepatitis is an 
important global public health problem that affects millions of people, and the public needs accurate information to understand the disease correctly. This study evaluated four large language models (LLMs), namely Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5 and ChatGPT-4, and compared their responses to questions related to viral hepatitis to determine whether ChatGPT-4.5 was better than the other three models in this field.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e\u003cp\u003eThis comparative evaluation study, conducted at Nanjing Drum Tower Hospital from March to April 2025, examined 52 questions pertaining to viral hepatitis. Four large language models were assessed based on their responses to these 52 questions, which encompassed four domains: concepts, risk factors, diagnosis, and prevention and treatment. Initial evaluation used a three-point scale of good, borderline, and poor. Further evaluation criteria included relevance, comprehensiveness, accuracy, safety, and readability, with each response scored on a scale of 1 to 5.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e\u003cp\u003eChatGPT-4.5 achieved the highest performance, with 89.1% of its responses rated as good, significantly outperforming Claude-3.5-sonnet (71.15% good), Gemini-2.0 (62.82% good), and ChatGPT-4 (50.64% good). Statistical analysis confirmed superior performance of ChatGPT-4.5 in all evaluated dimensions. Consistently, ChatGPT-4.5 scored the highest across all five criteria: relevance, comprehensiveness, accuracy, safety, and readability.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e\u003cp\u003eChatGPT-4.5 demonstrates superior performance in addressing viral hepatitis queries compared to the other three models. 
Its high reliability makes it a valuable tool for healthcare professionals and patients by improving information accessibility.\u003c/p\u003e","manuscriptTitle":"Comparative Evaluation of Viral Hepatitis Question Responses: ChatGPT-4.5 Outperforms Three Established Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-18 12:39:24","doi":"10.21203/rs.3.rs-6935611/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-08-12T08:34:15+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-11T14:40:51+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-07T11:52:25+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-07-24T15:44:02+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"170020092817481687765317970713097853340","date":"2025-07-16T16:00:05+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"84474335868076716308019245898322664292","date":"2025-07-11T13:55:03+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"63784993805415976448643700390017092965","date":"2025-07-11T10:42:24+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-07-11T10:01:28+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-07-11T09:52:53+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-06-26T10:15:10+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-06-25T15:40:24+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Informatics and Decision Making","date":"2025-06-25T15:37:07+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email 
protected]","identity":"bmc-medical-informatics-and-decision-making","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"midm","sideBox":"Learn more about [BMC Medical Informatics and Decision Making](http://bmcmedinformdecismak.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/midm/default.aspx","title":"BMC Medical Informatics and Decision Making","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"f209afc9-3673-49cd-82fa-c41ac638e7ea","owner":[],"postedDate":"July 18th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-12-01T16:12:06+00:00","versionOfRecord":{"articleIdentity":"rs-6935611","link":"https://doi.org/10.1186/s12911-025-03273-4","journal":{"identity":"bmc-medical-informatics-and-decision-making","isVorOnly":false,"title":"BMC Medical Informatics and Decision Making"},"publishedOn":"2025-11-26 15:58:48","publishedOnDateReadable":"November 26th, 2025"},"versionCreatedAt":"2025-07-18 12:39:24","video":"","vorDoi":"10.1186/s12911-025-03273-4","vorDoiUrl":"https://doi.org/10.1186/s12911-025-03273-4","workflowStages":[]},"version":"v1","identity":"rs-6935611","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6935611","identity":"rs-6935611","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

