Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions

doi:10.21203/rs.3.rs-8618169/v1

Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions

2026 · doi:10.21203/rs.3.rs-8618169/v1

preprint OA: closed CC-BY-4.0

📄 Open PDF Full text JSON View at publisher

Full text 67,564 characters · extracted from preprint-html · click to expand

Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions Raif Can Yarol, Ali Cantürk, Kenan Kadirli, Aslı Suner Karakulah, and 1 more This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-8618169/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract Background Large language models (LLM) have demonstrated considerable potential in supporting medical decision-making. Until recently, LLM were restricted to text inputs, limiting their utility in image interpretation. The introduction of ChatGPT-4V, with capability of analyzing visual data, has opened new opportunities to evaluate LLM performance in radiological image interpretation. This study aims to investigate the performance of ChatGPT-4V in radiological image interpretation compared to two board-certified radiologists. The secondary aim of the study is to compare the accuracy of primary and differential diagnoses provided by three different LLM. Materials and Methods A total of 121 radiology cases were retrospectively retrieved from the Association of Academic Radiology “Case of the Month” archive. Each case consisted of the following sections: patient presentation, unlabeled findings, labeled findings, final diagnosis, case discussion, and references. Three LLMs —Chat GPT-3.5, Chat GPT-4o, Google Gemini 1.5 Pro— were provided with the patient presentation and labeled findings sections.Chat GPT-4V and two board certified radiologists was provided with the patient presentation and unlabeled findings sections, including radiological images. All comparisons were conducted separately for image based evaluations (Radiologist 1, Radiologist 2, and GPT-4V) and text-based evaluations (GPT-4o, GPT-3.5, and Gemini 1.5 Pro). Categorical variables indicating correct and incorrect responses were summarized as frequencies and percentages. Primary diagnosis and differential diagnosis accuracy were analyzed with McNemar’s and Cochran’s Q test. Results Both Radiologist 1 (72.5%) and Radiologist 2 (71.1%) significantly outperformed ChatGPT4V (38.8%)(p < 0.001). ChatGPT-3.5 achieved the highest primary diagnostic accuracy (80.9%), followed by ChatGPT-4o (78.5%) and Gemini 1.5 Pro (72.7%). For differential diagnoses, ChatGPT-4o achieved the highest accuracy (90.9%), slightly outperforming ChatGPT-3.5 (90.1%) and significantly exceeding Gemini 1.5 Pro (81.8%, p = 0.001). No significant difference was observed between ChatGPT-4o and ChatGPT-3.5. Conclusion LLMs demonstrated strong performance in generating primary and differential diagnoses based on text-based radiologic findings, with ChatGPT-3.5 and ChatGPT-4o outperforming Gemini 1.5 Pro. However, ChatGPT-4V showed substantially lower accuracy in direct radiological image interpretation compared to radiologists. While promising in text-based applications, LLMs require further development and validation. Large Language Model Chat GPT 3.5 Chat GPT 4o Chat GPT 4V Gemini Figures Figure 1 INTRODUCTION Artificial intelligence is increasingly becoming integrated into both daily life and medicine. LLMs are deep learning based AI systems which is designed to process, understand, and generate human-like text ( 1 , 2 ). Assessing the utility of LLMs in medical questions represents a contemporary field of investigation. Studies have been conducted to evaluate the information processing capacity of LLMs in medical practice through examinations across a wide range of medical specialties, including general surgery, ophthalmology, sleep medicine, neurosurgery, orthopedics, internal medicine( 3 – 8 ). Radiology is an image-based specialty, making the application of LLMs in this field more challenging compared to other medical domains. There are numerous LLMs available, developed and released by various companies. Chat GPT 3.5 (Open AI), Chat GPT-4o (Open AI), Gemini 1.5 Pro (Google) are among the most commonly used models. Until recently, LLMs lacked the ability to analyze visual data, which limited their direct applicability to radiology. Consequently, studies in this field have primarily relied on converting findings from radiological images into textual descriptions, which were then used to query these models ( 9 – 12 ). The introduction of ChatGPT-4V (Vision) in 2023 marked a significant advancement in the field of artificial intelligence, enabling LLMs to process and interpret visual data( 13 ). This innovation has opened new avenues for research, including the evaluation of LLMs' capabilities in image analysis ( 14 ). On the other hand, the key requirements for the use of artificial intelligence tools in the healthcare field include reliability, fairness, and clinical benefit. Therefore, studies need to be conducted to repeatedly test these elements. The aim of this study is to investigate the Chat-GPT-4V 's ability to analyze radiological images with two board-certified radiologists and also compare the accuracy of primary and differential diagnosis provided by different LLMs (Chat GPT-3.5, Chat GPT-4o, Gemini 1.5 Pro) across various radiology subspecialities. MATERIAL- METHOD We reviewed cases from the "Case of the Month" archive of the Association of Academic Radiology between January 2023 and October 2024 ( https://www.aur.org/case-of-the-month ). Each case consisted of the following sections: patient presentation, unlabeled findings, labeled findings, final diagnosis, case discussion, and references. The unlabeled findings section contains only radiological images without any accompanying text. Labeled findings section included written radiological findings corresponding to the images in the unlabeled findings section. A total of 181 cases were initially identified. Two duplicate cases were excluded. Additionally, 38 cases were excluded because the radiological diagnosis was explicitly mentioned in the title or findings section. Thirteen cases with diagnoses comprising multiple components (e.g., esophageal perforation leading to right apical lung abscess) were excluded from the analysis. Seven cases with diagnoses based on histopathological findings rather than radiological findings (e.g., Granular Cell Tumor based on ultrasound-guided biopsy of the mass) were also excluded. Ultimately, 121 cases were included in the study. The workflow diagram for the study is provided in Fig. 1 . The cases were categorized into nine radiological subspecialties: Thorax, Abdominal/Gastrointestinal, Mammography, Musculoskeletal, Pediatrics, Neuroradiology, Cardiac, Genitourinary, and Obstetrics and Gynecology. Three LLMs —Chat GPT-3.5, Chat GPT-4o, Google Gemini 1.5 Pro— were provided with the patient presentation and labeled findings sections, including detailed radiological findings, and were prompted with the following command: "I am conducting an academic research study. Based on the clinical information and text based radiological findings I provide, could you give a main diagnosis and two alternative differential diagnoses while thinking like a radiology professor?" . Chat GPT-4V – which has ability to analyze visual images- was provided with the patient presentation and unlabeled findings sections, including radiological images, and was prompted with a modified version: "I am conducting an academic research study. Based on the clinical information and radiological images I provide, could you give a main diagnosis and two alternative differential diagnoses while thinking like a radiology professor?". Two board-certified radiologists with 5 and 6 years of experience, respectively, independently reviewed the cases with the same information provided. They were provided only with the patient presentation and unlabeled findings sections, including the radiological images, to propose a diagnosis. To minimize bias, they did not access the labeled findings section, which contained written findings. The radiologists provided only a single primary diagnosis for each case.The diagnoses provided by the radiologists were assessed by a third radiologist with 38 years of experience, who evaluated their accuracy based on the final diagnoses listed on the website. To prevent the models from being influenced by previous questions and answers, a new session was initiated, and the command was repeated before each question. The accuracy of the primary and differential diagnoses provided by the LLMs was evaluated independently. The final diagnosis listed on the website ( https://www.aur.org/case-of-the-month ) was accepted as the correct answer. Based on their responses, the cases were divided into three groups: ( 1 ) those who provided the correct main diagnosis, ( 2 ) those who provided the correct differential diagnosis, and ( 3 ) those who provided incorrect answers for both the main and differential diagnoses. This study does not require ethics committee approval as it was conducted retrospectively using publicly available data. For the statistical evaluation of diagnostic performance, both primary diagnosis accuracy and differential diagnosis accuracy were analyzed. Categorical variables indicating correct and incorrect responses were summarized as frequencies and percentages. For matched-pair comparisons involving the same cases interpreted by different readers, McNemar’s test was used. To compare more than two related diagnostic tools evaluated under identical conditions, Cochran’s Q test was employed. A p-value of less than 0.05 was considered statistically significant. All comparisons were conducted separately for image-based evaluations (Radiologist 1, Radiologist 2, and GPT-4V) and text-based evaluations (GPT-4o, GPT-3.5, and Gemini 1.5 Pro). Cross-modal analyses were also performed. All statistical analyses were carried out using IBM SPSS Statistics (Version 27.0; IBM Corp., Armonk, NY). RESULTS The study encompassed a total of 121 cases, categorized as follows: 13 thoracic, 25 abdominal and gastrointestinal, 9 mammographic, 16 musculoskeletal, 10 pediatric, 32 neuroradiologic, 5 cardiac, 8 genitourinary, and 3 obstetric and gynecologic cases. This study contains two groups. ChatGPT-4o, ChatGPT-3.5, and Gemini 1.5 Pro were provided with patient presentation and labeled findings. ChatGPT-4V and two board certified radiologists were provided with patient presentation and unlabeled radiological images. The two groups were analyzed separately. When the diagnostic accuracies for primary diagnosis were evaluated, ChatGPT-3.5 demonstrated the highest accuracy with a success rate of 80.9%, followed by ChatGPT-4o at 78.5%. Gemini 1.5 Pro achieved an accuracy of 72.7%. When the top three differential diagnoses were considered, ChatGPT-4o achieved the highest overall accuracy, correctly answering 110 out of 121 cases, corresponding to a success rate of 90.9%. ChatGPT-3.5 ranked second with 109 correct responses, yielding an accuracy rate of 90.1%. Gemini 1.5 Pro followed in third place, providing 99 correct answers and an overall accuracy of 81.8%. Direct comparison between ChatGPT-4o and Gemini 1.5 Pro revealed no statistically significant difference in primary diagnostic accuracy (78.5% vs. 72.7%, p = 0.157). However, ChatGPT-4o showed significantly higher overall success (90.9% vs. 81.8%, p = 0.001) when differential diagnosis taken into count. Similarly, ChatGPT-3.5 outperformed Gemini 1.5 Pro in terms of differential diagnosis (90.1% vs. 81.8%, p = 0.008), despite no statistically significant difference in primary diagnostic accuracy (80.8% vs. 72.5%, p = 0.061). No statistically significant differences were observed between ChatGPT-4o and ChatGPT-3.5 in terms of primary diagnostic accuracy (78.5% vs. 80.9%, p = 0.564) or overall success rate (90.9% vs. 90.1%, p = 0.366)(Table 1 ). Table 1 Primary and differential diagnostic accuracies of LLMs and Mcnemar test p values for each comparison Chat GPT 3.5 Chat GPT4o Gemini 1.5 Pro Chat GPT 3.5 vs Chat GPT4o Chat GPT 3.5 vs Gemini 1.5 Pro Chat GPT4o vs Gemini 1.5 Pro Primary diagnosis 80.9% (98/121) 78.5% (95/121) 72.7% (88/121) p = 0.564 p = 0.061 p = 0.157 Top 3 differential diagnosis 90.1% (110/121) 90.9%(109/121) 81.8% (99/121) p = 0.366 p = 0.008 p = 0.001 When evaluating the groups that provided patient presentation and unlabeled radiological images, radiologist 1 achieved the highest performance, with 87 correct answers and an overall accuracy rate of 72.5%, followed by Radiologist 2 with 86 correct responses and an accuracy rate of 71.1%. The lowest performance was observed in ChatGPT-4V, which provided 47 correct answers, corresponding to an accuracy rate of 38.8%. Radiologist 1 demonstrated a significantly higher primary diagnostic accuracy compared to ChatGPT-4V (72.5% vs. 38.8%, p < 0.001). Similarly, Radiologist 2 also outperformed ChatGPT-4V in terms of primary diagnostic accuracy (71.1% vs. 38.8%, p < 0.001)(Table 2 ) Table 2 Primary diagnosis accuracies of radiologists-Chat GPT 4V and p value for McNemar test. Both radiologists outperformed Chat GPT-4V in primary diagnosis accuracy. Radiologist 1 Radiologist 2 Chat GPT 4V Radiologist 1 vs Radiologist 2 Radiologist 1 vs Chat GPT 4V Radiologist 2 vs Chat GPT 4V Primary diagnosis 72.5% (87/121) 71.1% (86/121) 38.8% (47/121) p > 0.005 p < 0.001 p < 0.001 DISCUSSION In this study, we conducted two comparative analyses; ( 1 ) the accuracy of image interpretation between two board certified radiologists ( 2 ) the performance of three commonly used LLM’s in generating primary and differential diagnoses based on patient history and text-based radiological findings. In the overall analysis, both Radiologist 1 and Radiologist 2 demonstrated significantly higher diagnostic accuracy compared to ChatGPT-4V. Several factors may help explain why GPT-4V performed lower than both radiologists, even when provided with identical clinical information and radiological images. First, GPT-4V has been trained on far fewer DICOM based studies than radiologists typically encounter in clinical practice, which may limit its ability to recognize certain complex imaging patterns ( 15 ). Second, previous studies demonstrated that GPT-4V may overlook fine pixel-level details its image tokenization process, potentially leading to misclassifications ( 16 ). Third, accumulated clinical experience still plays a crucial role in complex image interpretation. On the contrary, it should be emphasized that, in real-world clinical practice, radiologists establish diagnoses using full DICOM datasets rather than relying on single slice images.A recent meta-analysis showed that generative AI models generally perform inferior to expert physicians, which is consistent with the observed superior accuracy of both radiologists in our study ( 17 ). Across diverse modalities and study designs, current evidence—including our own data—shows that GPT-4V’s image-interpretation skills fall short of experienced radiologists, with the gap widening in cognitively demanding domains such as neuroradiology. Larger, multi-centre datasets will be necessary to determine whether the smaller, non-significant differences we saw in other subgroups reflect true clinical parity or simply low statistical power( 15,18,19,20). In the text-based evaluation, where clinical information and structured radiologic findings were provided, GPT-4o, GPT-3.5, Gemini 1.5 Pro, demonstrated comparable performance in identifying the primary diagnosis, with no significant differences observed. However, Gemini 1.5 Pro was outperformed by both GPT-4o and GPT-3.5 in generating a top three differential diagnosis list. Similarly, in a previous study evaluating the diagnostic capabilities of LLMs on ‘Diagnosis Please’ cases, Gemini 1.5 Pro was also outperformed by GPT-4o (%49,4 vs %41,0, respectively ) ( 10 ). Despite being trained on more recent and extensive datasets, Chat GPT-4o did not demonstrate a statistically significant improvement over ChatGPT-3.5 in either primary diagnostic accuracy or differential diagnosis accuracy. This finding suggests that the capacity of ChatGPT-4o for radiologic case interpretation may not have substantially improved compared to its predecessor, despite additional training. These results highlight a potential need for dedicated, radiology-specific training to enhance the model’s diagnostic performance in medical imaging tasks. Both GPT-4o and GPT-3.5 achieved higher success rates compared to Gemini 1.5 Pro, reflecting their better capability to generate clinically relevant differential diagnoses when the correct diagnosis was not initially selected. This likely reflects their more advanced reasoning and language understanding capacities, enabling better integration of clinical information and radiologic findings into plausible alternatives. In contrast, Gemini 1.5 Pro may have more difficulty weighing multiple potential diagnoses, particularly when findings are less specific.The absence of a significant difference between GPT-4o and GPT-3.5 may be related to the study design, as the models were restricted to providing only one main and two differential diagnoses. This limited output may have prevented GPT-4o’s advanced reasoning capabilities from fully manifesting. Allowing a broader differential list might better highlight differences between these models. Although the models demonstrated comparable main diagnosis performance to radiologists in the text-based setting, their reasoning process may still be influenced by the structured nature of the input and the restricted differential list. While GPT-4o and GPT-3.5 effectively generated relevant differential diagnoses within these constraints, further studies are needed to assess their ability to fully replicate the complex clinical prioritization and decision-making processes applied by experienced radiologists in real-world practice. This study has several limitations. First, the sample size of 121 patients may not be sufficient to generalize findings across all radiologic diagnoses. Second, the heterogeneity among subspecialty groups limits the statistical power for subgroup-specific diagnostic performance analyses. Lastly, the radiologists included in this study did not provide differential diagnoses. Therefore, comparisons between radiologists and LLMs were limited to primary diagnoses only. Further studies are needed to evaluate and compare the performance of radiologists and LLMs in generating accurate differential diagnoses. In conclusion, although LLM’s demonstrate promising potential in generating diagnoses based on patient history and text-based radiologic findings, the current performance of ChatGPT-4V in direct radiological image interpretation remains far from reliable. Therefore it should be approached with caution. Abbreviations LLM Large Language Model Declarations ACKNOWLEDGEMENTS Human Ethics and Consent to Participate declarations: not applicable Competing interests : The author(s) declare no competing interests. Data Availability The authors confirm that the data supporting the findings of this study are available within the article. If necessary, it can also be obtained from the corresponding author. Funding: This research received no external funding Institutional Review Board Statement: Not applicable Informed Consent Statement: Not applicable Acknowledgments: The authors have reviewed and edited the output and take full responsibility for the content of this publication Conflicts of Interest: The authors declare no conflicts of interest References Alowais, S.A., Alghamdi, S.S., Alsuhebany, N. et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ 23 , 689 (2023). Vaishya, Raju, Anoop Misra, and Abhishek Vaish. "ChatGPT: Is this version good for healthcare and research?." Diabetes & Metabolic Syndrome: Clinical Research & Reviews 17.4 (2023): 102744. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023 May;104(5):269-273. https://doi.org/10.4174/astr.2023.104.5.269 Botross M, Mohammadi SO, Montgomery K, Crawford C. Performance of Google's Artificial Intelligence Chatbot "Bard" (Now "Gemini") on Ophthalmology Board Exam Practice Questions. Cureus. 2024 Mar 31;16(3):e57348. doi: 10.7759/cureus.57348. PMID: 38690460; PMCID: PMC11060832. Cheong, R.C.T., Pang, K.P., Unadkat, S. et al. Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard. Eur Arch Otorhinolaryngol 281 , 2137–2143 (2024). https://doi.org/10.1007/s00405-023-08381-3 Ali R, Tang OY, Connolly ID, Zadnik Sullivan PL, Shin JH, Fridley JS, Asaad WF, Cielo D, Oyelese AA, Doberstein CE, Gokaslan ZL, Telfeian AE. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery. 2023 Dec 1;93(6):1353-1365. doi: 10.1227/neu.0000000000002632. Epub 2023 Aug 15. PMID: 37581444. Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023 Dec 1;31(23):1173-1179. doi: 10.5435/JAAOS-D-23-00396. Epub 2023 Sep 4. PMID: 37671415; PMCID: PMC10627532. Tarabanis C, Zahid S, Mamalis M, Zhang K, Kalampokis E, Jankelson L. Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions. PLOS Digit Health. 2024 Sep 17;3(9):e0000604. doi: 10.1371/journal.pdig.0000604. PMID: 39288137; PMCID: PMC11407633. Suthar, Pokhraj P., et al. "Artificial intelligence (AI) in radiology: a deep dive into ChatGPT 4.0's accuracy with the American Journal of Neuroradiology's (AJNR)" Case of the Month"." Cureus 15.8 (2023). Sonoda, Yuki, et al. "Diagnostic performances of gpt-4o, claude 3 opus, and gemini 1.5 pro in “diagnosis please” cases." Japanese Journal of Radiology (2024): 1-5. Toyama, Yoshitaka, et al. "Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society." Japanese Journal of Radiology 42.2 (2024): 201-207. Gunes, Yasin Celal, and Turay Cesur. "The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study." Journal of Thoracic Imaging : 10-1097. OpenAI. ChatGPT-4V (Vision) . Artificial intelligence model, 2023. Accessed 28 Nov. 2024 Oura, T., Tatekawa, H., Horiuchi, D., Matsushita, S., Takita, H., Atsukawa, N., ... & Ueda, D. (2024). Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations. Japanese Journal of Radiology , 1-7. Huppertz, Marc Sebastian, et al. "Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation." European Radiology 35.3 (2025): 1111-1121. Jiang, Yuyang, et al. "Gpt-4v cannot generate radiology reports yet." arXiv preprint arXiv:2407.12176 (2024). Takita, Hirotaka, et al. "A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians." npj Digital Medicine 8.1 (2025): 175. Horiuchi, Daisuke, et al. "Comparing the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in challenging neuroradiology cases." Clinical neuroradiology 34.4 (2024): 779-787. Horiuchi, Daisuke, et al. "ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology." European Radiology 35.1 (2025): 506-516. Zhou, Yiliang, et al. "Evaluating GPT-4V (GPT-4 with Vision) on detection of radiologic findings on chest radiographs." Radiology 311.2 (2024): e233270. Additional Declarations No competing interests reported. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8618169","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":605129866,"identity":"0ab4d430-2c66-4d2f-b786-5401e152396c","order_by":0,"name":"Raif Can Yarol","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA3klEQVRIiWNgGAWjYFACxoYPIIqPvQFIGlgQpaVxBohi4zkA0iJBnDUQLRIJIIoILebthxubeWruybFJPr+64UeBBAN/e3cCXi0yZxKBWo4VG7NJ55Td7AE6TOLM2Q14tUgwJLY/5mFLSGyTzkm7wQPUYiCRS0AL/0OgLf+AWiTPpN38Q5QWCaDDeNuAWiTYj90mzhaJh42Nc/sSjNl4cthuyxhI8BD2C3/6w4Y33xLk+NmPP7v55o+NHH97L34tIMDEA6Z4DMAkQeUgwPgDTLE/IEr1KBgFo2AUjDwAAK/gREA043jTAAAAAElFTkSuQmCC","orcid":"","institution":"Iğdır Dr. Nevruz Erez State Hospital","correspondingAuthor":true,"prefix":"","firstName":"Raif","middleName":"Can","lastName":"Yarol","suffix":""},{"id":605129867,"identity":"0d42682a-8e85-41cb-ada3-eba1b986c8c4","order_by":1,"name":"Ali Cantürk","email":"","orcid":"","institution":"Sultan 2. Abdülhamid Han Eğitim ve Araştırma Hastanesi","correspondingAuthor":false,"prefix":"","firstName":"Ali","middleName":"","lastName":"Cantürk","suffix":""},{"id":605129868,"identity":"97e46d8d-b3e7-4d48-a4f2-211263a444ef","order_by":2,"name":"Kenan Kadirli","email":"","orcid":"","institution":"Sultan 2. Abdülhamid Han Eğitim ve Araştırma Hastanesi","correspondingAuthor":false,"prefix":"","firstName":"Kenan","middleName":"","lastName":"Kadirli","suffix":""},{"id":605129869,"identity":"8dec2143-efe4-4df0-8b92-335b539ef83b","order_by":3,"name":"Aslı Suner Karakulah","email":"","orcid":"","institution":"Ege University","correspondingAuthor":false,"prefix":"","firstName":"Aslı","middleName":"Suner","lastName":"Karakulah","suffix":""},{"id":605129870,"identity":"5f7ec497-a766-4465-a8e8-971d1293d87f","order_by":4,"name":"Oğuz Dicle","email":"","orcid":"","institution":"Dokuz Eylül University","correspondingAuthor":false,"prefix":"","firstName":"Oğuz","middleName":"","lastName":"Dicle","suffix":""}],"badges":[],"createdAt":"2026-01-16 11:27:46","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8618169/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8618169/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":104593732,"identity":"9a67ba55-60f7-4bf0-a4a3-7010b214fb88","added_by":"auto","created_at":"2026-03-13 17:42:54","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":94814,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of the study\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8618169/v1/44808aa17b3db306702de03c.png"},{"id":104593734,"identity":"a97eefa2-d374-4dbf-b500-09bab333bc25","added_by":"auto","created_at":"2026-03-13 17:42:59","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":593140,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8618169/v1/b3e52916-6fa4-4cbc-8b2c-9b0bd482a471.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eArtificial intelligence is increasingly becoming integrated into both daily life and medicine. LLMs are deep learning based AI systems which is designed to process, understand, and generate human-like text (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). Assessing the utility of LLMs in medical questions represents a contemporary field of investigation.\u003c/p\u003e \u003cp\u003eStudies have been conducted to evaluate the information processing capacity of LLMs in medical practice through examinations across a wide range of medical specialties, including general surgery, ophthalmology, sleep medicine, neurosurgery, orthopedics, internal medicine(\u003cspan additionalcitationids=\"CR4 CR5 CR6 CR7\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). Radiology is an image-based specialty, making the application of LLMs in this field more challenging compared to other medical domains. There are numerous LLMs available, developed and released by various companies. Chat GPT 3.5 (Open AI), Chat GPT-4o (Open AI), Gemini 1.5 Pro (Google) are among the most commonly used models. Until recently, LLMs lacked the ability to analyze visual data, which limited their direct applicability to radiology. Consequently, studies in this field have primarily relied on converting findings from radiological images into textual descriptions, which were then used to query these models (\u003cspan additionalcitationids=\"CR10 CR11\" citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). The introduction of ChatGPT-4V (Vision) in 2023 marked a significant advancement in the field of artificial intelligence, enabling LLMs to process and interpret visual data(\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). This innovation has opened new avenues for research, including the evaluation of LLMs' capabilities in image analysis (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). On the other hand, the key requirements for the use of artificial intelligence tools in the healthcare field include reliability, fairness, and clinical benefit. Therefore, studies need to be conducted to repeatedly test these elements.\u003c/p\u003e \u003cp\u003eThe aim of this study is to investigate the Chat-GPT-4V 's ability to analyze radiological images with two board-certified radiologists and also compare the accuracy of primary and differential diagnosis provided by different LLMs (Chat GPT-3.5, Chat GPT-4o, Gemini 1.5 Pro) across various radiology subspecialities.\u003c/p\u003e"},{"header":"MATERIAL- METHOD","content":"\u003cp\u003eWe reviewed cases from the \"Case of the Month\" archive of the Association of Academic Radiology between January 2023 and October 2024 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.aur.org/case-of-the-month\u003c/span\u003e\u003cspan address=\"https://www.aur.org/case-of-the-month\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eEach case consisted of the following sections: patient presentation, unlabeled findings, labeled findings, final diagnosis, case discussion, and references. The unlabeled findings section contains only radiological images without any accompanying text. Labeled findings section included written radiological findings corresponding to the images in the unlabeled findings section.\u003c/p\u003e \u003cp\u003eA total of 181 cases were initially identified. Two duplicate cases were excluded. Additionally, 38 cases were excluded because the radiological diagnosis was explicitly mentioned in the title or findings section. Thirteen cases with diagnoses comprising multiple components (e.g., esophageal perforation leading to right apical lung abscess) were excluded from the analysis. Seven cases with diagnoses based on histopathological findings rather than radiological findings (e.g., Granular Cell Tumor based on ultrasound-guided biopsy of the mass) were also excluded. Ultimately, 121 cases were included in the study. The workflow diagram for the study is provided in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe cases were categorized into nine radiological subspecialties: Thorax, Abdominal/Gastrointestinal, Mammography, Musculoskeletal, Pediatrics, Neuroradiology, Cardiac, Genitourinary, and Obstetrics and Gynecology.\u003c/p\u003e \u003cp\u003eThree LLMs \u0026mdash;Chat GPT-3.5, Chat GPT-4o, Google Gemini 1.5 Pro\u0026mdash; were provided with the patient presentation and labeled findings sections, including detailed radiological findings, and were prompted with the following command: \"I am conducting an academic research study. Based on the clinical information and text based radiological findings I provide, could you give a main diagnosis and two alternative differential diagnoses while thinking like a radiology professor?\" .\u003c/p\u003e \u003cp\u003eChat GPT-4V \u0026ndash; which has ability to analyze visual images- was provided with the patient presentation and unlabeled findings sections, including radiological images, and was prompted with a modified version: \"I am conducting an academic research study. Based on the clinical information and radiological images I provide, could you give a main diagnosis and two alternative differential diagnoses while thinking like a radiology professor?\". Two board-certified radiologists with 5 and 6 years of experience, respectively, independently reviewed the cases with the same information provided. They were provided only with the patient presentation and unlabeled findings sections, including the radiological images, to propose a diagnosis. To minimize bias, they did not access the labeled findings section, which contained written findings. The radiologists provided only a single primary diagnosis for each case.The diagnoses provided by the radiologists were assessed by a third radiologist with 38 years of experience, who evaluated their accuracy based on the final diagnoses listed on the website.\u003c/p\u003e \u003cp\u003eTo prevent the models from being influenced by previous questions and answers, a new session was initiated, and the command was repeated before each question.\u003c/p\u003e \u003cp\u003eThe accuracy of the primary and differential diagnoses provided by the LLMs was evaluated independently. The final diagnosis listed on the website (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.aur.org/case-of-the-month\u003c/span\u003e\u003cspan address=\"https://www.aur.org/case-of-the-month\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) was accepted as the correct answer. Based on their responses, the cases were divided into three groups: (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) those who provided the correct main diagnosis, (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) those who provided the correct differential diagnosis, and (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e) those who provided incorrect answers for both the main and differential diagnoses.\u003c/p\u003e \u003cp\u003eThis study does not require ethics committee approval as it was conducted retrospectively using publicly available data.\u003c/p\u003e \u003cp\u003eFor the statistical evaluation of diagnostic performance, both primary diagnosis accuracy and differential diagnosis accuracy were analyzed. Categorical variables indicating correct and incorrect responses were summarized as frequencies and percentages. For matched-pair comparisons involving the same cases interpreted by different readers, McNemar\u0026rsquo;s test was used. To compare more than two related diagnostic tools evaluated under identical conditions, Cochran\u0026rsquo;s Q test was employed. A p-value of less than 0.05 was considered statistically significant.\u003c/p\u003e \u003cp\u003eAll comparisons were conducted separately for image-based evaluations (Radiologist 1, Radiologist 2, and GPT-4V) and text-based evaluations (GPT-4o, GPT-3.5, and Gemini 1.5 Pro). Cross-modal analyses were also performed. All statistical analyses were carried out using IBM SPSS Statistics (Version 27.0; IBM Corp., Armonk, NY).\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003eThe study encompassed a total of 121 cases, categorized as follows: 13 thoracic, 25 abdominal and gastrointestinal, 9 mammographic, 16 musculoskeletal, 10 pediatric, 32 neuroradiologic, 5 cardiac, 8 genitourinary, and 3 obstetric and gynecologic cases. This study contains two groups. ChatGPT-4o, ChatGPT-3.5, and Gemini 1.5 Pro were provided with patient presentation and labeled findings. ChatGPT-4V and two board certified radiologists were provided with patient presentation and unlabeled radiological images. The two groups were analyzed separately.\u003c/p\u003e \u003cp\u003eWhen the diagnostic accuracies for primary diagnosis were evaluated, ChatGPT-3.5 demonstrated the highest accuracy with a success rate of 80.9%, followed by ChatGPT-4o at 78.5%. Gemini 1.5 Pro achieved an accuracy of 72.7%. When the top three differential diagnoses were considered, ChatGPT-4o achieved the highest overall accuracy, correctly answering 110 out of 121 cases, corresponding to a success rate of 90.9%. ChatGPT-3.5 ranked second with 109 correct responses, yielding an accuracy rate of 90.1%. Gemini 1.5 Pro followed in third place, providing 99 correct answers and an overall accuracy of 81.8%. Direct comparison between ChatGPT-4o and Gemini 1.5 Pro revealed no statistically significant difference in primary diagnostic accuracy (78.5% vs. 72.7%, p\u0026thinsp;=\u0026thinsp;0.157). However, ChatGPT-4o showed significantly higher overall success (90.9% vs. 81.8%, p\u0026thinsp;=\u0026thinsp;0.001) when differential diagnosis taken into count. Similarly, ChatGPT-3.5 outperformed Gemini 1.5 Pro in terms of differential diagnosis (90.1% vs. 81.8%, p\u0026thinsp;=\u0026thinsp;0.008), despite no statistically significant difference in primary diagnostic accuracy (80.8% vs. 72.5%, p\u0026thinsp;=\u0026thinsp;0.061). No statistically significant differences were observed between ChatGPT-4o and ChatGPT-3.5 in terms of primary diagnostic accuracy (78.5% vs. 80.9%, p\u0026thinsp;=\u0026thinsp;0.564) or overall success rate (90.9% vs. 90.1%, p\u0026thinsp;=\u0026thinsp;0.366)(Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePrimary and differential diagnostic accuracies of LLMs and Mcnemar test p values for each comparison\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChat GPT 3.5\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eChat GPT4o\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGemini 1.5 Pro\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eChat GPT 3.5 vs Chat GPT4o\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eChat GPT 3.5 vs Gemini 1.5 Pro\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eChat GPT4o vs Gemini 1.5 Pro\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrimary\u003c/p\u003e \u003cp\u003ediagnosis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e80.9%\u003c/p\u003e \u003cp\u003e(98/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e78.5%\u003c/p\u003e \u003cp\u003e(95/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e72.7%\u003c/p\u003e \u003cp\u003e(88/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ep\u0026thinsp;=\u0026thinsp;0.564\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003ep\u0026thinsp;=\u0026thinsp;0.061\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003ep\u0026thinsp;=\u0026thinsp;0.157\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTop 3 differential diagnosis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e90.1% (110/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e90.9%(109/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e81.8% (99/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ep\u0026thinsp;=\u0026thinsp;0.366\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003ep\u0026thinsp;=\u0026thinsp;0.008\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003ep\u0026thinsp;=\u0026thinsp;0.001\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eWhen evaluating the groups that provided patient presentation and unlabeled radiological images, radiologist 1 achieved the highest performance, with 87 correct answers and an overall accuracy rate of 72.5%, followed by Radiologist 2 with 86 correct responses and an accuracy rate of 71.1%. The lowest performance was observed in ChatGPT-4V, which provided 47 correct answers, corresponding to an accuracy rate of 38.8%. Radiologist 1 demonstrated a significantly higher primary diagnostic accuracy compared to ChatGPT-4V (72.5% vs. 38.8%, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Similarly, Radiologist 2 also outperformed ChatGPT-4V in terms of primary diagnostic accuracy (71.1% vs. 38.8%, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001)(Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e)\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePrimary diagnosis accuracies of radiologists-Chat GPT 4V and p value for McNemar test. Both radiologists outperformed Chat GPT-4V in primary diagnosis accuracy.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRadiologist 1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRadiologist 2\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eChat GPT 4V\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRadiologist 1 vs Radiologist 2\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eRadiologist 1 vs Chat GPT 4V\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eRadiologist 2 vs Chat GPT 4V\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrimary diagnosis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e72.5%\u003c/p\u003e \u003cp\u003e(87/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e71.1%\u003c/p\u003e \u003cp\u003e(86/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e38.8%\u003c/p\u003e \u003cp\u003e(47/121)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ep\u0026thinsp;\u0026gt;\u0026thinsp;0.005\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003ep\u0026thinsp;\u0026lt;\u0026thinsp;0.001\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003ep\u0026thinsp;\u0026lt;\u0026thinsp;0.001\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eIn this study, we conducted two comparative analyses; (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) the accuracy of image interpretation between two board certified radiologists (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) the performance of three commonly used LLM\u0026rsquo;s in generating primary and differential diagnoses based on patient history and text-based radiological findings.\u003c/p\u003e \u003cp\u003eIn the overall analysis, both Radiologist 1 and Radiologist 2 demonstrated significantly higher diagnostic accuracy compared to ChatGPT-4V. Several factors may help explain why GPT-4V performed lower than both radiologists, even when provided with identical clinical information and radiological images. First, GPT-4V has been trained on far fewer DICOM based studies than radiologists typically encounter in clinical practice, which may limit its ability to recognize certain complex imaging patterns (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Second, previous studies demonstrated that GPT-4V may overlook fine pixel-level details its image tokenization process, potentially leading to misclassifications (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). Third, accumulated clinical experience still plays a crucial role in complex image interpretation. On the contrary, it should be emphasized that, in real-world clinical practice, radiologists establish diagnoses using full DICOM datasets rather than relying on single slice images.A recent meta-analysis showed that generative AI models generally perform inferior to expert physicians, which is consistent with the observed superior accuracy of both radiologists in our study (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e). Across diverse modalities and study designs, current evidence\u0026mdash;including our own data\u0026mdash;shows that GPT-4V\u0026rsquo;s image-interpretation skills fall short of experienced radiologists, with the gap widening in cognitively demanding domains such as neuroradiology. Larger, multi-centre datasets will be necessary to determine whether the smaller, non-significant differences we saw in other subgroups reflect true clinical parity or simply low statistical power( \u003cb\u003e15,18,19,20).\u003c/b\u003e\u003c/p\u003e \u003cp\u003eIn the text-based evaluation, where clinical information and structured radiologic findings were provided, GPT-4o, GPT-3.5, Gemini 1.5 Pro, demonstrated comparable performance in identifying the primary diagnosis, with no significant differences observed. However, Gemini 1.5 Pro was outperformed by both GPT-4o and GPT-3.5 in generating a top three differential diagnosis list. Similarly, in a previous study evaluating the diagnostic capabilities of LLMs on \u0026lsquo;Diagnosis Please\u0026rsquo; cases, Gemini 1.5 Pro was also outperformed by GPT-4o (%49,4 vs %41,0, respectively\u003cb\u003e)\u003c/b\u003e(\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eDespite being trained on more recent and extensive datasets, Chat GPT-4o did not demonstrate a statistically significant improvement over ChatGPT-3.5 in either primary diagnostic accuracy or differential diagnosis accuracy. This finding suggests that the capacity of ChatGPT-4o for radiologic case interpretation may not have substantially improved compared to its predecessor, despite additional training. These results highlight a potential need for dedicated, radiology-specific training to enhance the model\u0026rsquo;s diagnostic performance in medical imaging tasks. Both GPT-4o and GPT-3.5 achieved higher success rates compared to Gemini 1.5 Pro, reflecting their better capability to generate clinically relevant differential diagnoses when the correct diagnosis was not initially selected. This likely reflects their more advanced reasoning and language understanding capacities, enabling better integration of clinical information and radiologic findings into plausible alternatives. In contrast, Gemini 1.5 Pro may have more difficulty weighing multiple potential diagnoses, particularly when findings are less specific.The absence of a significant difference between GPT-4o and GPT-3.5 may be related to the study design, as the models were restricted to providing only one main and two differential diagnoses. This limited output may have prevented GPT-4o\u0026rsquo;s advanced reasoning capabilities from fully manifesting. Allowing a broader differential list might better highlight differences between these models. Although the models demonstrated comparable main diagnosis performance to radiologists in the text-based setting, their reasoning process may still be influenced by the structured nature of the input and the restricted differential list. While GPT-4o and GPT-3.5 effectively generated relevant differential diagnoses within these constraints, further studies are needed to assess their ability to fully replicate the complex clinical prioritization and decision-making processes applied by experienced radiologists in real-world practice.\u003c/p\u003e \u003cp\u003eThis study has several limitations. First, the sample size of 121 patients may not be sufficient to generalize findings across all radiologic diagnoses. Second, the heterogeneity among subspecialty groups limits the statistical power for subgroup-specific diagnostic performance analyses. Lastly, the radiologists included in this study did not provide differential diagnoses. Therefore, comparisons between radiologists and LLMs were limited to primary diagnoses only. Further studies are needed to evaluate and compare the performance of radiologists and LLMs in generating accurate differential diagnoses.\u003c/p\u003e \u003cp\u003eIn conclusion, although LLM\u0026rsquo;s demonstrate promising potential in generating diagnoses based on patient history and text-based radiologic findings, the current performance of ChatGPT-4V in direct radiological image interpretation remains far from reliable. Therefore it should be approached with caution.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eLLM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLarge Language Model\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENTS \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHuman Ethics and Consent to Participate declarations: not applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e:\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe author(s) declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors confirm that the data supporting the findings of this study are available within the article. If necessary, it can also be obtained from the corresponding author.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u003c/strong\u003e This research received no external funding\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInstitutional Review Board Statement:\u0026nbsp;\u003c/strong\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInformed Consent Statement:\u0026nbsp;\u003c/strong\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments:\u003c/strong\u003e The authors have reviewed and edited the output and take full responsibility for the content of this publication\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflicts of Interest:\u003c/strong\u003e The authors declare no conflicts of interest\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eAlowais, S.A., Alghamdi, S.S., Alsuhebany, N. \u003cem\u003eet al.\u003c/em\u003e Revolutionizing healthcare: the role of artificial intelligence in clinical practice. \u003cem\u003eBMC Med Educ\u003c/em\u003e\u003cstrong\u003e23\u003c/strong\u003e, 689 (2023).\u003c/li\u003e\n \u003cli\u003eVaishya, Raju, Anoop Misra, and Abhishek Vaish. \u0026quot;ChatGPT: Is this version good for healthcare and research?.\u0026quot; \u003cem\u003eDiabetes \u0026amp; Metabolic Syndrome: Clinical Research \u0026amp; Reviews\u003c/em\u003e 17.4 (2023): 102744.\u003c/li\u003e\n \u003cli\u003eOh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023 May;104(5):269-273. https://doi.org/10.4174/astr.2023.104.5.269\u003c/li\u003e\n \u003cli\u003eBotross M, Mohammadi SO, Montgomery K, Crawford C. Performance of Google\u0026apos;s Artificial Intelligence Chatbot \u0026quot;Bard\u0026quot; (Now \u0026quot;Gemini\u0026quot;) on Ophthalmology Board Exam Practice Questions. Cureus. 2024 Mar 31;16(3):e57348. doi: 10.7759/cureus.57348. PMID: 38690460; PMCID: PMC11060832.\u003c/li\u003e\n \u003cli\u003eCheong, R.C.T., Pang, K.P., Unadkat, S. \u003cem\u003eet al.\u003c/em\u003e Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard. \u003cem\u003eEur Arch Otorhinolaryngol\u003c/em\u003e\u003cstrong\u003e281\u003c/strong\u003e, 2137\u0026ndash;2143 (2024). https://doi.org/10.1007/s00405-023-08381-3\u003c/li\u003e\n \u003cli\u003eAli R, Tang OY, Connolly ID, Zadnik Sullivan PL, Shin JH, Fridley JS, Asaad WF, Cielo D, Oyelese AA, Doberstein CE, Gokaslan ZL, Telfeian AE. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery. 2023 Dec 1;93(6):1353-1365. doi: 10.1227/neu.0000000000002632. Epub 2023 Aug 15. PMID: 37581444.\u003c/li\u003e\n \u003cli\u003eMassey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023 Dec 1;31(23):1173-1179. doi: 10.5435/JAAOS-D-23-00396. Epub 2023 Sep 4. PMID: 37671415; PMCID: PMC10627532.\u003c/li\u003e\n \u003cli\u003eTarabanis C, Zahid S, Mamalis M, Zhang K, Kalampokis E, Jankelson L. Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions. PLOS Digit Health. 2024 Sep 17;3(9):e0000604. doi: 10.1371/journal.pdig.0000604. PMID: 39288137; PMCID: PMC11407633.\u003c/li\u003e\n \u003cli\u003eSuthar, Pokhraj P., et al. \u0026quot;Artificial intelligence (AI) in radiology: a deep dive into ChatGPT 4.0\u0026apos;s accuracy with the American Journal of Neuroradiology\u0026apos;s (AJNR)\u0026quot; Case of the Month\u0026quot;.\u0026quot; \u003cem\u003eCureus\u003c/em\u003e 15.8 (2023).\u003c/li\u003e\n \u003cli\u003eSonoda, Yuki, et al. \u0026quot;Diagnostic performances of gpt-4o, claude 3 opus, and gemini 1.5 pro in \u0026ldquo;diagnosis please\u0026rdquo; cases.\u0026quot; \u003cem\u003eJapanese Journal of Radiology\u003c/em\u003e (2024): 1-5.\u003c/li\u003e\n \u003cli\u003eToyama, Yoshitaka, et al. \u0026quot;Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.\u0026quot; \u003cem\u003eJapanese Journal of Radiology\u003c/em\u003e 42.2 (2024): 201-207.\u003c/li\u003e\n \u003cli\u003eGunes, Yasin Celal, and Turay Cesur. \u0026quot;The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study.\u0026quot; \u003cem\u003eJournal of Thoracic Imaging\u003c/em\u003e: 10-1097.\u003c/li\u003e\n \u003cli\u003eOpenAI. \u003cem\u003eChatGPT-4V (Vision)\u003c/em\u003e. Artificial intelligence model, 2023. Accessed 28 Nov. 2024\u003c/li\u003e\n \u003cli\u003eOura, T., Tatekawa, H., Horiuchi, D., Matsushita, S., Takita, H., Atsukawa, N., ... \u0026amp; Ueda, D. (2024). Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations. \u003cem\u003eJapanese Journal of Radiology\u003c/em\u003e, 1-7.\u003c/li\u003e\n \u003cli\u003eHuppertz, Marc Sebastian, et al. \u0026quot;Revolution or risk?\u0026mdash;Assessing the potential and challenges of GPT-4V in radiologic image interpretation.\u0026quot; \u003cem\u003eEuropean Radiology\u003c/em\u003e 35.3 (2025): 1111-1121.\u003c/li\u003e\n \u003cli\u003eJiang, Yuyang, et al. \u0026quot;Gpt-4v cannot generate radiology reports yet.\u0026quot; \u003cem\u003earXiv preprint arXiv:2407.12176\u003c/em\u003e (2024).\u003c/li\u003e\n \u003cli\u003eTakita, Hirotaka, et al. \u0026quot;A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians.\u0026quot; \u003cem\u003enpj Digital Medicine\u003c/em\u003e 8.1 (2025): 175.\u003c/li\u003e\n \u003cli\u003eHoriuchi, Daisuke, et al. \u0026quot;Comparing the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in challenging neuroradiology cases.\u0026quot; \u003cem\u003eClinical neuroradiology\u003c/em\u003e 34.4 (2024): 779-787.\u003c/li\u003e\n \u003cli\u003eHoriuchi, Daisuke, et al. \u0026quot;ChatGPT\u0026rsquo;s diagnostic performance based on textual vs. visual information compared to radiologists\u0026rsquo; diagnostic performance in musculoskeletal radiology.\u0026quot; \u003cem\u003eEuropean Radiology\u003c/em\u003e 35.1 (2025): 506-516.\u003c/li\u003e\n \u003cli\u003eZhou, Yiliang, et al. \u0026quot;Evaluating GPT-4V (GPT-4 with Vision) on detection of radiologic findings on chest radiographs.\u0026quot; \u003cem\u003eRadiology\u003c/em\u003e 311.2 (2024): e233270.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Large Language Model, Chat GPT 3.5, Chat GPT 4o, Chat GPT 4V, Gemini","lastPublishedDoi":"10.21203/rs.3.rs-8618169/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8618169/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cb\u003eBackground\u003c/b\u003e\u003c/p\u003e \u003cp\u003eLarge language models (LLM) have demonstrated considerable potential in supporting medical decision-making. Until recently, LLM were restricted to text inputs, limiting their utility in image interpretation. The introduction of ChatGPT-4V, with capability of analyzing visual data, has opened new opportunities to evaluate LLM performance in radiological image interpretation. This study aims to investigate the performance of ChatGPT-4V in radiological image interpretation compared to two board-certified radiologists. The secondary aim of the study is to compare the accuracy of primary and differential diagnoses provided by three different LLM.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMaterials and Methods\u003c/b\u003e\u003c/p\u003e \u003cp\u003eA total of 121 radiology cases were retrospectively retrieved from the Association of Academic Radiology \u0026ldquo;Case of the Month\u0026rdquo; archive. Each case consisted of the following sections: patient presentation, unlabeled findings, labeled findings, final diagnosis, case discussion, and references. Three LLMs \u0026mdash;Chat GPT-3.5, Chat GPT-4o, Google Gemini 1.5 Pro\u0026mdash; were provided with the patient presentation and labeled findings sections.Chat GPT-4V and two board certified radiologists was provided with the patient presentation and unlabeled findings sections, including radiological images. All comparisons were conducted separately for image based evaluations (Radiologist 1, Radiologist 2, and GPT-4V) and text-based evaluations (GPT-4o, GPT-3.5, and Gemini 1.5 Pro). Categorical variables indicating correct and incorrect responses were summarized as frequencies and percentages. Primary diagnosis and differential diagnosis accuracy were analyzed with McNemar\u0026rsquo;s and Cochran\u0026rsquo;s Q test.\u003c/p\u003e\u003cp\u003e\u003cb\u003eResults\u003c/b\u003e\u003c/p\u003e \u003cp\u003eBoth Radiologist 1 (72.5%) and Radiologist 2 (71.1%) significantly outperformed ChatGPT4V (38.8%)(p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). ChatGPT-3.5 achieved the highest primary diagnostic accuracy (80.9%), followed by ChatGPT-4o (78.5%) and Gemini 1.5 Pro (72.7%). For differential diagnoses, ChatGPT-4o achieved the highest accuracy (90.9%), slightly outperforming ChatGPT-3.5 (90.1%) and significantly exceeding Gemini 1.5 Pro (81.8%, p\u0026thinsp;=\u0026thinsp;0.001). No significant difference was observed between ChatGPT-4o and ChatGPT-3.5.\u003c/p\u003e\u003cp\u003e\u003cb\u003eConclusion\u003c/b\u003e\u003c/p\u003e \u003cp\u003eLLMs demonstrated strong performance in generating primary and differential diagnoses based on text-based radiologic findings, with ChatGPT-3.5 and ChatGPT-4o outperforming Gemini 1.5 Pro. However, ChatGPT-4V showed substantially lower accuracy in direct radiological image interpretation compared to radiologists. While promising in text-based applications, LLMs require further development and validation.\u003c/p\u003e","manuscriptTitle":"Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-13 17:42:37","doi":"10.21203/rs.3.rs-8618169/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"b67ecd37-1a7b-4846-a512-7c85b52c197f","owner":[],"postedDate":"March 13th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-05-04T00:08:20+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-13 17:42:37","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8618169","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8618169","identity":"rs-8618169","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: preprint-html ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2026) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-23T02:00:01.238055+00:00

License: CC-BY-4.0