Evolution of AI in Anatomy Education: Comparing Current Large Language Models Against Historical ChatGPT Performance on USMLE-Style Questions

Olena Bolgova, Volodymyr Mavrych

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6219785/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal Publication published 28 Oct, 2025 in Scientific Reports. Version 1 posted 10

Abstract

Background
The integration of Large Language Models (LLMs) in medical education has gained significant attention, particularly in their ability to handle complex medical knowledge assessments. However, comprehensive evaluation of their performance in anatomical education remains limited.
To evaluate the accuracy of current LLMs compared to previous versions in answering anatomical multiple-choice questions and to assess their reliability across different anatomical topics.

Methods
We analyzed the performance of four LLMs (GPT-4o, Claude, Copilot, and Gemini) on 325 USMLE-style MCQs covering seven anatomical topics. Each model attempted the questions three times. Results were compared with the previous year's GPT-3.5 performance and random guessing. Statistical analysis included chi-square tests for performance differences.

Results
Current LLMs achieved an average accuracy of 76.8 ± 12.2%, significantly higher than GPT-3.5 (44.4 ± 8.5%) and random responses (19.4 ± 5.9%). GPT-4o demonstrated the highest accuracy (92.9 ± 2.5%), followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%). Performance varied significantly across anatomical topics, with Head & Neck (79.5%) and Abdomen (78.7%) showing the highest accuracy rates, while Upper Limb questions showed the lowest performance (72.9%). Only 29.5% of questions were answered correctly by all LLMs, and 2.5% were never answered correctly. Statistical analysis confirmed significant differences between models and across topics (χ² = 182.11–518.32, p < 0.001).

Conclusions
Current LLMs show markedly improved performance in anatomical knowledge assessment compared to previous versions, with GPT-4o demonstrating superior accuracy and consistency. However, performance variations across anatomical topics and between models suggest the need for careful consideration in educational applications. These tools show promise as supplementary resources in medical education while highlighting the continued necessity for human expertise.

Clinical trial number
Not applicable.
Subject areas: Health sciences/Medical research/Pre clinical studies; Health sciences/Anatomy; Health sciences/Anatomy/Endocrine system; Health sciences/Anatomy/Gastrointestinal system; Health sciences/Anatomy/Musculoskeletal system; Health sciences/Anatomy/Nervous system; Health sciences/Anatomy/Oral anatomy; Health sciences/Anatomy/Urinary tract

Keywords: Artificial Intelligence, Medical Education, Anatomy, Large Language Models, Assessment, ChatGPT

Introduction

The introduction of AI-driven large language models (LLMs) has raised great interest in their use in medical education and assessment. Since ChatGPT's first launch in November 2022, these models have excelled at complex tasks such as text processing and human-like language generation [ 1 , 2 ]. However, the rapid advancement of AI in medical education raises important questions about an AI-integrated future of medical assessment and training. As LLMs become more sophisticated, it is important to understand how these educational technologies can be incorporated into existing frameworks without compromising the quality of medical education. Many studies have demonstrated the ability of AI to improve students' learning; however, it is still best used alongside traditional teaching practices [ 3 , 4 ].

To evaluate LLMs' proficiency and reliability, many researchers are studying how these models handle sophisticated medical concepts and clinical reasoning in different examinations. These models, especially GPT-4, have performed remarkably well on a variety of medical licensing examinations worldwide. For instance, GPT-4 achieved 64.4–100% accuracy across numerous medical licensing examinations, compared with 36–77% for GPT-3.5 [ 5 ].
These conclusions have been substantiated by meta-analyses of LLM results across different model versions and examination formats, in which GPT-4 achieved an overall accuracy rate of 81%, significantly surpassing the 58% accuracy rate of GPT-3.5 [ 6 ].

The integration of AI into medical education presents both opportunities and challenges. While these tools add value for learning and examination purposes, there is some skepticism about their dependability and potential for errors. Studies indicate that LLMs tend to "hallucinate," that is, to generate false information while asserting it with confidence, which is why caution is needed when deploying them [ 7 ]. The usefulness of AI for creating educational materials, especially multiple-choice questions and assessments, has been researched, and the results show no consensus on how it compares with human-written content [ 7 , 8 ].

Some researchers have found that LLMs' performance varies significantly depending on the discipline and the nature of the assessment. For example, although ChatGPT proved useful as an engaging pedagogic tool for anatomy education, its ability to give detailed descriptions of anatomy and to create acceptable images was limited [ 9 ]. Similarly, in some clinical specialties, the performance of these models is comparable to that of junior medical residents but not of experienced clinicians [ 10 , 11 ].

In addition, the effects of AI on the education of medical students around the world are not uniform across languages and healthcare systems. Several studies suggest that LLM performance varies significantly in accuracy and reliability between language versions [ 12 ].
This also demonstrates the need to factor in culture and language when considering the application of AI-based educational materials in other regions, especially where medical education is delivered in languages other than English.

It has been reported that AI-generated content, when integrated into curricula, is an effective way of reinforcing what is taught in the classroom, provided such content is carefully reviewed by qualified instructors [ 13 ]. Recent research has specifically focused on identifying which curriculum components best prepare students to evaluate AI outputs critically [ 14 ]. Chatbots' ability to perform content generation and knowledge assessment is impressive; however, their limitations and possible risks must be considered.

The aim of this research was to evaluate the progress that ChatGPT has made over the last year and answer the following research questions:
- What is the performance accuracy of GPT-4o compared to the previous version of GPT-3.5 in answering USMLE-style MCQs across different anatomical topics?
- How does the performance of different LLMs (Claude, Copilot, and Gemini) differ across various anatomical topics compared to GPT-4o results?
- How do the different LLMs' accuracy and reliability compare to each other?

Methods

Study design
This research evaluated the performance of four of the most popular currently available large language models, GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Flash (Google), and Copilot (Microsoft), against the previous version, ChatGPT-3.5, on their proficiency in different anatomical topics. A total of 325 USMLE-style MCQs, each with five options and a single correct answer, were randomly chosen from the Gross Anatomy course's examination database for medical students and validated by three independent experts in our previous research [ 15 ]. The study did not include questions with images and tables. The selected questions encompassed various levels of complexity.
They were distributed across seven distinct topics/regions: Abdomen (50 MCQs), Back (25 MCQs), Head and Neck (50 MCQs), Lower Limb (50 MCQs), Pelvis (50 MCQs), Thorax (50 MCQs), and Upper Limb (50 MCQs), so the entire questionnaire comprised 325 questions.

Data collection
Each selected chatbot was required to answer the full questionnaire for the testing phase. GPT-4-1106, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Copilot proficiency in responding to multiple-choice questions was assessed in January 2025. ChatGPT (GPT-3.5) responses were recorded in October 2023 [ 15 ]. Each chatbot was given the prompt “Generate the list of correct answers for the following MCQs:” followed by the MCQ set for each topic, one topic at a time. This data collection was carried out three times, with no fixed interval between attempts. The results of these three successive attempts by each chatbot were recorded in a Microsoft Excel spreadsheet (Microsoft®365) and evaluated for accuracy. A total of 4,875 answers from LLMs were analyzed. To compare chatbots' results with random guessing, three random sets of answers were generated for the same questionnaire using the RAND() function in Microsoft Excel and analyzed.

Data analysis
The data from each of the three attempts were matched with the answer key and compared with results from previous attempts to find the percentage of correct and repeated answers. A detailed item analysis was then performed by topic and by question for each LLM. Basic statistical analysis was conducted using Statistica 13.5.0.17 (TIBCO® Statistica™), with the Pearson chi-squared test employed to compare performance between different topics and LLMs using a significance threshold of p ≤ 0.05.

Results

According to our data, on average, the four tested LLMs (GPT-4o, Claude, Copilot, and Gemini) accurately answered 76.8 ± 12.2% of the 325 MCQs from 7 topics in the Gross Anatomy course.
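The scoring and random-guess baseline described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authors used the RAND() function in Microsoft Excel, the answer key below is hypothetical (the real key is not part of this text), and the function names `accuracy` and `random_attempt` are invented for the example.

```python
import random

OPTIONS = "ABCDE"  # five answer options per MCQ, as in the study


def accuracy(answers, key):
    """Return the fraction of answers that match the answer key."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def random_attempt(n_questions, rng):
    """Simulate one attempt in which every answer is a uniform random guess."""
    return [rng.choice(OPTIONS) for _ in range(n_questions)]


# Hypothetical answer key for 325 questions (assumption, not the study's key).
rng = random.Random(2025)
key = [rng.choice(OPTIONS) for _ in range(325)]

# Three random attempts, mirroring the three attempts made by each chatbot.
random_scores = [accuracy(random_attempt(len(key), rng), key) for _ in range(3)]
mean_random = sum(random_scores) / len(random_scores)

print("random-guess accuracy per attempt:", [round(s, 3) for s in random_scores])
print(f"mean over attempts: {mean_random:.3f} (expected value is 1/5 = 0.200)")
```

With five options per question, the expected accuracy of pure guessing is 20%, which is the reference point for the random-response figure reported in the Results.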
This result was 27.7% above the GPT-3.5 year-ago results (44.4 ± 8.5%) and 3.7 times better than randomly generated responses (19.4 ± 5.9%) for the same questionnaire (Fig. 1 ).

There was significant variation in correct responses among the current versions of LLMs. The best results were shown by GPT-4o (92.9 ± 2.5%), followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%). In the box plot analysis of AI system performance, GPT-4o demonstrated consistent performance across topics, with scores tightly clustered between 88% and 95.3%, while Copilot showed the widest variation, from 56% to 89.3%.

Chi-square analysis revealed that all LLMs showed statistically significant deviation from the expected uniform distribution of correct answers, χ² = 182.11–518.32 (p < 0.001). This means that the null hypothesis of uniform performance across topics and models can be rejected, and there is a statistically significant relationship between LLM performance and both model type and topic/anatomical region.

A detailed topic-wise evaluation of the results from all up-to-date LLMs (GPT-4o, Claude, Copilot, and Gemini) was then performed and compared to ChatGPT-3.5 year-ago performance (Fig. 2 ). Only 29.5% (96/325) of questions were answered correctly in all attempts by all four models (GPT-4o, Claude, Copilot, and Gemini). General item analysis revealed that Head & Neck and Abdomen were the two best categories, with average results for these LLMs of 79.5% and 78.7%, respectively. In contrast, the lowest results were recorded for Upper Limb questions (72.9%). Statistical analysis revealed significant differences between topics' performances across all LLMs (all p-values < 0.001). The highest variation was calculated for the Upper Limb questions (χ² = 243.88) and the lowest for the Back (χ² = 109.25). 2.5% (8/325) of the questions were never answered correctly by any LLM.
Item analysis revealed that all of them were high-level critical-thinking questions, evenly distributed (1–2 each) among the different topics.

Comparative analysis of GPT-4o and GPT-3.5 performance (OpenAI)
The results of three successive GPT-4o attempts to answer the 325 Gross Anatomy MCQs in January 2025 showed 92.9 ± 2.5% correct answers, 48.5% (χ² = 270.67, p < 0.001) better than GPT-3.5 performance in October 2023 (44.4 ± 8.5%). Interestingly, for both generations of ChatGPT, the results gradually increased with each subsequent attempt: 91.7%, 93.2%, and 94.8% correct answers for GPT-4o, and 42.8%, 43.1%, and 44% for GPT-3.5.

The coincidence (agreement) of GPT-4o's answers with its earlier attempts was 96.6–98.2%, and among them, the coincidence of correct answers was 91.4–93.2%, so consistency and reliability were very good. The previous model, GPT-3.5, did not show such results a year ago: coincidence with previously generated answers was 56–61.8%, and only 31.7–32.3% of them were correct, so the answers were mostly unreliable.

Topic-wise analysis revealed the largest performance gaps between the two models for Thorax, Upper Limb, and Lower Limb, and the smallest for Back, Head & Neck, and Pelvis (Fig. 3 ). GPT-4o's best-performing topics were Pelvis (0.953 mean, 46/49 perfect scores), Upper Limb (0.947 mean, 45/49 perfect scores), and Thorax (0.94 mean, 46/49 perfect scores). GPT-3.5 demonstrated its best results in the following topics: Back (0.60 mean, 11/24 perfect scores), Head & Neck (0.50 mean, 17/49 perfect scores), and Pelvis (0.46 mean, 18/49 perfect scores). GPT-4o answered 91.1% (296/325) of questions correctly in all three attempts, a striking result compared to the year-ago GPT-3.5 performance, when only 28.3% (92/325) of questions were consistently answered correctly.
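The consistency measures used in this section (coincidence between attempts, coinciding correct answers, and questions answered correctly in every attempt or in none) can be illustrated with a minimal Python sketch. It rests on assumptions: the toy answers are hypothetical, the function names are invented, and "coincidence of correct answers" is interpreted here as the share of all questions that received the same answer in both attempts and were correct, which may not be exactly how the authors computed it.

```python
def agreement(attempt_a, attempt_b):
    """Fraction of questions given the same answer in both attempts."""
    same = sum(a == b for a, b in zip(attempt_a, attempt_b))
    return same / len(attempt_a)


def agreeing_and_correct(attempt_a, attempt_b, key):
    """Fraction of questions answered identically in both attempts and correctly."""
    hits = sum(a == b == k for a, b, k in zip(attempt_a, attempt_b, key))
    return hits / len(key)


def always_and_never_correct(attempts, key):
    """Count questions answered correctly in every attempt and in no attempt."""
    always = sum(all(att[i] == k for att in attempts) for i, k in enumerate(key))
    never = sum(all(att[i] != k for att in attempts) for i, k in enumerate(key))
    return always, never


# Toy data (hypothetical): six questions, three attempts by one model.
key = list("ABCDEA")
attempts = [list("ABCDEB"), list("ABCDAB"), list("ABCEEB")]

print(agreement(attempts[0], attempts[1]))
print(agreeing_and_correct(attempts[0], attempts[1], key))
print(always_and_never_correct(attempts, key))
```

Separating agreement from correctness matters because a model can be highly repeatable while being repeatably wrong; the last toy question, answered "B" in every attempt against a key of "A", is such a case.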
GPT-4o failed to answer only 5.2% (17/325) of the MCQs correctly in any of its three attempts, whereas GPT-3.5 was unable to answer 37.8% (123/325) of the questions.

Claude 3.5 Sonnet (Anthropic)
Claude, across three attempts, provided 76.7 ± 5.7% correct answers to the same questionnaire, 16.2% less ( p < 0.001 ) than GPT-4o. The first attempt was the most successful, with 78.8% correct answers, followed by 76% and 75.5% in the second and third attempts, so its trend across attempts was the opposite of that of the ChatGPT models. The coincidence of Claude's answers with its previous attempts was 86.8–89.2%, and among them, the coincidence of correct answers was 71.7–73.5%, indicating relatively good consistency.

The item analysis showed that Claude correctly answered 80.7–86.7% of questions on Lower Limb and Pelvis, and its two worst topics were Upper Limb and Abdomen (69.3–72%). Results for the rest of the topics were in the mid-70s. Claude answered 70.5% (229/325) of questions correctly across all attempts and did not solve 17.2% (56/325) of MCQs. These were comprehensive questions from different topics.

Copilot (Microsoft)
A disadvantage of working with Copilot is that it accepts only up to 4000 characters in the prompt, so only 15–25 MCQs can be answered at a time. However, a big advantage of this LLM is that Copilot is integrated into Microsoft's working environment (Windows, Office, web browser) and is always available.

Copilot generated 73.9 ± 11.9% accurate answers for the 325 MCQs from the Gross Anatomy course, the third-best result. This is 19% ( p < 0.001 ) below GPT-4o but only 2.8% less than Claude's result. Across attempts, it showed the same trend as ChatGPT, with rising results: 65.5%, 72%, and 80.6%. The coincidence of Copilot's answers with its earlier attempts was 74.8–85.2%; among them, the coincidence of correct answers was 60.6–69.8%.
The high standard deviation (11.9%) suggested more variability in its performance and, consequently, low reliability. Copilot solved 59.1% of MCQs (192/325) across all three attempts; however, it could not answer 16% (52/325) of the questions. These MCQs were mostly from Thorax and Pelvis material. The item analysis revealed that Copilot performed well on Abdomen and Back questions (87.3–89.3%), and its two lowest results were in Pelvis and Thorax (56–64.8%).

Gemini 1.5 Flash (Google)
Among current LLMs, Gemini finished last with 63.7 ± 6.5% correct answers to the same set of questions. This result was 28.5% below GPT-4o's performance but 19.3% above GPT-3.5 performance; both differences were statistically significant ( p < 0.001 ). The first two attempts showed almost identical results, 60.9% and 60% correct answers; the third one was the most successful, with 71.4% correct. The coincidence of Gemini's answers with its previous attempts was 62.8–85.2%, and among them, the coincidence of correct answers was 50.8–55.4%, with a moderate standard deviation of 6.5%.

Gemini answered 47.7% (155/325) of questions correctly across all attempts and did not solve 17.8% of MCQs (58/325). Item performance analysis revealed that Gemini's two best topics were Pelvis and Head & Neck (71.3–72.6%), and its lowest result was on Upper Limb questions (56%).

Difference in LLMs performance
Due to the binary nature of the data, we employed the Pearson Chi-square test to compare the performance of the different AI-driven chatbots against each other (Table 1 ).
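As an illustration of how a Pearson chi-square comparison of two models' correct/incorrect counts works, the following Python sketch computes the statistic for a 2×2 table using only the standard library. It is a simplified sketch, not the authors' computation: the study used Statistica and reports 3 degrees of freedom, so its contingency tables were laid out differently, and the counts used here are hypothetical.

```python
import math


def chi_square_2x2(correct_a, wrong_a, correct_b, wrong_b):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value) with 1 degree of freedom.
    Assumes every row and column total is non-zero.
    """
    observed = [[correct_a, wrong_a], [correct_b, wrong_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the hypothesis that both models perform alike.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square survival function reduces to
    # erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, not the study's data: model A 90/100, model B 70/100.
stat, p = chi_square_2x2(90, 10, 70, 30)
print(f"chi-square = {stat:.2f}, p = {p:.2e}")
```

The closed-form p-value applies only to 1 degree of freedom; for larger tables a statistics package would normally be used instead.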
Table 1. Results of the Pearson Chi-square test comparing the performance of Copilot, Claude, GPT-4o, and Gemini against each other

LLMs                  Chi-square   df   P-value
GPT-4o vs Claude      46.29        3    3.54E-10*
GPT-4o vs Copilot     93.56        3    6.49E-20*
GPT-4o vs Gemini      150.53       3    1.52E-32*
GPT-4o vs GPT-3.5     270.67       3    1.87E-58*
Claude vs Copilot     18.14        3    2.72E-04*
Claude vs Gemini      49.76        3    2.00E-11*
Claude vs GPT-3.5     121.01       3    5.94E-26*
Copilot vs Gemini     17.6         3    2.08E-04*
Copilot vs GPT-3.5    86.59        3    1.99E-19*
Gemini vs GPT-3.5     41.85        3    3.83E-09*

* Statistically significant difference.

All p-values were extremely small (well below 0.05 and even 0.001), indicating that the performance differences between all model pairs are highly statistically significant. The smallest p-values are observed in comparisons involving GPT-4o with other models. The relatively larger (but still very small) p-values are found in Copilot vs Gemini and Claude vs Copilot. These results quantify the statistical significance of the performance differences we observed, with all comparisons showing strong evidence of real differences in performance distributions between the models.

Discussion

Principal Findings
Our data demonstrate the dramatic progress achieved by contemporary LLMs in answering anatomical multiple-choice questions. Current LLMs achieve an average accuracy of 76.8 ± 12.2%. This represents a dramatic increase over last year's GPT-3.5 performance (44.4 ± 8.5%) and random answers (19.4 ± 5.9%). This improvement reflects considerable strides in the AI's ability to understand and utilize medical information.

Among all the models tested, GPT-4o stood out as the best performer with a remarkable accuracy of 92.9 ± 2.5%, followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%). The ranking of LLMs' performance remained the same across attempts, although some models were more reliable than others.
Most striking was GPT-4o's accuracy across different anatomical topics, which ranged from 88% to 95.3%, while Copilot ranged from 56% to 89.3%.

Comparison to Literature
Recent research supports and extends our findings in the field of AI-assisted medical education. On National Board of Medical Examiners sample questions, GPT-4 achieved a perfect score of 100%, far better than GPT-3.5 (82.21%), Claude (84.66%), and Bard (75.46%) [ 16 ]. In an analysis of radiology board exam questions, GPT-4 scored 83.3%, well above Claude (62%), Gemini (55.3%), and Bard (54.7%), and performed better in pattern recognition (85%) than in intervention planning (71%) [ 17 ]. Meta-analyses of medical licensing examinations have shown that GPT-4 achieves an overall accuracy rate of 81% (95% CI 78–84%), significantly outperforming GPT-3.5's accuracy rate of 58% (95% CI 53–63%) [ 6 ].

Regarding performance in specific medical courses, variable quality in anatomical responses has been documented, ranging from extremely good to very poor [ 8 , 9 ]. ChatGPT proved effective in tackling reasoning questions across diverse physiology modules, achieving 74% correctness [ 18 ]. Neuroscience testing revealed topic-specific variations. The strongest performance was seen in Neurocytology, Embryology, and Diencephalon (75–83%), while Brainstem, Cerebellum, and Special senses showed lower results (49–54%). On average, GPT-4 led with 81.7% accuracy, followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%) [ 19 ].

In clinical specialties, studies have shown 68% accuracy rates in diagnostic tasks, with performance decreasing in image-based scenarios [ 11 ]. In head and neck surgery, ChatGPT responded correctly to 84.7% of closed-ended questions and provided accurate diagnoses in 81.7% of clinical scenarios, with room for improvement in procedural details and bibliographic references [ 20 ].
In pathological diagnosis, accuracy reached 89.1%, with good results in infectious pneumonia and atelectasis; diffuse alveolar disease, however, was more difficult (66.7% accuracy) [ 21 ]. The progression in model capabilities is further evidenced by documented increases in performance from 37.2% for GPT-3.5 to 67.8% for GPT-4 in anesthesiology examinations [ 22 ].

Studies of Japanese medical licensing examinations have documented GPT-4o achieving accuracy rates of 89.2%, with approximately a 10% accuracy gap between image and non-image questions [ 6 ]. Evaluations of German medical licensing examinations have shown GPT-4 achieving average scores of 85% and ranking in the 92.8th to 99.5th percentile among exam takers [ 12 ]. Studies of AI versus human-generated multiple-choice questions have found AI-generated questions to be easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but with similar discrimination indices [ 23 ]. Research focusing on curriculum components has shown that interactive case-based and pathology teaching are most helpful in evaluating AI outputs [ 14 ].

Implications of Findings
The remarkable development of LLMs has major consequences for medical education. Given the high accuracy of GPT-4o (92.9%), it could be used as an additional educational assistance tool, especially in self-assessment and examination. The proportion of queries that remain unanswered or are answered incorrectly is considerable (2.5–8% depending on LLM) and calls for instructor supervision, which is consistent with recent studies highlighting the importance of balancing the use of AI technologies with conventional instructional methods.

The varying performance across different topics highlights the importance of subject-specific validation before implementing these tools in educational settings.
It has been shown that performance can differ considerably across specialties and subjects [ 5 , 6 ], which suggests a possible need for more focused training of LLMs in particular medical subjects, as opposed to using a common model.

Further Developments
The fast-changing world of AI in medical education opens multiple research possibilities. As new versions are released, more studies are needed to evaluate the proficiency and reliability of LLMs. Studies using image-based questions and clinical scenarios are necessary, as these areas are important in medical education. The development of a specialized medical education LLM, aimed at addressing the performance variations observed across different medical specialties, would be a very interesting topic for research.

The creation of standardized guidelines for appropriate LLM use in medical education represents another extremely important area for future work. These guidelines can be informed by current implementations as well as future shifts in AI technology.

Strengths and Limitations
The study's key strengths include its comprehensive evaluation of different anatomical topics and its use of several currently available LLMs for benchmarking. The large question bank of 325 MCQs and the use of multiple attempts provide strong data for analysis, while the comparison to historical data and random guessing provides context for the interpretation of the results.

Despite these advantages, this study has a number of limitations. The exclusion of questions with images and tables, while necessary for our study design, limits the generalizability of our results to the full scope of medical education. Also, our focus on MCQs, while providing clear metrics for comparison, does not address other important aspects of medical assessment, such as clinical reasoning and practical skills.
The study was also limited to specific versions of LLMs available during the study period, and the rapid pace of AI development means that newer versions may show different performance characteristics.

Conclusions

AI-driven LLMs today perform significantly better than a year ago on anatomical multiple-choice questions, representing a new frontier in AI application for medical education. The improvement was seen across all tested models, indicating a real step forward in the technology's capability to understand and utilize medical information.

In the analysis of different anatomical topics, LLMs' performance revealed significant variations, with some topics being more accurately addressed than others. The differences were statistically significant irrespective of the models tested, which suggests that they reflect knowledge gaps in some topics that affected AI performance. These results suggest that subject- and discipline-specific tuning is needed to improve LLMs' reliability.

In the comparative analysis of different models, GPT-4o demonstrated clear superiority, answering MCQs most accurately and consistently in all anatomical topics. Claude and Copilot also performed well but were inconsistent on some topics. Such differences in reliability and accuracy between models indicate the need for caution in selecting a model for particular educational purposes.

These results encourage the possible incorporation of LLMs in teaching anatomy and, at the same time, warn against their over-exploitation across different subjects. LLMs should serve only as supplements to conventional methods of medical education, not as replacements for them.

Declarations

Clinical trial number
The clinical trial number is not pertinent to this study as it does not involve medicinal products or therapeutic interventions.
Ethics declarations
Ethics approval, Consent to Participate, and Consent to Publish declarations: not applicable.

Acknowledgment
The authors thank Dr. Inna Shypilova and Dr. Larysa Sankova for their help reviewing the questions.

Statement of Contribution
O. Bolgova designed the research. O. Bolgova and V. Mavrych reviewed the questions and collected and analyzed the data. V. Mavrych did the statistical analysis. All authors were involved in interpreting data, drafting the article, and revising it critically. All have approved the submitted and final versions.

Funding
The authors received no funding for this study.

Conflict of interest
The authors declare no conflicts of interest, financial or otherwise.

Data availability
The data supporting this study's findings are available on request from the corresponding author.

References

1. Abd-Alrazaq A, AlSaad R, Alhuwail D, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ. 2023;9:e48291. Published 2023 Jun 1. doi:10.2196/48291
2. Boscardin CK, Gin B, Golde PB, et al. ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity. Acad Med. 2024;99(1):22-27. doi:10.1097/ACM.0000000000005439
3. Cook DA. Creating virtual patients using large language models: scalable, global, and low cost. Med Teach. 2025;47(1):40-42. doi:10.1080/0142159X.2024.2376879
4. Wilson RN, Holman PJ, Dragan M, et al. The effects of supplemental instruction derived from peer leaders on student outcomes in undergraduate human anatomy. Anat Sci Educ. 2024;17(6):1239-1250. doi:10.1002/ase.2464
5. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ. 2024;24(1):1013. doi:10.1186/s12909-024-05944-8
6. Liu M, Okuhara T, Chang X, et al. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024;26:e60807. doi:10.2196/60807
7. Han Z, Battaglia F, Udaiyar A, et al. An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. Med Teach. 2024;46(5):657-664. doi:10.1080/0142159X.2023.2271159
8. Mavrych V, Ganguly P, Bolgova O. Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis. Clin Anat. 2025;38(2):200-210. doi:10.1002/ca.24244
9. Totlis T, Natsis K, Filos D, et al. The potential role of ChatGPT and artificial intelligence in anatomy education: a conversation with ChatGPT. Surg Radiol Anat. 2023;45(10):1321-1329. doi:10.1007/s00276-023-03229-1
10. Chen A, Chen DO, Tian L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. J Am Med Inform Assoc. 2024;31(9):2084-2088. doi:10.1093/jamia/ocad245
11. Shemer A, Cohen M, Altarescu A, et al. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch Clin Exp Ophthalmol. 2024;262(7):2345-2352. doi:10.1007/s00417-023-06363-z
12. Waldock A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med Educ. 2024;10:e50965. doi:10.2196/50965
13. Surapaneni KM, Rajajagadeesan A, Goudhaman L, et al. Evaluating ChatGPT as a self-learning tool in medical biochemistry: A performance assessment in undergraduate medical university examination. Biochem Mol Biol Educ. 2024;52(2):237-248. doi:10.1002/bmb.21808
14. Waldock WJ, Lam G, Baptista A, et al. Which curriculum components do medical students find most helpful for evaluating AI outputs? BMC Med Educ. 2025;25(1):195. doi:10.1186/s12909-025-06735-5
15. Bolgova O, Shypilova I, Sankova L, et al. How Well Did ChatGPT Perform in Answering Questions on Different Topics in Gross Anatomy? European Journal of Medical and Health Sciences. 2023;5(6):94-100. doi:10.24018/ejmed.2023.5.6.1989
16. Abbas A, Rehman MS, Rehman SS. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus. 2024;16(3):e55991. doi:10.7759/cureus.55991
17. Wei B. Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis. JMIR Med Educ. 2025;11:e64284. Published 2025 Jan 16. doi:10.2196/64284
18. Banerjee A, Ahmad A, Bhalla P, Goyal K. Assessing the Efficacy of ChatGPT in Solving Questions Based on the Core Concepts in Physiology. Cureus. 2023;15(8):e43314. Published 2023 Aug 10. doi:10.7759/cureus.43314. PMID: 37700949; PMCID: PMC10492920
19. Mavrych V, Yaqinuddin A, Bolgova O. Claude, ChatGPT, Copilot, and Gemini Performance versus Students in Different Topics of Neuroscience. Adv Physiol Educ. Published online January 17, 2025. doi:10.1152/advan.00093.2024
20. Vaira LA, Lechien JR, Abbate V, et al. Accuracy of ChatGPT-Generated Information on Head and Neck and Oromaxillofacial Surgery: A Multicenter Collaborative Analysis. Otolaryngol Head Neck Surg. 2024;170(6):1492-1503. doi:10.1002/ohn.489
21. Du W, Jin X, Harris JC, et al. Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions. Ann Diagn Pathol. 2024;73:152392. doi:10.1016/j.anndiagpath.2024.152392
22. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354. doi:10.1186/s12909-024-05239-y
23. Law AK, So J, Lui CT, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ. 2025;25(1):208. doi:10.1186/s12909-025-06796-6

Additional Declarations
No competing interests reported.
Status: Published
Journal publication published 28 Oct, 2025 (Scientific Reports)
Version 1 posted
Editorial decision: Revision requested 14 Aug, 2025
Reviews received at journal 05 Aug, 2025
Reviewers agreed at journal 24 Jul, 2025
Reviews received at journal 25 May, 2025
Reviewers agreed at journal 14 May, 2025
Reviewers invited by journal 20 Mar, 2025
Editor assigned by journal 20 Mar, 2025
Editor invited by journal 20 Mar, 2025
Submission checks completed at journal 20 Mar, 2025
First submitted to journal 13 Mar, 2025
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-6219785","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":431749588,"identity":"0ca1e091-aac0-41e8-8231-54143cfbd570","order_by":0,"name":"Olena Bolgova","email":"","orcid":"","institution":"Alfaisal University","correspondingAuthor":false,"prefix":"","firstName":"Olena","middleName":"","lastName":"Bolgova","suffix":""},{"id":431749589,"identity":"991ee1d7-8243-40d2-af45-fb9855eef86b","order_by":1,"name":"Volodymyr Mavrych","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8klEQVRIiWNgGAWjYBCDBH4G5oYPDAYMDHxAngRRWiQbGBtngLSwEa3F4ABICwMRWgyuHX4mXdlml2d8/GBjw4+COnk2BuaDt3nwabmdZiZ5ti252OxMYmNjj8FhwzYGtmRrfFokZycYGza2HUjcdiCx/QEP0HltDDxm0vi1pH8Ga9nc/7Cx8Y9BnX0bA/83vFr4pXMMH4K0bJBIbGzmMWBOBNrCRkhL4cOGc8mJM248bGyWMTic3MbMZmw5B48WNun0DQcbyuwS+/uTDza++VNn28/e/PDGGzxawICRDZnHTEg5GPwhStUoGAWjYBSMVAAAb0lOmkgpHsoAAAAASUVORK5CYII=","orcid":"","institution":"Alfaisal University","correspondingAuthor":true,"prefix":"","firstName":"Volodymyr","middleName":"","lastName":"Mavrych","suffix":""}],"badges":[],"createdAt":"2025-03-13 
11:53:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6219785/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6219785/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41598-025-22437-w","type":"published","date":"2025-10-28T15:57:15+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":79072272,"identity":"fc53cbd2-b342-4087-94f3-267c6e3dab54","added_by":"auto","created_at":"2025-03-24 06:24:56","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":116680,"visible":true,"origin":"","legend":"\u003cp\u003ePercentile of correct answers from different chatbots on 325 MCQs from the Gross Anatomy course. Y-axis: % of correct answers; X-axis: different LLMs’ results\u003c/p\u003e","description":"","filename":"Fig.1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6219785/v1/2ffc76e1088437b7e57b19c1.jpg"},{"id":79072273,"identity":"85a488ea-5074-4e45-b9fb-9903af17cccf","added_by":"auto","created_at":"2025-03-24 06:24:56","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":52633,"visible":true,"origin":"","legend":"\u003cp\u003eHeatmap of LLMs' topic-wise performance in the Gross Anatomy course. Numbers are % of correct answers in the specific topic for each chatbot.\u003c/p\u003e","description":"","filename":"Fig.2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6219785/v1/e3bb2896ca47c061f953da4a.jpg"},{"id":79072275,"identity":"e94fb0a2-c11b-4257-8f7a-e966a1a65e6a","added_by":"auto","created_at":"2025-03-24 06:24:56","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":61343,"visible":true,"origin":"","legend":"\u003cp\u003ePercentile of correct answers GPT-4o and GPT-3.5 on 325 MCQs from the Gross Anatomy course. 
Y-axis: % of correct answers; X-axis: topics/regions\u003c/p\u003e","description":"","filename":"Fig.3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6219785/v1/2ad32c00943f0bbca5ad5eca.jpg"},{"id":95039921,"identity":"cca6f779-b1ad-4742-8704-16a3f4e0b559","added_by":"auto","created_at":"2025-11-03 16:05:46","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":856907,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6219785/v1/2262d828-6c14-45bf-b0a1-674b1efa2afa.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Evolution of AI in Anatomy Education: Comparing Current Large Language Models Against Historical ChatGPT Performance on USMLE-Style Questions","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe introduction of AI-driven large language models (LLMs) raised great interest in their use in medical education and assessment. Ever since ChatGPT\u0026rsquo;s first launch in November 2022, these models have excelled in elaborate processes like text processing and human-like speech generation [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eHowever, the rapid advancement of AI in medical education brings important issues regarding an AI-integrated future of medical assessment and training. As LLMs become more sophisticated, it is important to understand how these educational technologies can be incorporated into existing frameworks without compromising the quality of medical education. Many studies have demonstrated the ability of AI to improve students\u0026rsquo; learning. 
However, it is still best utilized alongside traditional teaching practices [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eTo evaluate LLMs\u0026rsquo; proficiency and reliability, many researchers are studying how these models handle complex medical concepts and clinical reasoning in different examinations. These models, especially GPT-4, have performed remarkably well on a variety of medical licensing examinations worldwide. For instance, GPT-4 achieved 64.4\u0026ndash;100% accuracy across numerous medical licensing examinations, compared with 36\u0026ndash;77% for GPT-3.5 [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. These conclusions are supported by meta-analyses of LLM results across model versions and examination formats, in which GPT-4 achieved, on average, an overall accuracy rate of 81%, significantly surpassing the 58% accuracy rate of GPT-3.5 [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe integration of AI into medical education presents both opportunities and challenges. While these tools add value for learning and examination purposes, there is some skepticism about their dependability and potential for error. Studies indicate that LLMs tend to \"hallucinate,\" that is, to generate false information while presenting it confidently, which is why caution is needed when deploying them [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. 
The usefulness of AI for creating educational materials, especially multiple-choice questions and assessments, has been studied, but the results show no consensus on how AI-generated content compares with human-written content [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eSome researchers have found that LLMs\u0026rsquo; performance varies significantly depending on the discipline and the nature of the assessment. For example, although ChatGPT proved useful as an engaging pedagogic tool for anatomy education, its ability to give detailed anatomical descriptions and to create acceptable images was limited [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. Similarly, in some clinical specialties the performance of these models is comparable to that of junior medical residents but not to that of experienced clinicians [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn addition, the effects of AI on medical education are not uniform across languages and healthcare systems. Several studies suggest that LLM performance varies significantly in accuracy and reliability between language versions [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. 
This also demonstrates the need to factor in culture and language when considering the application of AI-based educational materials in other regions, especially where medical education is delivered in languages other than English.\u003c/p\u003e \u003cp\u003eIt has been established that AI-generated content, when integrated into curricula, is a very effective way of reinforcing what is taught in the classroom, provided such content is carefully reviewed by qualified instructors [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. Recent research has specifically focused on identifying which curriculum components best prepare students to evaluate AI outputs critically [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Chatbots\u0026rsquo; ability to perform content generation and knowledge assessment is impressive; however, their limitations and possible risks must be considered.\u003c/p\u003e \u003cp\u003eThe aim of this research was to evaluate the progress that ChatGPT has made over the last year and answer the following research questions:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eWhat is the performance accuracy of GPT-4o compared to the previous version of GPT-3.5 in answering USMLE-style MCQs across different anatomical topics?\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eHow does the performance of different LLMs (Claude, Copilot, and Gemini) differ across various anatomical topics compared to GPT-4o results?\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eHow do the different LLMs' accuracy and reliability compare to each other?\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy design\u003c/h2\u003e \u003cp\u003eThis research evaluated the performance of the four most popular currently available large language models, 
GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Flash (Google), and Copilot (Microsoft), versus the earlier ChatGPT (GPT-3.5) on their proficiency in different anatomical topics. A total of 325 USMLE-style MCQs, each with five options and a single correct answer, were randomly chosen from the Gross Anatomy course's examination database for medical students and validated by three independent experts in our previous research [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. The study did not include questions with images or tables. The selected questions encompassed various levels of complexity. They were distributed across seven distinct topics/regions: Abdomen (50 MCQs), Back (25 MCQs), Head and Neck (50 MCQs), Lower Limb (50 MCQs), Pelvis (50 MCQs), Thorax (50 MCQs), and Upper Limb (50 MCQs), so the entire questionnaire comprised 325 questions.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eData collection\u003c/h3\u003e\n\u003cp\u003eEach selected chatbot was required to answer the full questionnaire for the testing phase. GPT-4-1106, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Copilot proficiency in responding to multiple-choice questions was assessed in January 2025. ChatGPT (GPT-3.5) responses were recorded in October 2023 [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Each chatbot was given the prompt \u0026ldquo;Generate the list of correct answers for the following MCQs:\u0026rdquo; followed by the MCQ set for each specific topic, one topic at a time. This procedure was carried out three times for each chatbot, with no fixed interval between attempts. The results of these three successive attempts by each chatbot to answer the questionnaire were recorded in a Microsoft Excel spreadsheet (Microsoft\u0026reg;365) and evaluated for accuracy. 
A total of 4,875 answers from LLMs were analyzed.\u003c/p\u003e \u003cp\u003eTo compare chatbots\u0026rsquo; results with random guessing, three random sets of answers were generated for the same questionnaire using the RAND() function in Microsoft Excel and analyzed.\u003c/p\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eData analysis\u003c/h2\u003e \u003cp\u003eThe data from each of the three attempts were matched with the answer key and compared with results from previous attempts to determine the percentage of correct and repeated answers. After that, a detailed item analysis was performed by topic and by question for each LLM.\u003c/p\u003e \u003cp\u003eBasic statistical analysis was conducted using Statistica 13.5.0.17 (TIBCO\u0026reg; Statistica\u0026trade;), with the Pearson chi-squared test employed to compare performance between different topics and LLMs using a significance threshold of p\u0026thinsp;\u0026le;\u0026thinsp;0.05.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eAccording to our data, on average, the four tested LLMs (GPT-4o, Claude, Copilot, and Gemini) correctly answered 76.8\u0026thinsp;\u0026plusmn;\u0026thinsp;12.2% of the 325 MCQs from 7 topics in the Gross Anatomy course. This result was 27.7% above the GPT-3.5 results from a year earlier (44.4\u0026thinsp;\u0026plusmn;\u0026thinsp;8.5%) and 3.7 times better than randomly generated responses (19.4\u0026thinsp;\u0026plusmn;\u0026thinsp;5.9%) for the same questionnaire (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThere was a significant variation in correct responses among the current LLMs. 
The best results were shown by GPT-4o (92.9\u0026thinsp;\u0026plusmn;\u0026thinsp;2.5%), followed by Claude (76.7\u0026thinsp;\u0026plusmn;\u0026thinsp;5.7%), Copilot (73.9\u0026thinsp;\u0026plusmn;\u0026thinsp;11.9%), and Gemini (63.7\u0026thinsp;\u0026plusmn;\u0026thinsp;6.5%).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn the box plot analysis of AI system performance, GPT-4o demonstrated consistent performance across topics, with scores tightly clustered between 88% and 95.3%, whereas Copilot showed the widest variation, from 56% to 89.3%.\u003c/p\u003e \u003cp\u003eChi-square analysis revealed that all LLMs showed a statistically significant deviation from the expected uniform distribution of correct answers (χ\u0026sup2; = 182.11\u0026ndash;518.32, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). This means that the null hypothesis of uniform performance across topics and models can be rejected, and there is a statistically significant relationship between LLM performance and both model type and topic/anatomical region.\u003c/p\u003e \u003cp\u003eNext, a detailed topic-wise evaluation of the results from all current LLMs (GPT-4o, Claude, Copilot, and Gemini) was performed and compared with the ChatGPT-3.5 performance of a year earlier (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAcross all attempts, only 29.5% (96/325) of questions were answered correctly by all four current LLMs (GPT-4o, Claude, Copilot, and Gemini). General item analysis revealed that Head \u0026amp; Neck and Abdomen were the two best categories, with average results for these LLMs of 79.5% and 78.7%, respectively. In contrast, the lowest results were recorded for Upper Limb questions (72.9%). 
Statistical analysis revealed significant differences in performance between topics across all LLMs (all p-values\u0026thinsp;\u0026lt;\u0026thinsp;0.001). The highest variation was calculated for the Upper Limb questions (χ\u0026sup2; = 243.88) and the lowest for the Back (χ\u0026sup2; = 109.25).\u003c/p\u003e \u003cp\u003e2.5% (8/325) of the questions were never answered correctly by any LLM. Item analysis revealed that all of them were high-level critical-thinking questions, evenly distributed among the different topics (1\u0026ndash;2 per topic).\u003c/p\u003e\n\u003ch3\u003eComparative analysis of GPT-4o and GPT-3.5 performance (OpenAI)\u003c/h3\u003e\n\u003cp\u003eThe results of three successive GPT-4o attempts to answer the 325 Gross Anatomy MCQs in January 2025 showed 92.9\u0026thinsp;\u0026plusmn;\u0026thinsp;2.5% correct answers, 48.5 percentage points (χ\u0026sup2; = 270.67, \u003cem\u003ep\u0026thinsp;\u0026lt;\u0026thinsp;0.001\u003c/em\u003e) higher than GPT-3.5 performance in October 2023 (44.4\u0026thinsp;\u0026plusmn;\u0026thinsp;8.5%). Interestingly, for both generations of ChatGPT, the results gradually increased with each successive attempt: 91.7%, 93.2%, and 94.8% correct answers for GPT-4o, and 42.8%, 43.1%, and 44% for GPT-3.5.\u003c/p\u003e \u003cp\u003eThe agreement of GPT-4o's answers with its earlier attempts was 96.6% \u0026minus;\u0026thinsp;98.2%, and among them, the agreement of correct answers was 91.4% \u0026minus;\u0026thinsp;93.2%, so consistency and reliability were very good. 
The previous model, GPT-3.5, did not show such results a year ago: agreement with previously generated answers was 56%\u0026ndash;61.8%, and only 31.7%\u0026ndash;32.3% of those repeated answers were correct, so the answers were mostly unreliable.\u003c/p\u003e \u003cp\u003eTopic-wise analysis revealed the largest performance gaps between the two models for Thorax, Upper Limb, and Lower Limb, and the smallest for Back, Head \u0026amp; Neck, and Pelvis (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eGPT-4o's best-performing topics were Pelvis (0.953 mean, 46/49 perfect scores), Upper Limb (0.947 mean, 45/49 perfect scores), and Thorax (0.94 mean, 46/49 perfect scores). GPT-3.5 demonstrated its best results in the following topics: Back (0.60 mean, 11/24 perfect scores), Head \u0026amp; Neck (0.50 mean, 17/49 perfect scores), and Pelvis (0.46 mean, 18/49 perfect scores).\u003c/p\u003e \u003cp\u003eGPT-4o answered 91.1% (296/325) of questions correctly in all three attempts, a marked improvement over the GPT-3.5 performance of a year earlier, when only 28.3% (92/325) of questions were consistently answered correctly.\u003c/p\u003e \u003cp\u003eGPT-4o failed to answer correctly in any of the three attempts only 5.2% (17/325) of MCQs from the entire questionnaire, whereas GPT-3.5 was unable to answer 37.8% (123/325) of the questions.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eClaude 3.5 Sonnet (Anthropic)\u003c/h2\u003e \u003cp\u003eClaude, across three attempts, provided 76.7\u0026thinsp;\u0026plusmn;\u0026thinsp;5.7% correct answers to the same questionnaire, 16.2 percentage points lower (\u003cem\u003ep\u0026thinsp;\u0026lt;\u0026thinsp;0.001\u003c/em\u003e) than GPT-4o. The first attempt was the most successful, with 78.8% correct answers, followed by 76% and 75.5% in the second and third attempts, so its trend across attempts was the opposite of that of the ChatGPT models. 
The agreement of Claude's answers with its previous attempts was 86.8% \u0026minus;\u0026thinsp;89.2%, and among them, the agreement of correct answers was 71.7% \u0026minus;\u0026thinsp;73.5%, indicating relatively good consistency. The item analysis suggested that Claude correctly answered 80.7% \u0026minus;\u0026thinsp;86.7% of questions from the Lower Limb and Pelvis topics, and its two worst topics were Upper Limb and Abdomen, at 69.3% \u0026minus;\u0026thinsp;72%. Results for the rest of the topics were in the mid-70s. Claude correctly answered 70.5% (229/325) of questions across all attempts and did not solve 17.2% (56/325) of MCQs. These were comprehensive questions from different topics.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eCopilot (Microsoft)\u003c/h3\u003e\n\u003cp\u003eThe disadvantage of working with Copilot is that it can only accept up to 4000 characters in the prompt, so only 15\u0026ndash;25 MCQs can be submitted at once. However, the big advantage of this LLM is that Copilot is integrated into Microsoft's working space (Windows, Office, web browser) and is always available. Copilot generated 73.9\u0026thinsp;\u0026plusmn;\u0026thinsp;11.9% accurate answers for 325 MCQs from the Gross Anatomy course, showing the third-best result. It is 19 percentage points (\u003cem\u003ep\u0026thinsp;\u0026lt;\u0026thinsp;0.001\u003c/em\u003e) below GPT-4o but only 2.8 percentage points below Claude's results. Across attempts, it showed the same trend as ChatGPT, with rising results: 65.5%, 72%, and 80.6%. The agreement of Copilot's answers with its earlier attempts was 74.8% \u0026minus;\u0026thinsp;85.2%; among them, the agreement of correct answers was 60.6% \u0026minus;\u0026thinsp;69.8%. The high standard deviation (11.9%) suggested more variability in its performance and, consequently, low reliability. Copilot solved 59.1% of MCQs (192/325) across all three attempts; however, it could not answer 16% (52/325) of the questions. 
These MCQs were mostly from Thorax and Pelvis material. The item analysis revealed that Copilot performed well on Abdomen and Back questions (87.3%\u0026ndash;89.3%), and its two lowest results were in Pelvis and Thorax material (56%\u0026ndash;64.8%).\u003c/p\u003e\n\u003ch3\u003eGemini 1.5 Flash (Google)\u003c/h3\u003e\n\u003cp\u003eAmong current LLMs, Gemini finished last with 63.7\u0026thinsp;\u0026plusmn;\u0026thinsp;6.5% correct answers to the same set of questions. This result was 28.5 percentage points below GPT-4o\u0026rsquo;s performance but 19.3 percentage points above GPT-3.5 performance; both differences were statistically significant (\u003cem\u003ep\u0026thinsp;\u0026lt;\u0026thinsp;0.001\u003c/em\u003e). The first two attempts showed almost identical results, 60.9% and 60% correct answers; the third one was the most successful, with 71.4% correct answers. The agreement of Gemini's answers with its previous attempts was 62.8% \u0026minus;\u0026thinsp;85.2%, and among them, the agreement of correct answers was 50.8% \u0026minus;\u0026thinsp;55.4%, with a moderate standard deviation of 6.5%.\u003c/p\u003e \u003cp\u003eGemini correctly answered 47.7% (155/325) of questions across all attempts and did not solve 17.8% of MCQs (58/325). 
Item performance analysis revealed that Gemini's two best topics were Pelvis and Head \u0026amp; Neck (71.3%-72.6%), and the lowest result was answering Upper Limb questions \u0026minus;\u0026thinsp;56%.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eDifference in LLMs performance\u003c/h2\u003e \u003cp\u003eDue to the binary nature of the data, we employed the Pearson Chi-square test to compare the performance of the different AI-driven chatbots against each other (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eResults of Pearson Chi-square test to compare the performance of Copilot, Claude, GPT-4o, and Gemini against each other\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLLMs\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChi-square\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003edf\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eP-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eGPT-4o vs Claude\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e46.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e3.54E-10*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o vs Copilot\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e93.56\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e6.49E-20*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o vs Gemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e150.53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.52E-32*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o vs GPT-3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e270.67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.87E-58*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude vs Copilot\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e18.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2.72E-04*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude vs Gemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e49.76\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2.00E-11*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude vs GPT-3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e121.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5.94E-26*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCopilot vs Gemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e17.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2.08E-04*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCopilot vs GPT-3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e86.59\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.99E-19*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eGemini vs GPT-3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e41.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e3.83E-09*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e* - Statistically significant difference.\u003c/p\u003e \u003cp\u003eAll p-values were extremely small (much smaller than 0.05 or even 0.001), indicating that the performance differences between all model pairs are highly statistically significant. The smallest p-values are observed in comparisons involving GPT-4o with other models. The relatively larger (but still very small) p-values are found in Copilot vs Gemini and Claude vs Copilot.\u003c/p\u003e \u003cp\u003eThese results quantify the statistical significance of the performance differences we observed, with all comparisons showing extremely strong evidence of real differences in performance distributions between the models.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003ePrincipal Findings\u003c/h2\u003e \u003cp\u003eA thorough evaluation of our data explains the dramatic progress achieved by contemporary LLMs in resolving anatomical multiple-choice questions. Currently, LLMs achieve an average accuracy of 76.8\u0026thinsp;\u0026plusmn;\u0026thinsp;12.2%. This represents a dramatic increase over last year's GPT-3.5 performance (44.4\u0026thinsp;\u0026plusmn;\u0026thinsp;8.5%) and random answers (19.4\u0026thinsp;\u0026plusmn;\u0026thinsp;5.9%). 
This improvement reflects considerable strides in the AI's ability to understand and utilize medical information.\u003c/p\u003e \u003cp\u003eAmong all the models tested, GPT-4o stood out as the best performer with an accuracy of 92.9\u0026thinsp;\u0026plusmn;\u0026thinsp;2.5%, followed by Claude (76.7\u0026thinsp;\u0026plusmn;\u0026thinsp;5.7%), Copilot (73.9\u0026thinsp;\u0026plusmn;\u0026thinsp;11.9%) and Gemini (63.7\u0026thinsp;\u0026plusmn;\u0026thinsp;6.5%). The ranking of the LLMs\u0026rsquo; performance remained the same across repeated attempts, although some models were more reliable than others. Most striking was the consistency of GPT-4o's accuracy across different anatomical topics, which ranged from 88% to 95.3%, whereas Copilot ranged from 56% to 89.3%.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eComparison to Literature\u003c/h2\u003e \u003cp\u003eRecent research supports and extends our findings in the field of AI-assisted medical education. On National Board of Medical Examiners sample questions, GPT-4 achieved a perfect score of 100%, ahead of GPT-3.5 (82.21%), Claude (84.66%), and Bard (75.46%) [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. In an analysis of radiology board exam questions, GPT-4 scored 83.3%, well above Claude (62%), Gemini (55.3%), and Bard (54.7%), and performed better in pattern recognition (85%) than in intervention planning (71%) [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. 
Meta-analyses of medical licensing examinations have shown that GPT-4 achieves an overall accuracy rate of 81% (95% CI 78\u0026ndash;84%), significantly outperforming GPT-3.5's accuracy rate of 58% (95% CI 53\u0026ndash;63%) [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eRegarding specific medical course performances, variable quality in anatomical responses has been documented, with accuracy rates ranging from extremely good to very poor quality [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. ChatGPT showed its effectiveness in tackling reasoning questions across diverse physiology modules, achieving an impressive 74% correctness [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Neuroscience testing revealed topic-specific variations. The strongest performance was seen in Neurocytology, Embryology, and Diencephalon (75\u0026ndash;83%), while Brainstem, Cerebellum, and Special senses showed lower results (49\u0026ndash;54%). On average, GPT-4 led with 81.7% accuracy, followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%) [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn clinical specialties, studies have shown 68% accuracy rates in diagnostic tasks, with performance decreasing when dealing with image-based scenarios [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. In head and neck surgery, it responded correctly to 84.7% of closed-ended questions. It provided accurate diagnoses in 81.7% of clinical scenarios, with room for improvement in procedural details and bibliographic references [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
Pathological diagnosis achieved an accuracy of 89.1%, achieving good results in infectious pneumonia and atelectasis; diffuse alveolar disease, however, was more difficult (66.7% accuracy) [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. The progression in model capabilities is further evidenced by documented increases in performance from 37.2% for GPT-3.5 to 67.8% for GPT-4 in anesthesiology examinations [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eStudies of Japanese medical licensing examinations have documented GPT-4o achieving accuracy rates of 89.2%, with approximately a 10% accuracy gap between image and non-image questions [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Evaluations of German medical licensing examinations have shown GPT-4 achieving average scores of 85% and ranking in the 92.8th to 99.5th percentile among exam takers [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eStudies of AI versus human-generated multiple-choice questions have found AI-generated questions to be easier (mean difficulty index\u0026thinsp;=\u0026thinsp;0.78\u0026thinsp;\u0026plusmn;\u0026thinsp;0.22 vs. 0.69\u0026thinsp;\u0026plusmn;\u0026thinsp;0.23, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01) but with similar discrimination indices [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. Research focusing on curriculum components has shown that interactive case-based and pathology teaching are most helpful in evaluating AI outputs [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eImplications of Findings\u003c/h2\u003e \u003cp\u003eThe remarkable development of LLMs has dramatic consequences for medical education. 
Given the high accuracy of GPT-4o (92.9%), it could serve as a supplementary educational tool, especially for self-assessment and examination. The proportion of queries that remain unanswered or are answered incorrectly is considerable (2.5%\u0026ndash;8% depending on the LLM) and calls for instructor supervision, which is consistent with recent studies highlighting the importance of balancing the use of AI technologies with conventional instructional methods.\u003c/p\u003e \u003cp\u003eThe varying performance across different topics highlights the importance of subject-specific validation before implementing these tools in educational settings. It has been shown that performance can differ considerably across specialties and subjects [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], implying a possible need for a more focused approach to training LLMs in particular medical subjects, as opposed to using a common model.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eFurther Developments\u003c/h2\u003e \u003cp\u003eThe fast-changing world of AI in medical education opens multiple research possibilities. As new versions are released, more studies are needed to evaluate the proficiency and reliability of LLMs. Studies utilizing image-based questions and clinical scenarios are necessary, as these areas are important in medical education.\u003c/p\u003e \u003cp\u003eDeveloping a specialized medical education LLM to address the performance variations observed across different medical specialties would be an interesting topic for research.\u003c/p\u003e \u003cp\u003eThe creation of standardized guidelines for appropriate LLM use in medical education represents another important area for future work. 
These guidelines can be informed by current implementations as well as future shifts in AI technology.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eStrengths and Limitations\u003c/h2\u003e \u003cp\u003eThe key strengths of this study are its comprehensive evaluation of different anatomical topics and its use of several currently available LLMs for benchmarking. The large question bank of 325 MCQs and the repeated attempts by each model provide strong data for analysis, while the comparison to historical data and random guessing provides context for the interpretation of the results.\u003c/p\u003e \u003cp\u003eDespite these advantages, this study has a number of limitations. The exclusion of questions with images and tables, while necessary for our study design, limits the generalizability of our results to the full scope of medical education. Also, our focus on MCQs, while providing clear metrics for comparison, does not address other important aspects of medical assessment, such as clinical reasoning and practical skills. The study was also limited to specific versions of LLMs available during the study period, and the rapid pace of AI development means that newer versions may show different performance characteristics.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eToday's LLMs perform significantly better on anatomical multiple-choice questions than the model tested a year ago, representing a new frontier in AI application for medical education. The improvement was seen across all tested models, indicating that a real step forward has been achieved in the technology's capability to understand and utilize medical information.\u003c/p\u003e \u003cp\u003eIn the analysis of different anatomical topics, LLMs\u0026rsquo; performance revealed significant variations, with some topics being more accurately addressed than others. 
The differences were statistically significant irrespective of the model tested, which suggests that they stem from knowledge gaps in particular topics. These results indicate that subject-specific tuning is needed to improve LLMs\u0026rsquo; reliability.\u003c/p\u003e \u003cp\u003eIn the comparative analysis of different models, a clear superiority was demonstrated by GPT-4o, which answered MCQs more accurately and more consistently than the other models in all anatomical topics. Claude and Copilot also performed well but were inconsistent on some topics. These differences in reliability and accuracy between the models indicate the need for caution in selecting a model for particular educational purposes.\u003c/p\u003e \u003cp\u003eThese results encourage the possible incorporation of LLMs in teaching anatomy and, simultaneously, warn against their overuse across different subjects. LLMs should serve only as supplements to conventional instructional methods, not as replacements for them.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e\u003cem\u003eClinical trial number\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe clinical trial number is not pertinent to this study as it does not involve medicinal products or therapeutic interventions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eEthics declarations\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEthics approval, Consent to Participate, and Consent to Publish declarations: not applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eAcknowledgment\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors thank Dr. Inna Shypilova and Dr. 
Larysa Sankova for their help reviewing the questions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eStatement of Contribution\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eO. Bolgova designed the research. O. Bolgova and V. Mavrych reviewed the questions and collected and analyzed the data. V. Mavrych did the statistical analysis. All authors were involved in interpreting data, drafting the article, and revising it critically. All have approved the submitted and final versions.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eFunding\u0026nbsp;\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors received no funding for this study.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eConflict of interest\u0026nbsp;\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no conflicts of interest, financial or otherwise.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eData availability\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data supporting this study's findings are available on request from the corresponding author.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAbd-Alrazaq A, AlSaad R, Alhuwail D, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ. 2023;9:e48291. Published 2023 Jun 1. doi:10.2196/48291\u003c/li\u003e\n\u003cli\u003eBoscardin CK, Gin B, Golde PB, et al. ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity. \u003cem\u003eAcad Med\u003c/em\u003e. 2024;99(1):22-27. doi:10.1097/ACM.0000000000005439\u003c/li\u003e\n\u003cli\u003eCook DA. Creating virtual patients using large language models: scalable, global, and low cost. \u003cem\u003eMed Teach\u003c/em\u003e. 2025;47(1):40-42. 
doi:10.1080/0142159X.2024.2376879\u003c/li\u003e\n\u003cli\u003eWilson RN, Holman PJ, Dragan M, et al. The effects of supplemental instruction derived from peer leaders on student outcomes in undergraduate human anatomy. \u003cem\u003eAnat Sci Educ\u003c/em\u003e. 2024;17(6):1239-1250. doi:10.1002/ase.2464\u003c/li\u003e\n\u003cli\u003eJin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. \u003cem\u003eBMC Med Educ\u003c/em\u003e. 2024;24(1):1013. doi:10.1186/s12909-024-05944-8\u003c/li\u003e\n\u003cli\u003eLiu M, Okuhara T, Chang X, et al. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. \u003cem\u003eJ Med Internet Res\u003c/em\u003e. 2024;26:e60807. doi:10.2196/60807\u003c/li\u003e\n\u003cli\u003eHan Z, Battaglia F, Udaiyar A, et al. An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. \u003cem\u003eMed Teach\u003c/em\u003e. 2024;46(5):657-664. doi:10.1080/0142159X.2023.2271159\u003c/li\u003e\n\u003cli\u003eMavrych V, Ganguly P, Bolgova O. Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis. \u003cem\u003eClin Anat\u003c/em\u003e. 2025;38(2):200-210. doi:10.1002/ca.24244\u003c/li\u003e\n\u003cli\u003eTotlis T, Natsis K, Filos D, et al. The potential role of ChatGPT and artificial intelligence in anatomy education: a conversation with ChatGPT. Surg Radiol Anat. 2023;45(10):1321-1329. doi:10.1007/s00276-023-03229-1\u003c/li\u003e\n\u003cli\u003eChen A, Chen DO, Tian L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. \u003cem\u003eJ Am Med Inform Assoc\u003c/em\u003e. 2024;31(9):2084-2088. doi:10.1093/jamia/ocad245\u003c/li\u003e\n\u003cli\u003eShemer A, Cohen M, Altarescu A, et al. 
Diagnostic capabilities of ChatGPT in ophthalmology. \u003cem\u003eGraefes Arch Clin Exp Ophthalmol\u003c/em\u003e. 2024;262(7):2345-2352. doi:10.1007/s00417-023-06363-z\u003c/li\u003e\n\u003cli\u003eWaldock A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. \u003cem\u003eJMIR Med Educ\u003c/em\u003e. 2024;10:e50965. doi:10.2196/50965\u003c/li\u003e\n\u003cli\u003eSurapaneni KM, Rajajagadeesan A, Goudhaman L, et al. Evaluating ChatGPT as a self-learning tool in medical biochemistry: A performance assessment in undergraduate medical university examination. \u003cem\u003eBiochem Mol Biol Educ\u003c/em\u003e. 2024;52(2):237-248. doi:10.1002/bmb.21808\u003c/li\u003e\n\u003cli\u003eWaldock WJ, Lam G, Baptista A, et al. Which curriculum components do medical students find most helpful for evaluating AI outputs?. \u003cem\u003eBMC Med Educ\u003c/em\u003e. 2025;25(1):195. doi:10.1186/s12909-025-06735-5\u003c/li\u003e\n\u003cli\u003eBolgova, O., Shypilova, I., Sankova, L., et al. How Well Did ChatGPT Perform in Answering Questions on Different Topics in Gross Anatomy?. European Journal of Medical and Health Sciences, 2023;5(6):94-100. doi:10.24018/ejmed.2023.5.6.1989\u003c/li\u003e\n\u003cli\u003eAbbas A, Rehman MS, Rehman SS. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus. 2024;16(3):e55991. doi:10.7759/cureus.55991\u003c/li\u003e\n\u003cli\u003eWei B. Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis. JMIR Med Educ. 2025;11:e64284. Published 2025 Jan 16. doi:10.2196/64284\u003c/li\u003e\n\u003cli\u003eBanerjee A, Ahmad A, Bhalla P, Goyal K. Assessing the Efficacy of ChatGPT in Solving Questions Based on the Core Concepts in Physiology. Cureus. 2023 Aug 10;15(8):e43314. 
doi: 10.7759/cureus.43314. PMID: 37700949; PMCID: PMC10492920.\u003c/li\u003e\n\u003cli\u003eMavrych V, Yaqinuddin A, Bolgova O. Claude, ChatGPT, Copilot, and Gemini Performance versus Students in Different Topics of Neuroscience. Adv Physiol Educ. Published online January 17, 2025. doi:10.1152/advan.00093.2024\u003c/li\u003e\n\u003cli\u003eVaira LA, Lechien JR, Abbate V, et al. Accuracy of ChatGPT-Generated Information on Head and Neck and Oromaxillofacial Surgery: A Multicenter Collaborative Analysis. \u003cem\u003eOtolaryngol Head Neck Surg\u003c/em\u003e. 2024;170(6):1492-1503. doi:10.1002/ohn.489\u003c/li\u003e\n\u003cli\u003eDu W, Jin X, Harris JC, et al. Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions. Ann Diagn Pathol. 2024;73:152392. doi:10.1016/j.anndiagpath.2024.152392\u003c/li\u003e\n\u003cli\u003eArtsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. \u003cem\u003eBMC Med Educ\u003c/em\u003e. 2024;24(1):354. doi:10.1186/s12909-024-05239-y\u003c/li\u003e\n\u003cli\u003eLaw AK, So J, Lui CT, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. \u003cem\u003eBMC Med Educ\u003c/em\u003e. 2025;25(1):208. doi:10.1186/s12909-025-06796-6\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence, Medical Education, Anatomy, Large Language Models, Assessment, ChatGPT","lastPublishedDoi":"10.21203/rs.3.rs-6219785/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6219785/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe integration of Large Language Models (LLMs) in medical education has gained significant attention, particularly in their ability to handle complex medical knowledge assessments. However, comprehensive evaluation of their performance in anatomical education remains limited. To evaluate the performance accuracy of current LLMs compared to previous versions in answering anatomical multiple-choice questions and assessing their reliability across different anatomical topics.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe analyzed the performance of four LLMs (GPT-4o, Claude, Copilot, and Gemini) on 325 USMLE-style MCQs covering seven anatomical topics. Each model attempted the questions three times. Results were compared with the previous year's GPT-3.5 performance and random guessing. 
Statistical analysis included chi-square tests for performance differences.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCurrent LLMs achieved an average accuracy of 76.8 ± 12.2%, significantly higher than GPT-3.5 (44.4 ± 8.5%) and random responses (19.4 ± 5.9%). GPT-4o demonstrated the highest accuracy (92.9 ± 2.5%), followed by Claude (76.7 ± 5.7%), Copilot (73.9 ± 11.9%), and Gemini (63.7 ± 6.5%). Performance varied significantly across anatomical topics, with Head \u0026amp; Neck (79.5%) and Abdomen (78.7%) showing the highest accuracy rates, while Upper Limb questions showed the lowest performance (72.9%). Only 29.5% of questions were answered correctly by all LLMs, and 2.5% were never answered correctly. Statistical analysis confirmed significant differences between models and across topics (χ² = 182.11–518.32, p \u0026lt; 0.001).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCurrent LLMs show markedly improved performance in anatomical knowledge assessment compared to previous versions, with GPT-4o demonstrating superior accuracy and consistency. However, performance variations across anatomical topics and between models suggest the need for careful consideration in educational applications. 
These tools show promise as supplementary resources in medical education while highlighting the continued necessity for human expertise.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical trial number\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e","manuscriptTitle":"Evolution of AI in Anatomy Education: Comparing Current Large Language Models Against Historical ChatGPT Performance on USMLE-Style Questions","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-03-24 06:24:51","doi":"10.21203/rs.3.rs-6219785/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-08-14T17:35:58+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-05T07:05:47+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"57392388573846662598208178200598350578","date":"2025-07-24T10:48:24+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-25T20:15:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"80724490620399907599474009546773480405","date":"2025-05-14T12:33:43+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-03-20T19:13:10+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-03-20T19:12:46+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-03-20T15:53:13+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-03-20T15:47:41+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-03-13T11:41:57+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"53e6ae4b-7630-4ac7-b0ee-fd747cefde5f","owner":[],"postedDate":"March 24th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":45989205,"name":"Health sciences/Medical research/Pre clinical studies"},{"id":45989206,"name":"Health sciences/Anatomy"},{"id":45989207,"name":"Health sciences/Anatomy/Endocrine system"},{"id":45989208,"name":"Health sciences/Anatomy/Gastrointestinal system"},{"id":45989209,"name":"Health sciences/Anatomy/Musculoskeletal system"},{"id":45989210,"name":"Health sciences/Anatomy/Nervous system"},{"id":45989211,"name":"Health sciences/Anatomy/Oral anatomy"},{"id":45989212,"name":"Health sciences/Anatomy/Urinary tract"}],"tags":[],"updatedAt":"2025-11-03T16:00:21+00:00","versionOfRecord":{"articleIdentity":"rs-6219785","link":"https://doi.org/10.1038/s41598-025-22437-w","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-10-28 15:57:15","publishedOnDateReadable":"October 28th, 2025"},"versionCreatedAt":"2025-03-24 
06:24:51","video":"","vorDoi":"10.1038/s41598-025-22437-w","vorDoiUrl":"https://doi.org/10.1038/s41598-025-22437-w","workflowStages":[]},"version":"v1","identity":"rs-6219785","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6219785","identity":"rs-6219785","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}