Performance of Artificial Intelligence and Large Language Models (LLMs) on Neurosurgical Board Examinations Across Text and Visual Modalities | Research Square

Rommi Kashlan, Hithardhi Duggireddy, Jacob J. Smith, Razan Faraj, and 3 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7744596/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

Background: Large language models (LLMs) are increasingly applied in clinical use and medical education, yet their reliability across both textual and visual modalities in highly specialized domains such as neurosurgery remains unclear. Prior evaluations have been limited to text-only questions or narrow subsets of models, leaving the role of multimodal reasoning poorly defined.
Methods: We benchmarked the latest LLMs, including 12 text-only and 10 multimodal systems, on 476 board-style neurosurgical questions spanning 13 subspecialties. Models were queried synchronously at temperature 0 under standardized prompting with no reasoning allowed, mirroring examination conditions. Accuracy was compared across text and visual modalities, stratified by subspecialty and imaging type. Robustness was assessed using latency, parsing failures, and ablation experiments withholding clinical vignette components.

Results: Text-only and multimodal models achieved nearly identical mean accuracies of 67.9% and 68.5%, indicating that visual inputs did not provide a consistent overall benefit. Performance differed markedly across individual models. Gemini 2.5 Pro, Grok, and GPT 5 exceeded 80% accuracy, approaching resident performance. GPT 4.0 and GPT 4.5 followed in the high 70s. Claude Sonnet 3.7 and Claude Opus 4.1 performed in the mid 70s, while MedGemma and Llama 4 clustered in the low 70s. DeepSeek R1V3 performed close to chance. On image-based questions, Gemini 2.5 Pro again led, while Grok, GPT 4.0, GPT 5, and Claude Opus clustered near 70% and Llama 4 models dropped to approximately 50%. Subspecialty analysis showed that visual input improved performance in neuroradiology, tumor, pediatrics, and spine. Trauma, vascular, and pain questions became less accurate with images, producing a bimodal pattern of benefit. Ablation experiments showed that removal of history produced the largest decline in accuracy (19.3% reduction), while withholding physical exam or lab data produced smaller effects (6.0% and 5.9%). A set of questions that no model could answer correctly accounted for 4% of the dataset. These questions were clustered in neuroradiology, vascular anatomy, and rare pediatric conditions. Operational findings highlighted practical issues. The most accurate models were often slower to respond. Latency ranged from 0.22 seconds to more than 27 seconds.
Parsing failures were uncommon in GPT 5, GPT 4.5, and Llama 4 but exceeded 13% in Claude Opus. Conclusions: Current LLMs can approach resident-level performance in structured neurosurgical domains and demonstrate selective benefits from visual input, but remain unreliable in anatomy-heavy, high-stakes contexts such as vascular and trauma. Their dependence on clinical history and susceptibility to systematic visual errors highlight the need for improved vision–language alignment before unsupervised clinical use. Until then, their role is best suited to supervised educational support with explicit safeguards. Health sciences/Health care Health sciences/Medical research Health sciences/Neurology Biological sciences/Neuroscience Figures Figure 1 Figure 2 Figure 3 Figure 4 Figure 5 Introduction Artificial intelligence (AI) and large language models (LLMs) are rapidly entering medical education and clinical practice, raising concerns about reliability, reasoning, quality, and hallucination risk that could compromise safe use by clinicians. The landscape is evolving quickly, characterized by frequent releases of both general-purpose and clinical models. 1,2,3,4 Electronic health record and imaging platforms are piloting AI for clinical summarization, risk stratification, and pharmaceutical recommendation, with limited deployments underway and broader rollouts anticipated very soon. 5 , 6 , 7 Despite these advances, peer-reviewed evidence demonstrating fit-for-purpose performance remains limited and heterogeneous. 1 , 4 , 7 In this context, rigorous and transparent evaluation is necessary to clarify where models succeed, where they fail, and why. Neurosurgery provides a stringent setting for such evaluation. Board examinations require multimodal integration of information, incorporating subjective metrics, such as clinical history, and objective metrics, including physical examination, laboratory data, and medical imaging. 
Through this process, both declarative knowledge and structured clinical reasoning are assessed. 8 Prior studies have shown that LLMs can approach passing thresholds on medical examinations, and can even generate helpful, albeit inconsistent, explanations for textbook-style questions. 9,10,11,12,13 However, these metrics paint an incomplete picture, as most work has been limited to text-only datasets, has had narrow model coverage, or has not probed robustness when information is limited or incomplete. These gaps are essential to study and especially critical in neurosurgery, which operates in a highly visual and anatomy-dense domain. This limitation is further emphasized in subspecialties such as neuroradiology and vascular neurosurgery, where model performance on image interpretation and anatomy-driven reasoning remains poorly understood. 14 As a result, the impact of information from variable clinical sub-categories and question archetypes on model performance is not well defined.

To address these gaps, we evaluated 12 contemporary LLMs on a multimodal set of neurosurgical board-style questions spanning 13 subspecialties. We quantified aggregate and subspecialty-specific accuracy, characterized recurrent error patterns linked to cognitive demands in specific domains (e.g., anatomic localization, angiographic interpretation, and recognition of rare entities), and evaluated robustness using prespecified ablations that withheld clinical vignette elements and introduced controlled errors within a single modality. Our primary objective was to provide a head-to-head performance comparison of contemporary LLMs in a clinically relevant context using multimodal neurosurgical items. Secondary analyses mapped error patterns to underlying cognitive domains and described operational reliability to determine where safeguards may be needed.
By situating model performance in this context, we aim not to substitute for human expertise but to delineate where LLMs may augment resident education, where they are prone to error, and what advances are needed before any supervised clinical role can be considered.

Methods

Data Source

We constructed a consolidated benchmark of 472 neurosurgical board-style questions requiring multimodal clinical reasoning. The primary source was the Primary Board Examination material from the Congress of Neurological Surgeons (CNS) SANS, which follows the American Board of Neurological Surgery (ABNS) primary examination framework. The exam includes the surgical and non-surgical management of central nervous system tumors involving the brain, spinal cord, hypophysis, and skull base; vascular malformations of the intracranial, extracranial, and spinal vasculature; and a wide range of traumatic injuries affecting cranial, spinal, and peripheral nerves. It further evaluates pain syndromes and functional disorders, pathologies of peripheral and autonomic nervous systems, and conditions involving supporting structures of the nervous system, including meninges, skull, and vertebral column. Histopathology relevant to these processes is also represented. Subspecialty classification and imaging requirements were extracted from the exam and confirmed by the authors. Subspecialty labels included pediatrics, tumor, spine, vascular, functional, general, trauma, peripheral nerve, and neuroradiology. No human or clinical data were used in the project, and the Emory University Institutional Review Board (IRB) determined the study to be exempt from approval.

Models

We evaluated 22 large language models (LLMs), including 12 text-only and 10 visual language models. Representative model families included OpenAI ChatGPT (GPT-5.0, GPT-4.5, GPT-4.0), Anthropic Claude (Sonnet 3.7, Opus 4.1), Google Gemini (2.5 Pro), Grok, Mistral L2, Llama4 (Maverick, Scout), DeepSeek R1V3, and MedGemma (4B, 27B).
Visual-capable models were assessed in both text-only and visual input modes. All models were accessed through their respective application programming interfaces (APIs), with temperature set to 0. Max tokens were 512 for text-only models and 2048 for multimodal models. We implemented a standardized pipeline that ensured consistency and minimized confounding effects across runs. For each model, a new session was initiated before every question, preventing memory carryover or context leakage between items. Each question was then issued in real time, with models queried simultaneously to control for temporal drift in API performance. Prompt formatting was harmonized across providers, ChatGPT-5.0 assistance was used to generate the prompt, and the same multiple-choice structure was preserved in all requests:

“I am going to give you a question to answer. The question may reference an image or not. Only return one of the answer choices in the exact spelling and nothing else. Do not return any output except for an answer choice. You can only choose 1 answer from those listed.
Answer choice 1
Answer choice 2
Answer choice 3
Answer choice 4
Answer choice 5”

We chose a no-reasoning prompt strategy for several reasons. It eliminates variability in response format across models, mirrors actual neurosurgery board exam conditions where only final answers are chosen, ensures 100% consistent answer extraction without interpretation bias, and prevents confounding between medical knowledge and explanation ability. Additionally, we ran a pilot (n=50 questions) comparing no reasoning (94.2% parsing success), chain-of-thought (87.8% parsing success, 15% longer responses), and confidence-based prompting (91.3% parsing success). Despite temperature=0, we observed 8.3% response variability and therefore implemented 3-sample majority voting for results. Across samples, 91.7% of questions showed perfect consistency.
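The 3-sample majority vote described above can be illustrated with a minimal sketch. The function names, the case-insensitive normalization, and the rule that a missing majority yields no answer are assumptions made for illustration; the manuscript does not specify these details.

```python
from collections import Counter


def normalize(answer):
    """Lower-case and trim an answer string; None marks a parsing failure."""
    return answer.strip().casefold() if answer else None


def majority_vote(samples):
    """Return (winner, needs_review) for the sampled answers to one question.

    winner is the normalized answer given by a strict majority of samples,
    or None when no answer reaches a majority (assumed rule, not from the paper).
    needs_review is True whenever the samples are not all identical.
    """
    normalized = [normalize(s) for s in samples]
    counts = Counter(a for a in normalized if a is not None)
    if not counts:
        return None, True  # every sample failed to parse
    top, n = counts.most_common(1)[0]
    winner = top if n > len(samples) / 2 else None
    needs_review = len(set(normalized)) > 1
    return winner, needs_review
```

In this sketch, a question whose three samples disagree in any way is flagged, which matches the idea of routing inconsistent items to manual review while still recording a majority answer when one exists.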
Inconsistent questions were flagged for manual review.

The primary outcome was model accuracy, defined as exact match between the model-predicted option and the reference key for 5-choice multiple-choice questions. Model responses were automatically parsed, logged, and adjudicated against the reference key using both in-script and manual extraction routines. For visual-capable models, performance on image-containing items was assessed separately. All models were run synchronously on September 13, 2025, at 21:53.

Experimental Framework

First, a text-only baseline was established after filtering API errors to ensure consistent coverage across models. Second, the accuracy of visual-capable models was compared on identical items across text and visual input modes, with delta performance calculated and further stratified by imaging modality, including CT, MRI, angiography, and clinical photographs. Third, subspecialty-level accuracy was computed for each model and contextualized against benchmark values from neurosurgery resident cohorts reported in the CNS Exam Portal. Then, an ablation study was conducted to quantify the contribution of clinical history, physical examination, and laboratory findings, defining performance drop as the difference in accuracy between the baseline and each corresponding ablation variant. Finally, “impossible questions” were defined as items answered incorrectly by at least 7 of the 10 visual models, and their distribution by subspecialty and imaging modality was explored to highlight consistent areas of failure.

Statistical Analysis

All analyses were performed in Python 3.11. Accuracy estimates were reported with 95% confidence intervals, and paired t-tests were applied to within-model comparisons (visual vs text). Welch’s t-tests were used for independent comparisons (baseline vs ablation variants). Statistical significance was defined as p<0.05, with thresholds of **p<0.01 and ***p<0.001 annotated in figures.
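As a rough illustration of the statistics named above, the sketch below computes a 95% interval for an accuracy and the test statistics for a paired and a Welch comparison using only the standard library. The Wilson score interval is an assumed choice, since the manuscript does not state which interval was used, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def wilson_ci(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def paired_t(visual, text):
    """Paired t statistic and degrees of freedom (same models, two modes)."""
    diffs = [v - t for v, t in zip(visual, text)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


def welch_t(baseline, ablated):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = variance(baseline) / len(baseline)
    b = variance(ablated) / len(ablated)
    t_stat = (mean(baseline) - mean(ablated)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (len(baseline) - 1) + b**2 / (len(ablated) - 1))
    return t_stat, df
```

The sketch stops at the test statistic and degrees of freedom; a p-value would come from the t distribution, for example via `scipy.stats.ttest_rel` for the paired case or `scipy.stats.ttest_ind(..., equal_var=False)` for Welch's test, if SciPy is available.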
We additionally recorded robustness metrics, including parsing failures (missing extracted answer letter) and inference latency (median and 90th percentile). No questions, stems, or images from the exam are reproduced, displayed, or distributed in this manuscript. All analyses were performed on model outputs relative to reference keys, and only aggregated results are reported.

Results

Unified Model Performance

Across more than 4,800 text responses and 1,000 visual responses covering 13 neurosurgical subspecialties, performance clustered tightly at the group level. As shown in Figure 1, across all questions, Gemini 2.5 Pro was the clear leader, with Grok4 close behind in the 80% accuracy range; GPT models were in the upper 70s, followed by Claude, Mistral, and MedGemma in the mid 70s. Llama models and DeepSeek R1V3 performed the worst. Similarly, on text-only questions, Gemini 2.5 Pro, Grok4, and GPT-5 were the most accurate systems, each exceeding 80% accuracy, while GPT-4.0 and GPT-4.5 followed close behind in the upper 70% range (Figure 2A). Claude Sonnet 3.7 and Claude Opus 4.1 occupied the mid-70% tier, with Llama-4 Maverick and MedGemma-27B slightly lower in the low 70s. At the low end, DeepSeek R1V3 performed poorly.

Visual vs. Textual Performance

When restricted to image-based questions, performance gaps widened further. Gemini 2.5 Pro again led, while Grok-4, GPT-4.0, GPT-5, and Claude Opus 4.1 formed a second cluster in the low-to-mid 70% range. Claude Sonnet 3.7 and GPT-4.5 trailed slightly, and the Llama-4 models showed the weakest visual reasoning, with accuracies barely above 50% (Figure 3A). When stratified by modality, neuroradiology questions showed the largest visual advantage, followed by tumor and pediatrics. Spine showed a moderate increase in performance. In contrast, trauma, vascular, and pain accuracy fell (Figure 3B). Robustness and efficiency are shown in Figure 3B–D.
Median response latency ranged from 0.221 seconds (Llama-4 Maverick) to 27.236 seconds (DeepSeek R1V3). Parsing failure rates (defined as absence of an extractable answer letter) ranged from 0 in GPT-5, GPT-4.5, and Llama-4 Maverick to 0.133 in Claude Opus 4.1. The scatter plot in Figure 2D and Figure 3D illustrates trade-off between accuracy and latency, with the most accurate models also generally among the slowest. Subspecialty-Level Performance Figure 4A presents a heatmap of accuracy across 13 subspecialties. Neuroradiology, tumor, pediatrics, and spine benefited most from visual information, while trauma, vascular, and pain questions showed performance losses. When benchmarked against resident performance standards (Figure 4B), LLMs remained below human accuracy in every subspecialty, with the largest gaps in vascular neurosurgery and trauma (NS). Additional stratifications are shown in Figure 4C – D, where some subspecialties showed modest gains with images while others showed degradation. Clinical Context and Ablation Study We evaluated the contribution of different clinical information types using ablation. Figure 5D-F shows that removal of history produced the largest mean performance drop (19.3%), followed by physical exam (6%) and laboratory findings (5.9%). Figure 5C ranks component importance by absolute contribution with history as the most important and physical exam/labs providing smaller but measurable benefits. Statistical comparisons demonstrated significance for baseline versus history removal (p<0.001), while exam and lab ablations showed weaker effects. Impossible Questions We identified 19 “impossible” questions representing 4.0% of the dataset. Category dominance was evident: 52.6% (10/19) were neuroradiology items which required radiographic interpretation; 21.1% (4/19) were vascular questions involving complex anatomy; and 10.5% (2/19) were pediatric questions on rare congenital conditions (Figure 5A). 
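The “impossible question” criterion (at least 7 of 10 visual models incorrect) and the category percentages reported above can be reproduced with a short sketch. The data layout, function names, and toy inputs are hypothetical; only the threshold and the 10/19, 4/19, and 2/19 counts come from the text.

```python
from collections import Counter


def impossible_items(results, min_wrong=7):
    """Return ids of items that at least min_wrong models answered incorrectly.

    results maps an item id to a list of per-model booleans (True = correct).
    """
    return [qid for qid, marks in results.items()
            if sum(not m for m in marks) >= min_wrong]


def share_by_label(item_ids, labels):
    """Percentage of the given items falling under each label, to one decimal."""
    total = len(item_ids)
    if total == 0:
        return {}
    counts = Counter(labels[q] for q in item_ids)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Reported counts from the text: 10, 4, and 2 of 19 impossible items.
reported = {name: round(100 * n / 19, 1)
            for name, n in [("neuroradiology", 10), ("vascular", 4), ("pediatrics", 2)]}
```

Here `reported` evaluates to 52.6, 21.1, and 10.5 percent, which agrees with the category shares stated in the text.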
Analysis by image modality revealed that 26.3% (5/19) of failures were CT-based questions emphasizing cross-sectional anatomy, 21.1% (4/19) were angiography requiring vascular interpretation, 21.1% (4/19) were MRI cases focusing on soft tissue pathology, and 15.8% (3/19) were real photographs, including intraoperative and autopsy specimens. Further inspection highlighted consistent question types that drive failure. Complex anatomy dominated, including vascular structures (6/19), cranial nerve identification (1/19), and venous system localization (2/19). Radiology interpretation was another major source of error, spanning CT and MRI pathology (8/19), plain radiographs (2/19), and histology slides (1/19). Rare pediatric conditions also featured prominently, including craniosynostosis syndromes and Chiari malformations. Figure 5B illustrates a representative cerebral angiogram with a 90% error rate across all models.

Discussion

In this benchmark of 12 contemporary LLMs on neurosurgical board-style questions, the leading models surpassed 80% accuracy on our dataset, approaching, but not matching, resident performance across subspecialties (neuropathology, vascular, pain, neuroradiology, etc.). Aggregate accuracy for text-only models centered near 68% and was comparable to multimodal performance with image-containing items (~69%). Visual input yielded heterogeneous effects by subspecialty domain and modality, and ablation experiments highlighted a strong dependence on clinical history for accurate answer prediction. A small but consequential subset of “impossible” items exposed persistent blind spots, most often in neuroradiology and vascular anatomy. 14

Comparison with Resident Performance

When benchmarked against CNS resident standards, models performed below human level in every subspecialty.
Differences in performance were smallest in the tumor and spine subspecialty domains, which are areas where pattern recognition and guideline-concordant decision pathways are more structured. The vascular and trauma subspecialty domains, which demand precise spatial reasoning, error-free interpretation of time-critical imaging, and integration of nuanced physiologic context, had the largest performance gaps. These findings suggest that contemporary LLMs can perform well on “structured” pathologies but remain below resident-level clinical reasoning on items where anatomy is complex and decisions hinge on subtle image features or evolving physiology.

Interpreting Subspecialty and Modality Variation

Visual input improved performance in neuroradiology, pediatrics, tumor, and spine, but reduced performance in trauma, vascular, and pain. This finding likely reflects the differing reliance on pattern recognition (e.g., classic radiographic features) versus spatial/temporal integration (e.g., vessel-territory mapping, injury kinetics) across tasks in these domains, resulting in a bimodal difficulty distribution. 15 Many common radiographic patterns are learnable from large public corpora, while a long tail of anatomy-heavy or rare-pathology items requires expertise in modality-specific conventions (e.g., angiographic views, venous variants, cranial nerve segmental anatomy) that are underrepresented in general web-scale training data. 16,17,18,19 Additionally, the underperformance on vascular and trauma items with images suggests that model visual encoders lack robustness to the domain-specific invariances used by clinicians, such as rotating mental 3D anatomy, reconciling multi-sequence MRI, or interpreting catheter-based angiography under variable projections.
20,21

Heterogeneous Impact of Imaging on Model Performance

As highlighted by the bimodality in performance with visual input, most models did not exhibit a uniform advantage with images; several performed worse on image-containing items. Likely contributors include medical domain shift in visual encoders trained predominantly on natural images rather than CT/MRI/DSA distributions, limited alignment between textual rationales and pixel-level features during pretraining, and sparse exposure to modality-specific formats, such as angiography, CT bone windows, or intraoperative photographs. 18,19,20 Notably, one model (Gemini 2.5 Pro) demonstrated consistent performance gains with visual input, suggesting that better vision-language alignment and medical-domain pretraining can yield greater accuracy, but, at this time, these benefits are limited across contemporary models. 22

Significance of Clinical History

Ablation experiments revealed a significant reduction in accuracy when clinical history was withheld (-19.3%), with a smaller impact on accuracy when physical examination (-6.0%) and laboratory results (-5.9%) were withheld (Fig. 4 C-E). This finding highlights how LLMs weigh narrative context, with long-form textual history providing key priors that narrow the diagnostic and management space. 15,17,18,19 Our analysis also suggests that brief physical exam findings or lab results may be underweighted without deliberate prompting or structured schemas. Practically, this means prompts that preserve important differential-shaping elements of the history are critical for reliability. 23

Error Modes and “Impossible” Questions

We identified a small set of “impossible” items, which included 19 questions (~4%) missed by ≥7/10 visual-capable models, clustered in neuroradiology (~53%), vascular (~21%), and pediatrics (~11%) (Fig. 4 A).
These items typically required detailed anatomic localization, interpretation of complex vascular studies, or recognition of rare pediatric conditions. When stratified by modality, the highest rate of failures spanned items with CT (~26%), angiography (~21%), MRI (~21%), and clinical photographs (~16%). These errors reflected limitations beyond simple mislabeling, including improper localization within vascular trees, confusion among adjacent cranial nerves or venous sinuses, and reliance on superficial image cues without integrating clinical priors. From a safety standpoint, this analysis highlights the scenarios in which unvetted model output could be most harmful without expert oversight.

Robustness, Reliability, and Practical Use

Operational metrics are vital prior to deployment in any educational or clinical context. We observed non-trivial parsing failures in several models and substantial latency variability. These issues translate into user friction and intermittent non-answers that would degrade an educational or clinical workflow. Together with the ablation sensitivity to missing history, our results argue for the following safeguards in deployment for a reliable educational workflow: strict output schemas, abstention when uncertainty is high, and prompt templates that explicitly include proper context (e.g., clinical history) to generate strong priors to narrow differentials. 17,18,19,23 For resident learning purposes, models may function as rapid feedback engines for multiple-choice practice, especially in tumor, spine, and common neuroradiology patterns. 24,25 For high-stakes preparation or clinical decision support, reliance should remain supervised and selective, with particular caution in vascular and trauma content.

Limitations

Our analysis is grounded in a single large exam corpus of multiple-choice questions and does not simulate oral boards where longitudinal case discussion, evolving data, and justification of management are central.
Image presentation through APIs may not fully match diagnostic workstation quality, and we did not evaluate sequential imaging or multi-view angiographic series. Vendor APIs evolve rapidly; although we synchronized queries to reduce temporal drift, results reflect model snapshots in time. “Resident” performance score was quantified using the CNS average mean scores for each subsection. Finally, we evaluated output accuracy rather than explanation fidelity, as models can sometimes arrive at the correct answer for the wrong reasons. Future directions Future studies should test models on oral board–style cases and evolving clinical scenarios, not just multiple-choice questions. Work is also needed to improve vision-language alignment with CT, MRI, and angiography to close gaps in vascular and trauma performance. Finally, models must be evaluated for explanation quality and built-in safeguards to ensure safe use in neurosurgical training and supervised care. Conclusion Current LLMs can perform competitively on many neurosurgical board–style questions and offer tangible value as study aids, particularly in structured domains. However, they remain below resident-level performance, are sensitive to history omission, and show inconsistent benefit from images, with pronounced deficits in neuroradiology’s hardest cases and in vascular and trauma content. Their variability in multimodal performance, dependence on complete history, and concentrated failures in high-stakes neurovascular domains preclude independent clinical decision support. Until visual encoders and medical alignment improve, deployment should remain supervised, with clear expectations about where models help, where they may mislead, and how to design prompts and guardrails that prioritize safety. Declarations Conflict of Interest The authors have no conflict of interest Consent to Participate: This manuscript does not involve human subjects. Consent to participate is not required. 
Human Ethics and Consent to Participate declarations: not applicable.

Author Contribution

R.K., H.D., and J.J.S. synthesized the literature review, collected data, wrote the manuscript, performed analysis, prepared figures, and guided revisions. R.F. collected data and guided revisions. S.G. guided revisions. J.A.G. and J.W.G. wrote the manuscript, guided revisions, and guided project direction.

Acknowledgement

We thank the Congress of Neurological Surgeons (CNS) for developing and maintaining the Self-Assessment Neurosurgery (SANS) Indications Exam, which served as the foundation for this benchmarking study. No material from the exams is distributed in this manuscript.

Data Availability

This study did not generate or analyze new patient, genomic, or experimental data. The neurosurgical board-style questions used in this benchmarking analysis were derived from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) Indications Exam and the American Board of Neurological Surgery (ABNS) primary examination framework. These materials are proprietary and not publicly available. All results are provided in aggregated form within the manuscript and supplementary materials. No raw exam questions, stems, or images are distributed.

Financial Disclosure

The authors have no financial disclosures.

Clinical Trial: Clinical trial number: not applicable.

References

Artsi Y, Sorin V, Glicksberg BS, et al. Challenges of Implementing LLMs in Clinical Practice: Perspectives. J Clin Med. 2025;14(17):6169. Published 2025 Sep 1. doi:10.3390/jcm14176169
Roustan D, Bastardot F. The Clinicians' Guide to Large Language Models: A General Perspective With a Focus on Hallucinations. Interact J Med Res. 2025;14:e59823. Published 2025 Jan 28. doi:10.2196/59823
Su H, Sun Y, Li R, et al. Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis. J Med Internet Res. 2025;27:e72062. Published 2025 Jun 9. doi:10.2196/72062
Han T, Nebelung S, Khader F, et al.
Medical large language models are susceptible to targeted misinformation attacks. NPJ Digit Med. 2024;7(1):288. Published 2024 Oct 23. doi:10.1038/s41746-024-01282-7
Rohren E, Ahmadzade M, Colella S, et al. Post-deployment Monitoring of AI Performance in Intracranial Hemorrhage Detection by ChatGPT. Acad Radiol. Published online August 11, 2025. doi:10.1016/j.acra.2025.07.055
Mac Donald CL, Yuh EL, Vande Vyvere T, et al. Neuroimaging Characterization of Acute Traumatic Brain Injury with Focus on Frontline Clinicians: Recommendations from the 2024 National Institute of Neurological Disorders and Stroke Traumatic Brain Injury Classification and Nomenclature Initiative Imaging Working Group. J Neurotrauma. 2025;42(13-14):1056-1064. doi:10.1089/neu.2025.0079
Kamel Rahimi A, Pienaar O, Ghadimi M, et al. Implementing AI in Hospitals to Achieve a Learning Health System: Systematic Review of Current Enablers and Barriers. J Med Internet Res. 2024;26:e49655. Published 2024 Aug 2. doi:10.2196/49655
Lin JJ, Klopfenstein J, Maldonado A, McCall T, Tsung A, Dinh DH. We Tabulated and Organized American Board of Neurological Surgeons Primary Exam Keywords (2015-2023) so You Don't Have to. Cureus. 2023;15(5):e39402. Published 2023 May 23. doi:10.7759/cureus.39402
Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery. 2023;93(6):1353-1365. doi:10.1227/neu.0000000000002632
Hopkins BS, Nguyen VN, Dallas J, et al. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. J Neurosurg. 2023;139(3):904-911. Published 2023 Mar 24. doi:10.3171/2023.2.JNS23419
Stengel FC, Stienen MN, Ivanov M, et al. Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues. Brain Spine. 2024;4:102765. Published 2024 Feb 13. doi:10.1016/j.bas.2024.102765
McNulty AM, Valluri H, Gajjar AA, Custozzo A, Field NC, Paul AR. Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: A comparative analysis. J Clin Neurosci. 2025;134:111097. doi:10.1016/j.jocn.2025.111097
Guerra GA, Hofmann H, Sobhani S, et al. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023;179:e160-e165. doi:10.1016/j.wneu.2023.08.042
Szmyd B, Podstawka M, Wiśniewski K, et al. AI-Driven Innovations in Neuroradiology and Neurosurgery: Scoping Review of Current Evidence and Future Directions. Cancers (Basel). 2025;17(16):2625. Published 2025 Aug 11. doi:10.3390/cancers17162625
Kumar V, et al. Large-Language-Models (LLM)-Based AI Chatbots: Architecture, In-Depth Analysis and Their Performance Evaluation. In: Santosh K, et al. Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2023. Communications in Computer and Information Science, vol 2027. Springer, Cham; 2024. doi:10.1007/978-3-031-53085-2_20
Jiang Z, Xu FF, Araki J, Neubig G. How Can We Know What Language Models Know? Transactions of the Association for Computational Linguistics. 2020;8:423-438. doi:10.1162/tacl_a_00324
Zhao W, Zhou K, Li J, et al. A Survey of Large Language Models. 2023. doi:10.48550/arXiv.2303.18223
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940. doi:10.1038/s41591-023-02448-8
Shah NH, Entwistle D, Pfeffer MA. Creation and Adoption of Large Language Models in Medicine. JAMA. 2023;330(9):866-869. doi:10.1001/jama.2023.14217
Yin S, Fu C, Zhao S, et al. A survey on multimodal large language models. National Science Review. 2024;11(12):nwae403. doi:10.1093/nsr/nwae403
Krupinski EA. The role of perception in imaging: past and future. Semin Nucl Med. 2011;41(6):392-400. doi:10.1053/j.semnuclmed.2011.05.002
Khan S, Biswas MR, Murad A, Ali H, Shah Z. An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging. In: 2024 IEEE International Conference on Information Reuse and Integration for Data Science (IRI); San Jose, CA, USA; 2024:234-239. doi:10.1109/IRI62200.2024.00056
Kulkarni N, Tupsakhare P. Crafting Effective Prompts: Enhancing AI Performance through Structured Input Design. Journal of Recent Trends in Computer Science and Engineering. 2024;12:1-10. doi:10.70589/JRTCSE.2024.5.1
Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The Role of Large Language Models in Medical Education: Applications and Implications. JMIR Med Educ. 2023;9:e50945. doi:10.2196/50945
Presbitero P, Gasparini GL, Pagnotta P. Images in cardiovascular medicine. Intra-arterial thrombolysis for left middle cerebral artery embolic stroke during coronary angiography. Circulation. 2006;113(5):e64-e66. doi:10.1161/CIRCULATIONAHA.105.552802

Additional Declarations
No competing interests reported.
Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7744596","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":529999662,"identity":"6aebbcad-bcd5-4b50-9a77-7b21641e18da","order_by":0,"name":"Rommi Kashlan","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA90lEQVRIiWNgGAWjYBAC+wYgkQCEDBLMQGYFTJwNtxaDA3AtjEAtZ4jVwgDTwthGjJbj3akbHjCkycvPbmz8XDjPJppf+owBw4eyw7j90nN2240EhhzDDXcONkvP3JaWO7Mvx4BxxjncWuwkckFaKhg3SCQ2SPNuO5y74QyPATNvG24txlAt9vNnJDb/5p3zH6LlLx4thjPAWnISG24ktknzNhyAaGHEo8XgDMgvBmnJG4BarHmOJefO7GErONhzLh23luO9227+qEi2nT8j+fBtnhq73H4e5o0PfpRZ49QC1YjC44BGFgmA/QGpOkbBKBgFo2B4AwCBTF76HdprtAAAAABJRU5ErkJggg==","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":true,"prefix":"","firstName":"Rommi","middleName":"","lastName":"Kashlan","suffix":""},{"id":529999663,"identity":"60a995f4-6d2c-4077-9783-6797e5a8464d","order_by":1,"name":"Hithardhi Duggireddy","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Hithardhi","middleName":"","lastName":"Duggireddy","suffix":""},{"id":529999664,"identity":"5a1835e8-cf80-4f7d-836f-1057ddf31777","order_by":2,"name":"Jacob J. 
Smith","email":"","orcid":"","institution":"Imperial College London School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Jacob","middleName":"J.","lastName":"Smith","suffix":""},{"id":529999665,"identity":"796a7d5b-dc18-487e-bfeb-ac71a3cfc5ec","order_by":3,"name":"Razan Faraj","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Razan","middleName":"","lastName":"Faraj","suffix":""},{"id":529999667,"identity":"4d0d9efa-174a-40e6-926c-fe8ea4342a16","order_by":4,"name":"Sandra Gattas","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Sandra","middleName":"","lastName":"Gattas","suffix":""},{"id":529999668,"identity":"e7b62559-817e-49c7-9d7f-694eb02fe940","order_by":5,"name":"Jonathan A. Grossberg","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Jonathan","middleName":"A.","lastName":"Grossberg","suffix":""},{"id":529999669,"identity":"debe82e9-5eaa-4e30-8657-7e8dc45b9d62","order_by":6,"name":"Judy W. 
Gichoya","email":"","orcid":"","institution":"Emory University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Judy","middleName":"W.","lastName":"Gichoya","suffix":""}],"badges":[],"createdAt":"2025-09-29 18:08:22","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7744596/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7744596/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":94047294,"identity":"bc5ac9e8-8227-4d40-9234-03d2a810f5eb","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":7039341,"visible":true,"origin":"","legend":"","description":"","filename":"revisedllmmanuscript.docx","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/2ab272ac37ca3489dd1061e0.docx"},{"id":94047288,"identity":"3d13d7aa-8d94-4ed5-98bc-2f599e6fc00a","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":10566,"visible":true,"origin":"","legend":"","description":"","filename":"64d4b4490d6e404ea383495081ba92ee.json","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/c1d2f441cd90d50c2fa3890e.json"},{"id":94047299,"identity":"7a50053c-0a39-4f6d-a287-0f7bd6f8b891","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":86898,"visible":true,"origin":"","legend":"","description":"","filename":"64d4b4490d6e404ea383495081ba92ee1enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/1ac3cb1a2b8b20b155d017ae.xml"},{"id":94047302,"identity":"ce5d7259-420d-41ce-96aa-fa575e9cb678","added_by":"auto","created_at":"2025-10-21 
23:13:57","extension":"png","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":495022,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/14a51c9b3396d1323ab720bb.png"},{"id":94047307,"identity":"29e515db-5e57-4fe0-bc93-3a01b75a3998","added_by":"auto","created_at":"2025-10-21 23:13:58","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1159084,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/d445201d0b5da06a07dac7f7.png"},{"id":94047297,"identity":"956ecc5f-a14d-4b65-b572-f4bd8293c2ea","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1036252,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/4ca7b27705d8b72c12c7bee5.png"},{"id":94047298,"identity":"72181407-90a5-405f-b6a0-35f73de1aec2","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1525878,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/5eca6a47c10080ff445ca6c3.png"},{"id":94048208,"identity":"9c88ee11-167b-4bd7-888c-9c7f1e1a7f02","added_by":"auto","created_at":"2025-10-21 
23:21:57","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":2764276,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/719f167a17bb6e8815d5219a.png"},{"id":94047303,"identity":"ed529e26-f90b-4a97-9527-0b6ebb355c60","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":117548,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/459e8096f7ed4ae8aa7e509e.png"},{"id":94047300,"identity":"e3ea14c1-5f21-4997-9b76-cc3cfea090c5","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":221106,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/765d0cd97f0a1d9b28e7cfb8.png"},{"id":94048209,"identity":"deed612c-ee85-4991-a03d-7421f0f86c37","added_by":"auto","created_at":"2025-10-21 23:21:58","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":207305,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/2fcc0762b791f58e93fe383e.png"},{"id":94047295,"identity":"55010ffa-4ca4-45b6-a9b9-37b6b6a3c1c5","added_by":"auto","created_at":"2025-10-21 
23:13:57","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":331396,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/a3811e9d11bfdc8843c60026.png"},{"id":94047305,"identity":"b08acb22-648a-4b0e-80d2-9369f63722f8","added_by":"auto","created_at":"2025-10-21 23:13:58","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":434874,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/21e87d01d4fed3920aa0b064.png"},{"id":94047292,"identity":"9f8bdd79-a744-4f1d-a115-0f9cad633f1d","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"xml","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":81697,"visible":true,"origin":"","legend":"","description":"","filename":"64d4b4490d6e404ea383495081ba92ee1structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/c620f4c5e3907da62fcbc47b.xml"},{"id":94047301,"identity":"28ddb4ee-fdec-4308-b6bd-f9ac7806632c","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"html","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":97243,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/0a90fdd07919d85f2eb84694.html"},{"id":94047289,"identity":"f5c34a66-2ed9-41be-984f-b83f5041a516","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":495022,"visible":true,"origin":"","legend":"\u003cp\u003eAccuracy across all question subsets including text LLMs and visual input 
LLMs.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/93ab6195f6970833d4452c8a.png"},{"id":94047290,"identity":"0df32fbf-caa6-46f8-ad97-1b74d823ef04","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":1159084,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eText-only model performance and operational reliability on neurosurgical board–style questions.\u003c/strong\u003e\u003cbr\u003e\n \u003cstrong\u003eA.\u003c/strong\u003e Accuracy by model with 95% CIs on the consolidated benchmark (4,836 text-only responses across 476 questions in 13 subspecialties; for multimodal families, panel A reflects their visual runs on image-containing items). Mean accuracy was 0.679 for text-only models and 0.685 for visual runs (Δ≈+0.005). Top text model: Gemini 2.5 Pro, 84.4% (95% CI 0.805–0.883); lowest: DeepSeek R1V3, 31.7% (0.269–0.366). \u003cstrong\u003eB.\u003c/strong\u003e Response parsing failure rate (no extractable single-letter answer), ranging from 0 (GPT-5, GPT-4.5, Llama-4 Maverick) to 0.133 (Claude Opus 4.1). \u003cstrong\u003eC.\u003c/strong\u003e Latency distributions by model; median latency spanned 0.221 s (Llama-4 Maverick) to 27.236 s (DeepSeek R1V3). 
\u003cstrong\u003eD.\u003c/strong\u003eAccuracy versus median latency, with points colored by parsing-failure rate, illustrating that higher-accuracy models were generally slower.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/145d84b9af37776cecc1df27.png"},{"id":94047293,"identity":"265bcbb7-09a6-4690-8497-708fa7f3d266","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1036252,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisual performance across models and imaging modalities.\u003c/strong\u003e\u003cbr\u003e\n \u003cstrong\u003eA.\u003c/strong\u003e Visual‐only accuracy by multimodal model with 95% CIs. Performance was heterogeneous: Gemini 2.5 Pro achieved 84.6%, followed by Grok-4 (74.0%) and ChatGPT-4.0 (71.2%); ChatGPT-5, Claude Opus 4.1, and MedGemma-4B clustered near ~70%; Claude Sonnet 3.7 and ChatGPT-4.5 were 69.2% and 67.3%, respectively; Llama-4 Scout and Maverick were 54.8% and 52.9%. \u003cstrong\u003eB.\u003c/strong\u003eVisual accuracy by image modality (n shown above bars); higher performance appears on MRI/nuclear studies, with lower accuracy on angiography and real photographs. \u003cstrong\u003eC.\u003c/strong\u003e Change in accuracy (visual − text) by model with significance markers, showing that several models gained with images while others showed minimal benefit. 
\u003cstrong\u003eD.\u003c/strong\u003e Visual accuracy versus median latency, colored by the visual–text delta, illustrating persistent heterogeneity and a tendency for higher-accuracy models to incur longer response times.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/6dead662e0164e12a239f724.png"},{"id":94047304,"identity":"7ab32f83-0e33-4a95-8921-d8c261fa3acd","added_by":"auto","created_at":"2025-10-21 23:13:57","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1525878,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSubspecialty performance landscape and comparison with neurosurgery residents.\u003c/strong\u003e\u003cbr\u003e\n \u003cstrong\u003eA.\u003c/strong\u003e Heatmap of model accuracy across 13 subspecialties, showing wide variability by model and topic. \u003cstrong\u003eB.\u003c/strong\u003eBenchmark against resident standards: LLMs underperform humans in every subspecialty; largest deficits are in vascular neurosurgery, with other sizable gaps in neuroradiology and trauma. “ns” denotes a non-significant difference for that subspecialty in this cohort. \u003cstrong\u003eC.\u003c/strong\u003e Accuracy on image-containing items by model × subspecialty; relative gains are most evident in neuroradiology, tumor, pediatrics, and spine, whereas trauma, vascular, and pain show decrements with images. 
\u003cstrong\u003eD.\u003c/strong\u003e Human–LLM performance gap (percentage points) by subspecialty, highlighting where deficits are most pronounced and where parity or small LLM advantages occur in limited strata.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/d8f2713530e473c29d459f7c.png"},{"id":94048210,"identity":"85a1b62d-69e1-4e35-bb34-c8c193432c2b","added_by":"auto","created_at":"2025-10-21 23:21:58","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":2764276,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eChallenging items and the importance of clinical context.\u003c/strong\u003e\u003cbr\u003e\n \u003cstrong\u003eA.\u003c/strong\u003e Failure rate by image modality for “impossible” items (missed by ≥7/10 visual-capable models); bars show percentage with counts (failures/total items per modality). \u003cstrong\u003eB.\u003c/strong\u003eRepresentative sample cerebral angiogram which similar picture on exam was associated with a 90% error rate across models. \u003cstrong\u003eC.\u003c/strong\u003e Aggregate component-importance ranking (|baseline − without component|) highlights the dominant contribution of history (0.193) relative to exam (0.060) and labs (0.059). Overall, hard failures concentrated in neuroradiology and vascular content, while model accuracy was most sensitive to the presence of clinical history. 
\u003cstrong\u003eD-F.\u003c/strong\u003e Ablation analysis showing the mean accuracy drop when vignette components are removed (history, exam, labs) for each model; history removal produced the largest decrement (mean −19.3%), with smaller effects for exam (−6.0%) and labs (−5.9%); history vs baseline was significant (***p\u0026lt;0.001).\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/1777c7376d2ff590905d2728.png"},{"id":94470797,"identity":"03abdf73-01b4-440e-a154-4c43bc64337c","added_by":"auto","created_at":"2025-10-27 15:33:58","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":6710071,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7744596/v1/ce904131-ea9a-431d-8029-3f3c5578a871.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Performance of Artificial Intelligence and Large Language Models (LLMs) on Neurosurgical Board Examinations Across Text and Visual Modalities","fulltext":[{"header":"Introduction","content":"\u003cp\u003eArtificial intelligence (AI) and large language models (LLMs) are rapidly entering medical education and clinical practice, raising concerns about reliability, reasoning, quality, and hallucination risk that could compromise safe use by clinicians. The landscape is evolving quickly, characterized by frequent releases of both general-purpose and clinical models. 
\u003csup\u003e1,2,3,4\u003c/sup\u003e Electronic health record and imaging platforms are piloting AI for clinical summarization, risk stratification, and pharmaceutical recommendation, with limited deployments underway and broader rollouts anticipated very soon.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e Despite these advances, peer-reviewed evidence demonstrating fit-for-purpose performance remains limited and heterogeneous.\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e In this context, rigorous and transparent evaluation is necessary to clarify where models succeed, where they fail, and why.\u003c/p\u003e\u003cp\u003eNeurosurgery provides a stringent setting for such evaluation. Board examinations require multimodal integration of information, incorporating subjective metrics, such as clinical history, and objective metrics, including physical examination, laboratory data, and medical imaging. Through this process, both declarative knowledge and structured clinical reasoning are assessed.\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e Prior studies have shown that LLMs can approach passing thresholds on medical examinations, and can even generate helpful, albeit inconsistent, explanations for textbook-style questions. \u003csup\u003e9,10,11,12,13\u003c/sup\u003e However, these metrics paint an incomplete picture, as most work has been limited to text-only datasets, had narrow model coverage, or have not probed robustness when information is limited or incomplete. 
These gaps are essential to study and especially critical in neurosurgery, which operates in a highly visual and anatomy-dense domain. This limitation is further emphasized in subspecialties such as neuroradiology and vascular neurosurgery, where model performance on image interpretation and anatomy-driven reasoning remains poorly understood.\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e As a result, the impact of information from variable clinical sub-categories and question archetypes on model performance is not well defined.\u003c/p\u003e\u003cp\u003eTo address these gaps, we evaluated 12 contemporary LLMs on a multimodal set of neurosurgical board-style questions spanning 13 subspecialties. We quantified aggregate and subspeciality-specific accuracy, characterized recurrent error patterns linked to cognitive demands in specific domains (e.g., anatomic localization, angiographic interpretation, and recognition of rare entities), and evaluated robustness using prespecified ablation that withheld clinical vignette elements and introduced controlled errors within a single modality. Our primary objective was to provide a head-to-head performance comparison of contemporary LLMs in a clinically relevant context using multimodal neurosurgical items. Secondary analyses mapped error patterns to underlying cognitive domains and described operational reliability to determine where safeguards may be needed. 
By situating model performance in this context, we aim not to substitute for human expertise but to delineate where LLMs may augment resident education, where they are prone to error, and what advances are needed before any supervised clinical role can be considered.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e\u003cem\u003eData Source\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe constructed a consolidated benchmark of 472 neurosurgical board\u0026ndash;style questions requiring multimodal clinical reasoning. The primary source was from the Primary Board Examination from the Congress for Neurological Surgeons (CNS) SANS American Board of Neurological Surgery (ABNS). The exam includes the surgical and non-surgical management of central nervous system tumors involving the brain, spinal cord, hypophysis, and skull base; vascular malformations of the intracranial, extracranial, and spinal vasculature; and a wide range of traumatic injuries affecting cranial, spinal, and peripheral nerves. It further evaluates pain syndromes and functional disorders, pathologies of peripheral and autonomic nervous systems, and conditions involving supporting structures of the nervous system, including meninges, skull, and vertebral column. Histopathology relevant to these processes is also represented. Subspecialty classification and imaging requirements were extracted from the exam and confirmed by authors. Subspecialty labels included pediatrics, tumor, spine, vascular, functional, general, trauma, peripheral nerve, and neuroradiology. No human data or clinical data was utilized in the project, and Emory University Institutional Review Board (IRB) determined the study exempt from approval.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eModels\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe evaluated 22 large language models (LLMs), including 12 text-only and 10 visual language models. 
Representative model families included OpenAI ChatGPT (GPT-5.0, GPT-4.5, GPT-4.0), Anthropic Claude (Sonnet 3.7, Opus 4.1), Google Gemini (2.5 Pro), Grok, Mistral L2, Llama4 (Maverick, Scout), DeepSeek R1V3, and MedGemma (4B, 27B). Visual-capable models were assessed in both text-only and visual input modes. All models were accessed through their respective application programming interfaces (APIs), temperature set to 0. Max tokens were 512 for text only models and 2048 for multimodal models. We implemented a standardized pipeline that ensured consistency and minimized confounding effects across runs. For each model, a new session was initiated before every question, preventing memory carryover or context leakage between items. Each question was then issued in real time, with models queried simultaneously to control for temporal drift in API performance. Prompt formatting was harmonized across providers, ChatGPT-5.0 AI assistance was utilized for best prompt generation, and the same multiple-choice structure was preserved in all requests:\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026ldquo;I am going to give you a question to answer. The question may reference an image or not. Only return one of the answer choices in the exact spelling and nothing else. Do not return any output except for an answer choice. You can only choose 1 answer from those listed.\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003eAnswer choice 1\u003c/li\u003e\n \u003cli\u003eAnswer choice 2\u003c/li\u003e\n \u003cli\u003eAnswer choice 3\u003c/li\u003e\n \u003cli\u003eAnswer choice 4\u003c/li\u003e\n \u003cli\u003eAnswer choice 5\u0026rdquo;\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eWe chose a no-reasoning prompt strategy for several reasons. 
It eliminates variability in response format across, mirrors actual neurosurgery board exam conditions where only final answers are chosen, ensures 100% consistent answer extraction without interpretation bias, and prevents confounding between medical knowledge and explanation ability. Additionally, we ran a pilot (n=50 questions) comparing no reasoning (94.2% parsing success), chain-of-thought (87.8% parsing success, 15% longer responses), and confidence-based (91.3% parsing success). Despite temperature=0, we observed 8.3% response variability and therefore implemented 3-sample majority voting for results. 91.7% of questions showed perfect consistency across samples. Inconsistent questions flagged for manual review\u003c/p\u003e\n\u003cp\u003eThe primary outcome was model accuracy, defined as exact match between the model-predicted option and the reference key for 5-choice multiple-choice questions. Model responses were automatically parsed, logged, and adjudicated against the reference key using both in-script and manual extraction routines. For visual-capable models, performance on image-containing items was assessed separately. All models were synchronously run September 13, 2025, 21:53.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eExperimental Framework\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eFirst, a text-only baseline was established after filtering API errors to ensure consistent coverage across models. Second, the accuracy of visual-capable models was compared on identical items across text and visual input modes, with delta performance calculated and further stratified by imaging modality, including CT, MRI, angiography, and clinical photographs. Third, subspecialty-level accuracy was computed for each model and contextualized against benchmark values from neurosurgery resident cohorts reported in the CNS Exam Portal. 
Then, an ablation study was conducted to quantify the contribution of clinical history, physical examination, and laboratory findings, defining performance drop as the difference in accuracy between the baseline and each corresponding ablation variant. Finally, \u0026ldquo;impossible questions,\u0026rdquo; were defined as items with at least 7/10 incorrect visual models, and explored their distribution by subspecialty and imaging modality to highlight consistent areas of failure\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eStatistical Analysis\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eAll analyses were performed in Python 3.11. Accuracy estimates were reported with 95% confidence intervals, and paired t-tests were applied to within-model comparisons (visual vs text). Welch\u0026rsquo;s t-tests were used for independent comparisons (baseline vs ablation variants). Statistical significance was defined as p\u0026lt;0.05, with thresholds of **p\u0026lt;0.01 and ***p\u0026lt;0.001 annotated in figures. We additionally recorded robustness metrics, including parsing failures (missing extracted answer letter) and inference latency (median and 90th percentile)\u003c/p\u003e\n\u003cp\u003eNo questions, stems, or images from the exam are reproduced, displayed, or distributed in this manuscript. All analyses were performed on model outputs relative to reference keys, and only aggregated results are reported.\u003c/p\u003e"},{"header":"Results","content":"\u003ch3\u003e\u003cem\u003eUnified Model Performance\u003c/em\u003e\u003c/h3\u003e\n\u003cp\u003eAcross more than 4,800 text responses and 1,000 visual responses covering 13 neurosurgical subspecialties, performance clustered tightly at the group level. Shown in Figure 1, across all questions, Gemini 2.5 Pro was the clear winner with Grok4 coming in close in the 80s accuracy, GPT models were in the upper 70s range, followed by Claude, mistral, and Medgemma in the mid 70s. 
Llama models and DeepSeek R1V3 performed the worst.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSimilarly, in text, Gemini 2.5 Pro, Grok4, and GPT-5 were the most accurate systems, each exceeding 80% accuracy, while GPT-4.0 and GPT-4.5 followed close behind in the upper 70% range (Figure2A). Claude Sonnet 3.7 and Claude Opus 4.1 occupied the mid-70% tier, with Llama-4 Maverick and MedGemma-27B slightly lower in the low 70s. At the low end, DeepSeek R1V3 performed poorly.\u003c/p\u003e\n\u003ch3\u003e\u003cem\u003eVisual vs. Textual Performance\u003c/em\u003e\u003c/h3\u003e\n\u003cp\u003eWhen restricted to image-based questions, performance gaps widened further. Gemini 2.5 Pro again led while Grok-4, GPT-4.0, GPT-5, and Claude Opus 4.1 formed a second cluster in the low-to-mid 70% range. Claude Sonnet 3.7 and GPT-4.5 trailed slightly, and the Llama-4 models showed the weakest visual reasoning, with accuracies barely above 50% (Figure 3A).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhen stratified by modality neuroradiology questions showed the largest visual advantage followed by tumor and pediatrics. Spine showed moderate increased performance. In contrast, trauma, vascular, and pain accuracy fell (Figure 3B).\u003c/p\u003e\n\u003cp\u003eRobustness and efficiency are shown in Figure 3B\u0026ndash;D. Median response latency ranged from 0.221 seconds (Llama-4 Maverick) to 27.236 seconds (DeepSeek R1V3). Parsing failure rates (defined as absence of an extractable answer letter) ranged from 0 in GPT-5, GPT-4.5, and Llama-4 Maverick to 0.133 in Claude Opus 4.1. The scatter plot in Figure 2D and Figure 3D illustrates trade-off between accuracy and latency, with the most accurate models also generally among the slowest.\u003c/p\u003e\n\u003ch3\u003e\u003cem\u003eSubspecialty-Level Performance\u003c/em\u003e\u003c/h3\u003e\n\u003cp\u003eFigure 4A presents a heatmap of accuracy across 13 subspecialties. 
Neuroradiology, tumor, pediatrics, and spine benefited most from visual information, while trauma, vascular, and pain questions showed performance losses. When benchmarked against resident performance standards (Figure 4B), LLMs remained below human accuracy in every subspecialty, with the largest gaps in vascular neurosurgery and trauma (NS). Additional stratifications are shown in Figure 4C\u003cstrong\u003e\u0026ndash;\u003c/strong\u003eD, where some subspecialties showed modest gains with images while others showed degradation.\u003c/p\u003e\n\u003ch3\u003e\u003cem\u003eClinical Context and Ablation Study\u003c/em\u003e\u003c/h3\u003e\n\u003cp\u003eWe evaluated the contribution of different clinical information types using ablation. Figure 5D-F shows that removal of history produced the largest mean performance drop (19.3%), followed by physical exam (6%) and laboratory findings (5.9%). Figure 5C ranks component importance by absolute contribution with history as the most important and physical exam/labs providing smaller but measurable benefits. Statistical comparisons demonstrated significance for baseline versus history removal (p\u0026lt;0.001), while exam and lab ablations showed weaker effects.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eImpossible Questions\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eWe identified 19 \u0026ldquo;impossible\u0026rdquo; questions representing 4.0% of the dataset. 
A few categories dominated: 52.6% (10/19) were neuroradiology items that required radiographic interpretation; 21.1% (4/19) were vascular questions involving complex anatomy; and 10.5% (2/19) were pediatric questions on rare congenital conditions (Figure 5A).\u003c/p\u003e\n\u003cp\u003eAnalysis by image modality revealed that 26.3% (5/19) of failures were CT-based questions emphasizing cross-sectional anatomy, 21.1% (4/19) were angiography requiring vascular interpretation, 21.1% (4/19) were MRI cases focusing on soft tissue pathology, and 15.8% (3/19) were real photographs, including intraoperative and autopsy specimens. Further inspection highlighted consistent question types that drive failure. Complex anatomy dominated, including vascular structures (6/19), cranial nerve identification (1/19), and venous system localization (2/19). Radiology interpretation was another major source of error, spanning CT and MRI pathology (8/19), plain radiographs (2/19), and histology slides (1/19). Rare pediatric conditions also featured prominently, including craniosynostosis syndromes and Chiari malformations. Figure 5B illustrates a representative cerebral angiogram with a 90% error rate across all models.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this benchmark of 12 contemporary LLMs on neurosurgical board-style questions, the leading models surpassed 80% accuracy on our dataset, approaching, but not matching, resident performance across subspecialties (neuropathology, vascular, pain, neuroradiology, etc.). Aggregate accuracy for text-only models centered near 68% and was comparable to multimodal performance with image-containing items (~\u0026thinsp;69%). Visual input yielded heterogeneous effects by subspecialty domain and modality, and ablation experiments highlighted a strong dependence on clinical history for accurate answer prediction. 
A small but consequential subset of \u0026ldquo;impossible\u0026rdquo; items exposed persistent blind spots, most often in neuroradiology and vascular anatomy.\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003eComparison with Resident Performance\u003c/h2\u003e\u003cp\u003eWhen benchmarked against CNS resident standards, models scored below human performance in every subspecialty. Differences in performance were smallest in tumor and spine subspecialty domains, which are areas where pattern recognition and guideline-concordant decision pathways are more structured. Vascular and trauma subspecialty domains, which demand precise spatial reasoning, error-free interpretation of time-critical imaging, and integration of nuanced physiologic context, had the largest performance gaps. These findings suggest that contemporary LLMs can perform well on \u0026ldquo;structured\u0026rdquo; pathologies but remain below resident-level clinical reasoning on items where anatomy is complex and decisions hinge on subtle image features or evolving physiology.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003eInterpreting Subspecialty and Modality Variation\u003c/h2\u003e\u003cp\u003eVisual input improved performance in neuroradiology, pediatrics, tumor, and spine, but reduced performance in trauma, vascular, and pain. 
This finding likely reflects the differential utilization of pattern recognition (e.g., classic radiographic features) versus spatial/temporal integration (e.g., vessel-territory mapping, injury kinetics) of tasks across these domains resulting in a bimodal difficulty distribution.\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e Many common radiographic patterns are learnable from large public corpora, while a long tail of anatomy-heavy or rare-pathology items requires expertise in modality-specific conventions (e.g., angiographic views, venous variants, cranial nerve segmental anatomy) that are underrepresented in general web-scale training data.\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e Additionally, the underperformance on vascular and trauma items with images suggests that model visual encoders lack robustness to the domain-specific invariances used by clinicians, such as rotating mental 3D anatomy, reconciling multi-sequence MRI, or interpreting catheter-based angiography under variable projections.\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003eHeterogeneous Impact of Imaging on Model Performance\u003c/h2\u003e\u003cp\u003eAs highlighted by the bimodality in performance with visual input, most models did not exhibit a uniform advantage with images; several performed worse on image-containing items. 
Likely contributors include medical domain shift in visual encoders trained predominantly on natural images rather than CT/MRI/DSA distributions; limited alignment between textual rationales and pixel-level features during pretraining; and sparse exposure to modality-specific formats, such as angiography, CT bone windows, or intraoperative photographs.\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e Notably, one model (Gemini 2.5 Pro) demonstrated consistent performance gains with visual input, suggesting that better vision-language alignment and medical-domain pretraining can yield greater accuracy, but, at this time, these benefits are limited across contemporary models.\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003eSignificance of Clinical History\u003c/h2\u003e\u003cp\u003eAblation experiments revealed a significant reduction in accuracy when clinical history was withheld (-19.3%), with a smaller impact on accuracy when physical examination (-6.0%) and laboratory results (-5.9%) were withheld (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC-E). 
This finding highlights how LLMs weigh narrative context, with long-form textual history providing key priors that narrow the diagnostic and management space.\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e Our analysis also suggests that brief physical exam findings or lab results may be underweighted without deliberate prompting or structured schemas. Practically, this means prompts that preserve important differential-shaping elements of the history are critical for reliability.\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003eError Modes and \u0026ldquo;impossible\u0026rdquo; questions\u003c/h2\u003e\u003cp\u003eWe identified a small set of \u0026ldquo;impossible\u0026rdquo; items, which included 19 questions (~\u0026thinsp;4%) missed by \u0026ge;\u0026thinsp;7/10 visual-capable models, clustered in neuroradiology (~\u0026thinsp;53%), vascular (~\u0026thinsp;21%), and pediatrics (~\u0026thinsp;11%) (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). These items typically required detailed anatomic localization, interpretation of complex vascular studies, or recognition of rare pediatric conditions. When stratified by modality, the highest rate of failures spanned items with CT (~\u0026thinsp;26%), angiography (~\u0026thinsp;21%), MRI (~\u0026thinsp;21%), and clinical photographs (~\u0026thinsp;16%). 
These errors reflected limitations beyond simple mislabeling, including improper localization within vascular trees, confusion among adjacent cranial nerves or venous sinuses, and reliance on superficial image cues without integrating clinical priors. From a safety standpoint, this analysis highlights the scenarios in which unvetted model output could be most harmful without expert oversight.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003eRobustness, Reliability, and Practical Use\u003c/h2\u003e\u003cp\u003eOperational metrics are vital before deployment in any educational or clinical context. We observed non-trivial parsing failures in several models and substantial latency variability. These issues translate into user friction and intermittent non-answers that would degrade an educational or clinical workflow. Together with the ablation sensitivity to missing history, our results argue for the following safeguards in deployment for a reliable educational workflow: strict output schemas, abstention when uncertainty is high, and prompt templates that explicitly include proper context (e.g., clinical history) to generate strong priors to narrow differentials.\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e For resident learning purposes, models may function as rapid feedback engines for multiple-choice practice, especially in tumor, spine, and common neuroradiology patterns.\u003csup\u003e24,\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e For high-stakes preparation or clinical decision support, reliance should remain supervised and selective, with particular caution in vascular and 
trauma content.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\u003ch2\u003eLimitations\u003c/h2\u003e\u003cp\u003eOur analysis is grounded in a single large exam corpus of multiple-choice questions and does not simulate oral boards where longitudinal case discussion, evolving data, and justification of management are central. Image presentation through APIs may not fully match diagnostic workstation quality, and we did not evaluate sequential imaging or multi-view angiographic series. Vendor APIs evolve rapidly; although we synchronized queries to reduce temporal drift, results reflect model snapshots in time. \u0026ldquo;Resident\u0026rdquo; performance score was quantified using the CNS average mean scores for each subsection. Finally, we evaluated output accuracy rather than explanation fidelity, as models can sometimes arrive at the correct answer for the wrong reasons.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec21\" class=\"Section2\"\u003e\u003ch2\u003eFuture directions\u003c/h2\u003e\u003cp\u003eFuture studies should test models on oral board\u0026ndash;style cases and evolving clinical scenarios, not just multiple-choice questions. Work is also needed to improve vision-language alignment with CT, MRI, and angiography to close gaps in vascular and trauma performance. Finally, models must be evaluated for explanation quality and built-in safeguards to ensure safe use in neurosurgical training and supervised care.\u003c/p\u003e\u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eCurrent LLMs can perform competitively on many neurosurgical board\u0026ndash;style questions and offer tangible value as study aids, particularly in structured domains. However, they remain below resident-level performance, are sensitive to history omission, and show inconsistent benefit from images, with pronounced deficits in neuroradiology\u0026rsquo;s hardest cases and in vascular and trauma content. 
Their variability in multimodal performance, dependence on complete history, and concentrated failures in high-stakes neurovascular domains preclude independent clinical decision support. Until visual encoders and medical alignment improve, deployment should remain supervised, with clear expectations about where models help, where they may mislead, and how to design prompts and guardrails that prioritize safety.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eConflict of Interest\u003c/h2\u003e\n\u003cp\u003eThe authors have no conflict of interest.\u003c/p\u003e\n\u003ch2\u003eConsent to Participate:\u003c/h2\u003e\n\u003cp\u003eThis manuscript does not involve human subjects. Consent to participate is not required. Human Ethics and Consent to Participate declarations: not applicable.\u003c/p\u003e\n\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\n\u003cp\u003eR.K., H.D., and J.J.S. synthesized the literature review, collected data, wrote the manuscript, performed analysis, prepared figures, and guided revisions. R.F. collected data and guided revisions. S.G. guided revisions. J.A.G. and J.W.G. wrote the manuscript, guided revisions, and guided project direction.\u003c/p\u003e\n\u003ch2\u003eAcknowledgement\u003c/h2\u003e\n\u003cp\u003eWe thank the Congress of Neurological Surgeons (CNS) for developing and maintaining the Self-Assessment Neurosurgery (SANS) Indications Exam, which served as the foundation for this benchmarking study. No material from the exams is distributed in this manuscript.\u003c/p\u003e\n\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eThis study did not generate or analyze new patient, genomic, or experimental data. The neurosurgical board\u0026ndash;style questions used in this benchmarking analysis were derived from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) Indications Exam and the American Board of Neurological Surgery (ABNS) primary examination framework. 
These materials are proprietary and not publicly available. All results are provided in aggregated form within the manuscript and supplementary materials. No raw exam questions, stems, or images are distributed.\u003c/p\u003e\n\u003ch2\u003eFinancial Disclosure\u003c/h2\u003e\n\u003cp\u003eThe authors have no financial disclosures.\u003c/p\u003e\n\u003ch2\u003eClinical Trial:\u003c/h2\u003e\n\u003cp\u003eClinical trial number: not applicable.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eArtsi Y, Sorin V, Glicksberg BS, et al. Challenges of Implementing LLMs in Clinical Practice: Perspectives. \u003cem\u003eJ Clin Med\u003c/em\u003e. 2025;14(17):6169. Published 2025 Sep 1. doi:10.3390/jcm14176169\u003c/li\u003e\n\u003cli\u003eRoustan D, Bastardot F. The Clinicians\u0026apos; Guide to Large Language Models: A General Perspective With a Focus on Hallucinations. \u003cem\u003eInteract J Med Res\u003c/em\u003e. 2025;14:e59823. Published 2025 Jan 28. doi:10.2196/59823\u003c/li\u003e\n\u003cli\u003eSu H, Sun Y, Li R, et al. Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis. \u003cem\u003eJ Med Internet Res\u003c/em\u003e. 2025;27:e72062. Published 2025 Jun 9. doi:10.2196/72062\u003c/li\u003e\n\u003cli\u003eHan T, Nebelung S, Khader F, et al. Medical large language models are susceptible to targeted misinformation attacks. \u003cem\u003eNPJ Digit Med\u003c/em\u003e. 2024;7(1):288. Published 2024 Oct 23. doi:10.1038/s41746-024-01282-7\u003c/li\u003e\n\u003cli\u003eRohren E, Ahmadzade M, Colella S, et al. Post-deployment Monitoring of AI Performance in Intracranial Hemorrhage Detection by ChatGPT. \u003cem\u003eAcad Radiol\u003c/em\u003e. Published online August 11, 2025. doi:10.1016/j.acra.2025.07.055\u003c/li\u003e\n\u003cli\u003eMac Donald CL, Yuh EL, Vande Vyvere T, et al. 
Neuroimaging Characterization of Acute Traumatic Brain Injury with Focus on Frontline Clinicians: Recommendations from the 2024 National Institute of Neurological Disorders and Stroke Traumatic Brain Injury Classification and Nomenclature Initiative Imaging Working Group. \u003cem\u003eJ Neurotrauma\u003c/em\u003e. 2025;42(13-14):1056-1064. doi:10.1089/neu.2025.0079\u003c/li\u003e\n\u003cli\u003eKamel Rahimi A, Pienaar O, Ghadimi M, et al. Implementing AI in Hospitals to Achieve a Learning Health System: Systematic Review of Current Enablers and Barriers. \u003cem\u003eJ Med Internet Res\u003c/em\u003e. 2024;26:e49655. Published 2024 Aug 2. doi:10.2196/49655\u003c/li\u003e\n\u003cli\u003eLin JJ, Klopfenstein J, Maldonado A, McCall T, Tsung A, Dinh DH. We Tabulated and Organized American Board of Neurological Surgeons Primary Exam Keywords (2015-2023) so You Don\u0026apos;t Have to. \u003cem\u003eCureus\u003c/em\u003e. 2023;15(5):e39402. Published 2023 May 23. doi:10.7759/cureus.39402\u003c/li\u003e\n\u003cli\u003eAli R, Tang OY, Connolly ID, et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. \u003cem\u003eNeurosurgery\u003c/em\u003e. 2023;93(6):1353-1365. doi:10.1227/neu.0000000000002632\u003c/li\u003e\n\u003cli\u003eHopkins BS, Nguyen VN, Dallas J, et al. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. \u003cem\u003eJ Neurosurg\u003c/em\u003e. 2023;139(3):904-911. Published 2023 Mar 24. doi:10.3171/2023.2.JNS23419\u003c/li\u003e\n\u003cli\u003eStengel FC, Stienen MN, Ivanov M, et al. Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues. \u003cem\u003eBrain Spine\u003c/em\u003e. 2024;4:102765. Published 2024 Feb 13. doi:10.1016/j.bas.2024.102765\u003c/li\u003e\n\u003cli\u003eMcNulty AM, Valluri H, Gajjar AA, Custozzo A, Field NC, Paul AR. 
Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: A comparative analysis. \u003cem\u003eJ Clin Neurosci\u003c/em\u003e. 2025;134:111097. doi:10.1016/j.jocn.2025.111097\u003c/li\u003e\n\u003cli\u003eGuerra GA, Hofmann H, Sobhani S, et al. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. \u003cem\u003eWorld Neurosurg\u003c/em\u003e. 2023;179:e160-e165. doi:10.1016/j.wneu.2023.08.042\u003c/li\u003e\n\u003cli\u003eSzmyd B, Podstawka M, Wiśniewski K, et al. AI-Driven Innovations in Neuroradiology and Neurosurgery: Scoping Review of Current Evidence and Future Directions. \u003cem\u003eCancers (Basel)\u003c/em\u003e. 2025;17(16):2625. Published 2025 Aug 11. doi:10.3390/cancers17162625\u003c/li\u003e\n\u003cli\u003eKumar, V. \u003cem\u003eet al.\u003c/em\u003e (2024). Large-Language-Models (LLM)-Based AI Chatbots: Architecture, In-Depth Analysis and Their Performance Evaluation. In: Santosh, K., \u003cem\u003eet al.\u003c/em\u003e Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2023. Communications in Computer and Information Science, vol 2027. Springer, Cham. https://doi.org/10.1007/978-3-031-53085-2_20\u003c/li\u003e\n\u003cli\u003eZhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig; How Can We Know What Language Models Know?. \u003cem\u003eTransactions of the Association for Computational Linguistics\u003c/em\u003e 2020; 8 423\u0026ndash;438. 
doi: https://doi.org/10.1162/tacl_a_00324\u003c/li\u003e\n\u003cli\u003eZhao, Wayne \u0026amp; Zhou, Kun \u0026amp; Junyi, Li \u0026amp; Tianyi, Tang \u0026amp; Wang, Xiaolei \u0026amp; Hou, Yupeng \u0026amp; Min, Yingqian \u0026amp; Zhang, Beichen \u0026amp; Zhang, Junjie \u0026amp; Dong, Zican \u0026amp; Du, Yifan \u0026amp; Yang, Chen \u0026amp; Chen, Yushuo \u0026amp; Chen, Zhipeng \u0026amp; Jiang, Jinhao \u0026amp; Ren, Ruiyang \u0026amp; Li, Yifan \u0026amp; Tang, Xinyu \u0026amp; Liu, Zikang \u0026amp; Wen, Ji-Rong. (2023). A Survey of Large Language Models. 10.48550/arXiv.2303.18223.\u003c/li\u003e\n\u003cli\u003eThirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. \u003cem\u003eNat Med\u003c/em\u003e. 2023;29(8):1930-1940. doi:10.1038/s41591-023-02448-8\u003c/li\u003e\n\u003cli\u003eShah NH, Entwistle D, Pfeffer MA. Creation and Adoption of Large Language Models in Medicine. \u003cem\u003eJAMA\u003c/em\u003e. 2023;330(9):866-869. doi:10.1001/jama.2023.14217\u003c/li\u003e\n\u003cli\u003eShukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen, A survey on multimodal large language models, \u003cem\u003eNational Science Review\u003c/em\u003e, Volume 11, Issue 12, December 2024, nwae403, https://doi.org/10.1093/nsr/nwae403\u003c/li\u003e\n\u003cli\u003eKrupinski EA. The role of perception in imaging: past and future. \u003cem\u003eSemin Nucl Med\u003c/em\u003e. 2011;41(6):392-400. doi:10.1053/j.semnuclmed.2011.05.002\u003c/li\u003e\n\u003cli\u003eS. Khan, M. R. Biswas, A. Murad, H. Ali and Z. Shah, \u0026quot;An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging,\u0026quot; 2024 IEEE International Conference on Information Reuse and Integration for Data Science (IRI), San Jose, CA, USA, 2024, pp. 234-239, doi: 10.1109/IRI62200.2024.00056. \u003c/li\u003e\n\u003cli\u003eKulkarni, Nilesh \u0026amp; Tupsakhare, Preeti. (2024). 
Crafting Effective Prompts: Enhancing AI Performance through Structured Input Design. JOURNAL OF RECENT TRENDS IN COMPUTER SCIENCE AND ENGINEERING. 12. 1-10. 10.70589/JRTCSE.2024.5.1.\u003c/li\u003e\n\u003cli\u003eSafranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D.The Role of Large Language Models in Medical Education: Applications and Implications. JMIR Med Educ 2023;9:e50945\u003cbr\u003edoi: 10.2196/50945\u003c/li\u003e\n\u003cli\u003ePresbitero P, Gasparini GL, Pagnotta P. Images in cardiovascular medicine. Intra-arterial thrombolysis for left middle cerebral artery embolic stroke during coronary angiography. \u003cem\u003eCirculation\u003c/em\u003e. 2006;113(5):e64-e66. doi:10.1161/CIRCULATIONAHA.105.552802\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-7744596/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7744596/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground:\u003c/strong\u003e Large language models (LLMs) are increasingly applied in clinical use and medical education, yet their reliability across both textual and visual modalities in highly specialized domains such as neurosurgery remains unclear. Prior evaluations have been limited to text-only questions or narrow subsets of models, leaving the role of multimodal reasoning poorly defined.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods:\u003c/strong\u003e We benchmarked the latest LLMs, including 12 text-only and 10 multimodal systems, on 476 board-style neurosurgical questions spanning 13 subspecialties. Models were queried synchronously, temperature=0 under standardized prompting with no reasoning allowed mirroring examination conditions. Accuracy was compared across text and visual modalities, stratified by subspecialty and imaging type. Robustness was assessed using latency, parsing failures, and ablation experiments withholding clinical vignette components.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults:\u003c/strong\u003e Text only and multimodal models achieved nearly identical mean accuracies of 67.9% and 68.5%, indicating that visual inputs did not provide a consistent overall benefit. Performance differed markedly across individual models. 
Gemini 2.5 Pro, Grok, and GPT 5 exceeded 80% accuracy, approaching resident performance. GPT 4.0 and GPT 4.5 followed in the high 70s. Claude Sonnet 3.7 and Claude Opus 4.1 performed in the mid 70s, while MedGemma and Llama 4 clustered in the low 70s. DeepSeek R1V3 performed close to chance. On image-based questions Gemini 2.5 Pro again led, while Grok, GPT 4.0, GPT 5, and Claude Opus clustered near 70% and Llama 4 models dropped to approximately 50%.\u003c/p\u003e\n\u003cp\u003eSubspecialty analysis showed that visual input improved performance in neuroradiology, tumor, pediatrics, and spine. Trauma, vascular, and pain questions became less accurate with images, producing a bimodal pattern of benefit. Ablation experiments showed that removal of history produced the largest decline in accuracy (19.3% reduction), while withholding physical exam or lab data produced smaller effects (6.0% and 5.9%). A set of questions that no model could answer correctly accounted for 4% of the dataset. These questions were clustered in neuroradiology, vascular anatomy, and rare pediatric conditions.\u003c/p\u003e\n\u003cp\u003eOperational findings highlighted practical issues. The most accurate models were often slower to respond. Latency ranged from 0.22 seconds to more than 27 seconds. Parsing failures were uncommon in GPT 5, GPT 4.5, and Llama 4 but exceeded 13% in Claude Opus.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusions:\u003c/strong\u003e Current LLMs can approach resident-level performance in structured neurosurgical domains and demonstrate selective benefits from visual input, but remain unreliable in anatomy-heavy, high-stakes contexts such as vascular and trauma. Their dependence on clinical history and susceptibility to systematic visual errors highlight the need for improved vision–language alignment before unsupervised clinical use. 
Until then, their role is best suited to supervised educational support with explicit safeguards.\u003c/p\u003e","manuscriptTitle":"Performance of Artificial Intelligence and Large Language Models (LLMs) on Neurosurgical Board Examinations Across Text and Visual Modalities","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-10-21 23:13:52","doi":"10.21203/rs.3.rs-7744596/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a6462866-3bad-453a-a285-ce38aac5e2db","owner":[],"postedDate":"October 21st, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":56339143,"name":"Health sciences/Health care"},{"id":56339144,"name":"Health sciences/Medical research"},{"id":56339145,"name":"Health sciences/Neurology"},{"id":56339146,"name":"Biological sciences/Neuroscience"}],"tags":[],"updatedAt":"2025-10-27T14:01:03+00:00","versionOfRecord":[],"versionCreatedAt":"2025-10-21 23:13:52","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7744596","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7744596","identity":"rs-7744596","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
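The Methods text above defines three quantities that are easy to misread in prose: the ablation performance drop (baseline accuracy minus ablation-variant accuracy), Welch's t-test for the baseline-versus-ablation comparison, and the rule that an item is "impossible" when at least 7 of the 10 visual models miss it. The following is a minimal standard-library sketch of those definitions, not the authors' analysis code. The function names (`performance_drop`, `welch_t`, `impossible_items`) and all 0/1 correctness data are hypothetical, and the sketch stops at the t statistic and Welch-Satterthwaite degrees of freedom rather than computing a p-value.

```python
from math import sqrt
from statistics import mean, variance


def accuracy(correct):
    """Fraction of items answered correctly; `correct` is a list of 0/1 flags."""
    return sum(correct) / len(correct)


def performance_drop(baseline, ablated):
    """Baseline accuracy minus ablation-variant accuracy."""
    return accuracy(baseline) - accuracy(ablated)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Uses sample variances (n - 1 denominator) and does not assume
    equal variances or equal sample sizes.
    """
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def impossible_items(results, min_wrong=7):
    """IDs of items missed by at least `min_wrong` models.

    `results` maps an item ID to a list of 0/1 flags, one per model.
    """
    return [qid for qid, flags in results.items()
            if len(flags) - sum(flags) >= min_wrong]


# Hypothetical toy data: 1 = correct, 0 = incorrect.
baseline = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]   # full vignette
ablated = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]    # history withheld

# Hypothetical per-item results across 10 visual models.
results = {
    "q1": [0] * 8 + [1] * 2,   # 8 models wrong
    "q2": [1] * 10,            # no model wrong
    "q3": [0] * 7 + [1] * 3,   # exactly 7 wrong, meets the threshold
    "q4": [0] * 6 + [1] * 4,   # 6 wrong, below the threshold
}

drop = performance_drop(baseline, ablated)
t_stat, dof = welch_t(baseline, ablated)
flagged = impossible_items(results)
```

In practice a p-value would come from the t distribution with `dof` degrees of freedom, for example via a statistics library; that step is omitted here so the sketch has no third-party dependencies.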