Multimodal Large Language Models Challenge NEJM Image Challenge

doi:10.21203/rs.3.rs-8028355/v1

2025 · doi:10.21203/rs.3.rs-8028355/v1

preprint OA: closed CC-BY-4.0

Chiyu Sheng, Shumin Shen, Lin Wang, Jie Chen, Wei Chen, Nianfei Wang, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8028355/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 10 Feb, 2026, in Scientific Reports.
Version 1 posted 10

Abstract

Background Theoretically, multimodal large language models better reflect real-world clinical scenarios in disease diagnosis compared to text-only large language models. The New England Journal of Medicine Image Challenge contains real clinical cases with images and textual materials, making it the optimal resource for testing multimodal LLM diagnostic accuracy. Methods We analyzed 272 Image Challenge cases (June 2009 to March 2025) containing both images and clinical text.
Three LLMs—GPT-4o, Claude 3.7, and Doubao—were evaluated against responses from 16,401,888 physicians worldwide (mean, 60,301 per case). Models were tested with images alone and with combined image-text inputs. The primary outcome was diagnostic accuracy in the multimodal condition. Results All LLMs significantly outperformed physicians (P < 0.001). Diagnostic accuracy in multimodal testing was 89.0% (95% CI, 84.9 to 92.3) with Claude 3.7, 88.6% (95% CI, 84.5 to 92.0) with GPT-4o, and 71.0% (95% CI, 65.3 to 76.2) with Doubao, compared with 46.7% (95% CI, 40.7 to 52.7) for physician majority vote—an absolute difference exceeding 40 percentage points for top-performing models. In diagnostically challenging cases where fewer than 40% of physicians were correct, Claude 3.7 maintained 86.5% accuracy versus 33.4% for physicians. Despite high accuracy, model-physician concordance was low (Cohen's κ, 0.08 to 0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7. Adding clinical text to images improved accuracy by 28 to 42 percentage points across models. At least one model was correct in 96.3% of cases. Conclusions Multimodal testing achieved significantly higher diagnostic accuracy than image-only evaluation and substantially exceeded physician diagnostic performance. High AI accuracy coupled with low physician-AI concordance indicates that multimodal large language models utilize fundamentally different diagnostic reasoning processes. These findings suggest multimodal LLMs may function as valuable diagnostic assistants, augmenting rather than replacing physician clinical decision-making. 
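The abstract reports accuracies with 95% confidence intervals, and the Methods name the Wilson method. A minimal sketch of that interval follows, assuming the standard Wilson score formula without continuity correction. The count of 242 correct cases out of 272 is derived here from the reported 89.0% accuracy and is not a figure quoted from the authors' code; the endpoints this sketch prints may differ in the last digit from the published ones, depending on rounding or the software variant used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 242 of 272 correct, which corresponds to the reported 89.0% accuracy.
low, high = wilson_interval(242, 272)
print(f"{242 / 272:.1%} (95% CI, {low:.1%} to {high:.1%})")
```

The Wilson interval is asymmetric around the point estimate, which is why the lower bound sits further from 89.0% than the upper bound does.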
Biological sciences/Computational biology and bioinformatics; Health sciences/Diseases; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research
Artificial Intelligence; Multimodal Large Language Models; Diagnostic Accuracy; Rare Disease; Medical Imaging

Introduction Accurate diagnosis is fundamental to effective medical treatment, yet diagnostic errors affect millions of patients annually. A meta-analysis of 22 studies involving 80,026 hospitalized patients found a harmful diagnostic error rate of 0.7% (95% CI, 0.5%-1.1%), translating to approximately 249,900 harmful diagnostic errors annually in the United States 1 . The burden is substantially higher in outpatient settings, where diagnostic errors affect 5.08% of adults, or roughly 1 in 20 (approximately 12 million Americans each year), and half of these errors are potentially harmful 2 . This challenge is magnified for rare diseases, where patients endure a median diagnostic delay of 4.7 years and 40% receive multiple incorrect diagnoses before the correct one is identified 3 . For the estimated 300 million individuals worldwide affected by rare diseases, the scarcity of specialized expertise and the tendency of rare conditions to mimic common diseases create particularly formidable diagnostic obstacles 4 . Recent advances in large language models (LLMs) have demonstrated remarkable performance on standardized medical examinations, suggesting potential diagnostic capabilities 5 – 7 . GPT-4 exceeded the United States Medical Licensing Examination passing threshold by more than 20 points, achieving 86.5% accuracy on complex clinical questions 8 .
Med-PaLM 2 reached 86.5% accuracy on medical question-answering datasets, with physician evaluators preferring its responses over human physicians' on eight of nine clinical utility metrics 9 . These achievements indicate that current LLMs possess substantial medical knowledge that could theoretically be applied to diagnostic challenges. However, translating examination performance into clinical utility reveals critical limitations. A 2024 systematic review found that while AI models achieved diagnostic accuracy comparable to non-expert physicians (52.1%), they remained significantly inferior to specialists by 15.8% (P = 0.007) 10 . The recent development of multimodal LLMs capable of processing both images and text simultaneously now enables direct comparison of AI and physician diagnostic performance in a format that mirrors clinical practice 11 – 14 . In a clinical evaluation of 150 dermatological cases, SkinGPT-4, a multimodal diagnostic system, achieved 80.63% accuracy validated by board-certified dermatologists 15 . Han et al. demonstrated that GPT-4V achieved superior diagnostic accuracy compared with unimodal predecessors (GPT-4, GPT-3.5) and contemporary models (Gemini Pro, Llama 2, Med42) across both JAMA Clinical Challenge and NEJM Image Challenge datasets, establishing that multimodal capabilities enable medical image interpretation without specialized fine-tuning 16 . However, GPT-4V demonstrated poor diagnostic performance in radiological contexts, achieving accuracy rates of only 8% without clinical context and 29% with contextualization when requiring the most likely diagnosis 17 . Fenglin Liu et al. developed Med-MLLM, a medical multimodal large language model evaluated across five COVID-19 datasets, demonstrating superior performance in COVID-19 reporting, diagnosis, and prognosis tasks even with minimal labeled data (1%) 18 . 
In COVID-19 diagnostic image-text classification tasks, the model achieved 90.3% diagnostic accuracy (AUC) when trained with complete datasets, indicating high efficiency and accuracy for rare disease diagnosis. Given these limitations in image-only interpretation, we sought to evaluate the diagnostic performance of three state-of-the-art multimodal LLMs—GPT-4o, Claude 3.7, and Doubao—using both image-alone and image-plus-text modalities across NEJM Image Challenge cases, comparing their accuracy against global physician performance to determine the clinical potential of AI-assisted rare disease diagnosis. Methods 2.1 Data Sources We analyzed 272 consecutive cases from the NEJM Image Challenge published between June 27, 2009, and March 27, 2025. Cases were included if they contained both clinical images and text descriptions. Cases with images only or those flagged for content violations during testing were excluded. 2.2 Model Selection and Testing Three publicly available multimodal large language models were evaluated: GPT-4o (OpenAI), Claude 3.7 (Anthropic), and Doubao (ByteDance). Each model was tested through its official web interface using standardized prompts instructing selection of the correct answer from five options with supporting rationale. Models underwent two-phase testing for each case: image-only followed by multimodal (image plus text). We recorded the selected answer, diagnostic choice, and complete model response. No fine-tuning or repeat querying was performed. For physician-model comparisons, we used multimodal LLM results, as NEJM respondents had access to both images and text. 2.3 Physician Benchmark Physician performance data were obtained from NEJM's published results, comprising 16,401,888 responses (mean, 60,301 physicians per case; range, 12,066–185,210). Physician accuracy was defined as the proportion selecting the correct diagnosis. 2.4 Statistical Analysis The primary outcome was diagnostic accuracy. 
We compared model and physician performance using McNemar's test and assessed concordance with Cohen's kappa. Secondary analyses included performance stratification by physician consensus (< 40%, 40–69%, ≥ 70%), disease category, imaging modality, age group (< 1 year, 1–12 years, > 12 years), and sex. Model sensitivity was calculated for cases where physician accuracy was < 50% or < 33%. Ensemble performance was evaluated through majority vote. Subgroups with fewer than 5 cases were excluded. Sex equity was defined as an accuracy difference of < 5 percentage points. Statistical analyses were performed with R version 4.3.0. Two-sided P values < 0.05 were considered significant. Confidence intervals were calculated using the Wilson method. 2.5 Ethics This study used publicly available NEJM cases without patient identifiers. Institutional review board approval was not required. Results 3.1 Patient and Case Characteristics The study comprised 272 diagnostic cases from the NEJM Image Challenge. The cohort included 159 male patients (58.5%) and 113 female patients (41.5%). Age distribution spanned from infancy to advanced age: 155 patients (56.9%) were aged 13–60 years, 81 (29.8%) were older than 60 years, and 36 (13.3%) were younger than 13 years. Infectious diseases accounted for 70 cases (25.7%), immune-mediated diseases for 47 (17.3%), and neoplastic diseases for 38 (14.0%). The remaining cases were distributed among genetic disorders, vascular diseases, metabolic conditions, trauma-related pathology, drug-induced diseases, and degenerative disorders. Physical examination findings constituted 142 images (52.2%), radiologic studies 65 (23.9%), and the remainder included combination images, pathologic specimens, endoscopic findings, and electrocardiographic tracings. Physician participation averaged 60,301 per case (range, 12,066 to 185,210), totaling 16,401,888 individual responses. Mean physician diagnostic accuracy was 50.1% (SD, 11.8%; range, 26% to 88%).
This variation in physician performance across cases provided a robust benchmark for evaluating model performance across different levels of diagnostic complexity.

Table 1 Characteristics of Patients, Cases, and Physician Performance.
Patient and Case Characteristics (N = 272)
Demographic Characteristics
  Sex, no. (%)
    Female                              113 (41.5)
    Male                                159 (58.5)
  Age Distribution, no. (%)
    < 13 yr                             36 (13.3)
    13–60 yr                            155 (56.9)
    > 60 yr                             81 (29.8)
Clinical Characteristics
  Disease Classification, no. (%)
    Infectious diseases                 70 (25.7)
    Immune-mediated diseases            47 (17.3)
    Neoplastic diseases                 38 (14.0)
    Genetic/congenital diseases         23 (8.5)
    Vascular diseases                   23 (8.5)
    Metabolic/nutritional diseases      21 (7.7)
    Traumatic/physical diseases         19 (7.0)
    Drug/toxin-related diseases         18 (6.6)
    Degenerative/functional diseases    12 (4.4)
    Ectopic diseases                    1 (0.4)
  Image Type, no. (%)
    Physical signs                      142 (52.2)
    Radiological                        65 (23.9)
    Combination                         29 (10.7)
    Pathological                        16 (5.9)
    Other                               9 (3.3)
    Endoscopic                          8 (2.9)
    Electrocardiographic                3 (1.1)
Physician Assessment Performance
    Mean accuracy ± SD (%)              50.1 ± 11.8
    Accuracy range (%)                  26–88
    Physician participants per case, mean   60,301
    Physician participants range        12,066–185,210

3.2 Diagnostic Accuracy of LLMs versus Physicians 3.2.1 Diagnostic Performance in Multimodal Testing In the multimodal evaluation of 272 clinical cases, all three large language models significantly outperformed physicians (Fig. 1). Claude 3.7 and GPT-4o achieved comparable diagnostic accuracy rates of approximately 90%. The absolute difference in accuracy between these models and physician majority vote exceeded 40 percentage points (P < 0.001 for all comparisons). Doubao, though less accurate than Claude 3.7 and GPT-4o, also significantly outperformed the physician benchmark (P < 0.001). Figure 1. Diagnostic Accuracy of Large Language Models versus Physicians in Multimodal Testing.
Bar graph shows the diagnostic accuracy of three large language models (Claude 3.7, GPT-4o, and Doubao) compared with physician majority vote for 272 multimodal clinical cases from the NEJM Image Challenge. Error bars indicate 95% confidence intervals calculated with the Wilson method. The dashed horizontal line represents chance performance (50%). All models significantly outperformed physicians (P < 0.001 for all comparisons, McNemar's test). *** P < 0.001. 3.2.2 Performance Stratified by Case Difficulty Large language model performance remained superior across all levels of diagnostic difficulty, as stratified by physician consensus (Fig. 2 ). All models achieved high accuracy in cases with strong physician agreement (≥ 70% consensus). In cases with low physician consensus (< 40% correct), where diagnostic uncertainty was greatest, Claude 3.7 maintained 86.5% accuracy, compared with mean physician accuracy of 33.4%. Line graph displays diagnostic accuracy across three levels of physician consensus. Low consensus indicates cases where fewer than 40% of physicians selected the correct diagnosis (n = 52); moderate consensus, 40% to 69% correct (n = 201); and high consensus, 70% or more correct (n = 19). Background shading corresponds to consensus levels (red, low; yellow, moderate; green, high). All language models maintained high accuracy (> 78%) even in low-consensus cases, whereas physician accuracy increased from 33.4% in low-consensus to 77.3% in high-consensus cases. The dashed horizontal line indicates chance performance. 3.2.3 Robustness of Findings The analysis included 16,401,888 physician responses, with participation ranging from 12,066 to 185,210 physicians per case. Weighted analyses accounting for differential participation rates yielded results identical to those of unweighted analyses. Effect sizes were large for both Claude 3.7 (Cohen's h = 0.96) and GPT-4o (Cohen's h = 0.95). These findings remained consistent across all analytical approaches. 
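The effect sizes quoted in section 3.2.3 can be reproduced from the reported accuracies. The sketch below assumes the usual definition of Cohen's h as the difference between arcsine-transformed proportions, and it assumes the comparison is between each model's multimodal accuracy and the physician majority-vote accuracy of 46.7%; the text does not state which physician figure the authors used.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-square-root transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Reported multimodal accuracies versus physician majority vote (assumed comparator).
physician = 0.467
for name, accuracy in [("Claude 3.7", 0.890), ("GPT-4o", 0.886)]:
    print(f"{name}: h = {cohens_h(accuracy, physician):.2f}")
```

Under these assumptions the values round to 0.96 and 0.95, matching the figures in the text. By Cohen's conventional thresholds, h of 0.8 or more counts as a large effect.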
3.3 Diagnostic Concordance and Complementarity 3.3.1 Performance Independence from Case Difficulty Large language model performance showed minimal correlation with physician consensus levels (Fig. 3 ). Physician accuracy ranged from 26% to 88% across cases. In contrast, Claude 3.7 and GPT-4o maintained high accuracy regardless of case difficulty. In cases where fewer than 40% of physicians were correct, Claude 3.7 achieved 86.5% accuracy and GPT-4o achieved 78.8% accuracy. Doubao showed greater variation (46.2% accuracy in low-consensus cases vs. 100% in high-consensus cases). Bubble plot analysis (Fig. 4 ) confirmed these patterns. More than 98% of cases for Claude 3.7 and GPT-4o fell above the line of equal performance with physicians, compared with 90.8% for Doubao. The concentration of large bubbles in the upper regions indicated consistent model superiority across all difficulty levels. Individual diagnostic outcomes for three large language models are shown for 272 cases. Points represent correct (1) or incorrect (0) diagnoses as a function of the proportion of physicians selecting the correct answer. Vertical jittering prevents overlap. Background shading indicates physician consensus levels: red (< 40% correct), yellow (40–69% correct), and green (≥ 70% correct). Smooth curves were fitted with locally weighted regression (LOESS) with 95% confidence intervals (shaded areas). GPT-4o and Claude 3.7 maintained consistent performance across all difficulty levels, whereas Doubao showed greater sensitivity to case difficulty. Numbers of cases: low consensus, 52; moderate consensus, 201; and high consensus, 19. Diagnostic accuracy of three large language models is plotted against physician accuracy for 272 cases. Bubble size is proportional to the number of cases at each accuracy level. The diagonal line represents equal performance. Smooth curves show locally weighted regression weighted by case frequency. 
Cases above the diagonal line indicate superior model performance: GPT-4o, 269 of 272 (98.9%); Claude 3.7, 267 of 272 (98.2%); and Doubao, 247 of 272 (90.8%). 3.3.2 Diagnostic Agreement and Complementarity Agreement between large language models and physicians was low despite high model accuracy (Table 2). Cohen's kappa values were 0.08 (95% CI, -0.04 to 0.19) for GPT-4o, 0.08 (95% CI, -0.03 to 0.20) for Claude 3.7, and 0.24 (95% CI, 0.13 to 0.35) for Doubao. The combination of low kappa values and high model accuracy suggested different diagnostic reasoning pathways. When physician majority vote was incorrect (< 50% accuracy), GPT-4o and Claude 3.7 correctly diagnosed 84.8% of cases, and Doubao diagnosed 59.3% correctly. Among the 8 cases with physician accuracy below 33%, GPT-4o and Claude 3.7 maintained 62.5% accuracy.

Table 2 Agreement and Complementarity Between Large Language Models and Physicians in Clinical Diagnosis.
Model        Kappa   Kappa 95% CI      Sensitivity (< 50%)   Sensitivity (< 33%)   Specificity
GPT-4o       0.08    -0.04 to 0.19     84.8%                 62.5%                 89.5%
Claude 3.7   0.08    -0.03 to 0.20     84.8%                 62.5%                 94.7%
Doubao       0.24    0.13 to 0.35      59.3%                 37.5%                 100.0%
Cohen's κ values measure agreement between each model and physician majority vote (≥ 50% of physicians correct) beyond chance; values near 0 indicate no better than random agreement. Sensitivity indicates model accuracy when physician accuracy was < 50% (145 cases) or < 33% (8 cases). Specificity indicates model accuracy when physician accuracy was ≥ 70% (19 cases). CI denotes confidence interval.

3.3.3 Patterns of Concordance and Discordance Confusion matrices revealed asymmetric agreement patterns (Fig. 5). For Claude 3.7, model success with physician failure occurred in 123 cases (45.2%), mutual success in 119 cases (43.8%), mutual failure in 22 cases (8.1%), and physician success with model failure in 8 cases (2.9%). This yielded a 15.4:1 ratio of model-advantage to physician-advantage cases. GPT-4o showed similar patterns.
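The four Claude 3.7 counts in section 3.3.3 are enough to recompute both the agreement statistic and the advantage ratio. The following is a minimal sketch, assuming the standard two-rater Cohen's kappa on the correct/incorrect outcome; the function and argument names are illustrative and not taken from the authors' R analysis.

```python
def cohens_kappa(both_correct: int, model_only: int, physician_only: int, both_wrong: int) -> float:
    """Cohen's kappa for a 2x2 table of model vs. physician correctness."""
    n = both_correct + model_only + physician_only + both_wrong
    observed = (both_correct + both_wrong) / n
    model_rate = (both_correct + model_only) / n
    physician_rate = (both_correct + physician_only) / n
    # Agreement expected by chance if the two raters were independent.
    expected = model_rate * physician_rate + (1 - model_rate) * (1 - physician_rate)
    return (observed - expected) / (1 - expected)


# Claude 3.7 counts as reported: 119 mutual success, 123 model-only success,
# 8 physician-only success, 22 mutual failure (272 cases in total).
kappa = cohens_kappa(both_correct=119, model_only=123, physician_only=8, both_wrong=22)
advantage_ratio = 123 / 8
print(f"kappa = {kappa:.2f}, model:physician advantage = {advantage_ratio:.1f}:1")
```

Observed agreement is only about 52% because the two disagree on 131 cases, nearly all in the model's favor. That asymmetry pushes kappa toward zero even though the model is right far more often, so low kappa here reflects unequal marginal accuracies rather than random behavior.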
Models excelled particularly in cases with low physician consensus. Confusion matrices compare diagnostic outcomes between each model and physician majority vote (≥ 50% correct) for 272 cases. Values show the number of cases with percentages in parentheses. Shading intensity corresponds to percentage. The ratio of model-correct/physician-incorrect to physician-correct/model-incorrect cases was 11:1 for GPT-4o, 15.4:1 for Claude 3.7, and 4:1 for Doubao. 3.3.4 Ensemble Performance All three models agreed on the correct diagnosis in 171 cases (62.9%). At least one model was correct in 262 cases (96.3%), and all models were incorrect in 10 cases (3.7%). When physician majority vote was included, complete diagnostic failure (all models and physicians incorrect) occurred in 9 cases (3.3%). 3.4 Performance Across Clinical Contexts 3.4.1 Disease Category Analysis Model accuracy varied by disease category (Fig. 6 ). Claude 3.7 achieved 100% accuracy in drug- and toxin-related diseases (18 of 18 cases), 95.7% in immune-mediated diseases (45 of 47 cases), and 95.7% in genetic disorders (22 of 23 cases). GPT-4o achieved 97.9% accuracy in immune-mediated diseases (46 of 47 cases). All models had lower accuracy for traumatic diseases (Claude 3.7, 73.7% [14 of 19 cases]; GPT-4o, 78.9% [15 of 19 cases]; Doubao, 63.2% [12 of 19 cases]). The largest performance gaps between models and physicians occurred in drug-related diseases (Claude 3.7, 100% vs. physicians, 49.3%; difference, 50.7 percentage points) and genetic disorders (Claude 3.7, 95.7% vs. physicians, 45.5%; difference, 50.2 percentage points). The smallest gap occurred in vascular diseases (physicians, 52.0%; Doubao and Claude 3.7, 78.3%). Heat map showing the diagnostic accuracy (percentage of correct diagnoses) of three large language models (Claude 3.7, GPT-4o, and Doubao) and physicians across nine disease categories. Values in each cell represent the percentage accuracy for that model-disease combination. 
Darker blue shading indicates higher accuracy. Disease categories are ordered by overall performance. Analysis includes 272 clinical cases with text and images. 3.4.2 Performance by Image Type Model accuracy varied by imaging modality (Fig. 7). Claude 3.7 and GPT-4o achieved 100% accuracy with endoscopic images (8 of 8 cases), 96.6% with combination images (28 of 29 cases), and high accuracy with pathological specimens (Claude 3.7, 93.8% [15 of 16 cases]; GPT-4o, 100% [16 of 16 cases]). For physical signs (142 cases), Claude 3.7 achieved 91.5% accuracy and GPT-4o achieved 89.4% accuracy, compared with 49.1% for physicians. With radiological images (65 cases), accuracy was 81.5% for Claude 3.7 and 84.6% for GPT-4o, compared with 53.5% for physicians. Bar graph comparing the diagnostic accuracy of three large language models and physicians across different types of medical images. Error bars represent 95% confidence intervals. Analysis includes 272 clinical cases with both images and text. 3.4.3 Performance by Age and Sex Claude 3.7 achieved 100% accuracy in infants younger than 1 year (16 of 16 cases), compared with 49.6% for physicians. In children 1 to 12 years of age, GPT-4o achieved 95.0% accuracy (19 of 20 cases) and Claude 3.7 achieved 80.0% accuracy (16 of 20 cases). Among patients older than 12 years (86.8% of the cohort), model performance approximated overall averages (Table 3). Sex-based differences in accuracy were minimal. The largest difference was 8.8 percentage points for Doubao (females, 76.1%; males, 67.3%). Differences were 1.3 percentage points for GPT-4o and 0.7 percentage points for both Claude 3.7 and physicians.

Table 3 Age and Sex-Stratified Diagnostic Performance.
Category    Subgroup     N     Claude 3.7 (%)      GPT-4o (%)   Doubao (%)   Physician (%)
Age Group   > 12 years   236   89.0 (85.0–93.0)    88.1         70.3         50.1
Sex         Female       113   89.4                89.4         76.1         49.7
Sex         Male         159   88.7                88.1         67.3         50.4
Values are diagnostic accuracy percentages.
Confidence intervals (95%) for Claude 3.7 were calculated with the Wilson method. Physician values represent the mean proportion selecting the correct diagnosis. Age groups: infant (< 1 year), child (1–12 years), and > 12 years. 3.5 Multimodal Performance Enhancement 3.5.1 Overall Performance Adding clinical text to images improved diagnostic accuracy for all models (Fig. 8). Accuracy increased from 47.1% to 89.0% for Claude 3.7 (difference, 41.9 percentage points), from 58.8% to 88.6% for GPT-4o (difference, 29.8 percentage points), and from 42.6% to 71.0% for Doubao (difference, 28.3 percentage points). All differences were significant (P < 0.001 by McNemar's test). Diagnostic accuracy of three large language models with images alone (gray bars) and with images plus clinical text (green bars) for 272 cases. Horizontal brackets indicate pairwise comparisons (McNemar's test). Absolute improvements in accuracy: Claude 3.7, 41.9 percentage points (from 47.1% to 89.0%); GPT-4o, 29.8 percentage points (from 58.8% to 88.6%); and Doubao, 28.3 percentage points (from 42.6% to 71.0%). Error bars represent 95% confidence intervals. *** P < 0.001. 3.5.2 Individual Case Patterns Among 272 cases, diagnostic outcomes after adding clinical text were as follows (Fig. 9): For Claude 3.7, accuracy improved in 120 cases (44.1%) and remained unchanged in 146 cases (53.7%). For GPT-4o, accuracy improved in 89 cases (32.7%) and remained unchanged in 175 cases (64.3%). Doubao showed improvement in 77 cases (28.3%) and unchanged accuracy in 195 cases (71.7%). Unexpectedly, the addition of clinical text led to diagnostic errors in previously correct cases for GPT-4o (8 cases, 2.9%) and Claude 3.7 (6 cases, 2.2%), whereas Doubao showed no such deterioration. Case 20211007 was the only case in which both GPT-4o and Claude 3.7 changed from a correct to an incorrect diagnosis after text addition (Fig. 10 and Fig. 11).
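The per-case counts in section 3.5.2 tie directly to the headline improvements: the net gain in accuracy equals the number of cases that improved minus the number that deteriorated, divided by 272. The sketch below checks that arithmetic and applies an exact (binomial) form of McNemar's test to the discordant pairs. The exact form is an assumption made here for a dependency-free example; the paper does not say which variant of the test was run in R.

```python
from math import comb


def net_change_pp(improved: int, deteriorated: int, n: int) -> float:
    """Net accuracy change in percentage points from paired before/after outcomes."""
    return 100 * (improved - deteriorated) / n


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# (improved, deteriorated) after adding clinical text, as reported for 272 cases.
counts = {"Claude 3.7": (120, 6), "GPT-4o": (89, 8), "Doubao": (77, 0)}
for model, (improved, deteriorated) in counts.items():
    gain = net_change_pp(improved, deteriorated, 272)
    p_value = mcnemar_exact(improved, deteriorated)
    print(f"{model}: {gain:+.1f} percentage points, P = {p_value:.2g}")
```

Only discordant pairs enter the test, so the 146, 175, and 195 unchanged cases do not affect the P value. They do enter the denominator of the net gain.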
We extracted the response content from both iterations for all 13 cases in which a large language model produced an erroneous diagnosis after text augmentation during testing. By analyzing their "reasoning" processes, we identified potential causes underlying this phenomenon (Table 4). Waterfall plots show individual patient outcomes for 272 cases when clinical text was added to image-based diagnosis. Each vertical bar represents one patient, ordered by identification number. Bar height indicates change in diagnostic outcome: +100% (green), improvement from incorrect to correct; 0% (gray), unchanged; -100% (red), deterioration from correct to incorrect. Panels show results for GPT-4o (top), Claude 3.7 (middle), and Doubao (bottom). Numbers below each panel indicate cases in each category. Claude 3.7: 120 improved (44.1%), 146 unchanged (53.7%), 6 deteriorated (2.2%). GPT-4o: 89 improved (32.7%), 175 unchanged (64.3%), 8 deteriorated (2.9%). Doubao: 77 improved (28.3%), 195 unchanged (71.7%), 0 deteriorated. Patient identifiers indicate NEJM publication dates (YYYYMMDD). Numbers in parentheses show cases in each category. Red indicates the single case (20211007) where both GPT-4o and Claude 3.7 deteriorated. MRI showing multifocal ring-enhancing brain lesions (left) and microscopy demonstrating filamentous branching Gram-positive rods (right), suggestive of nocardiosis. Both GPT-4o and Claude 3.7 correctly diagnosed nocardiosis using imaging alone, but incorrectly revised the diagnosis to listeriosis when clinical text emphasizing elderly age, immunosuppression, and Gram-positive bacilli was added. Images reproduced from NEJM Image Challenge, Case ID 20211007 (https://www.nejm.org/image-challenge?ci=20211007), ©Massachusetts Medical Society. Used under fair use for educational purposes.
Table 4 Possible Explanations for Diagnostic Errors After Adding Clinical Text in 13 Image Challenge Cases
Patient ID   Likely Cause of Misdiagnosis with Text
20211007     Clinical context (elderly, immunocompromised, fever/confusion) caused overemphasis on Listeriosis instead of imaging-characteristic Nocardiosis.
20191205     Description of "soft mass increases with crying" in a neonate misled to Prolapsed Uterus; imaging favored Hydrocolpos.
20200206     Non-specific swelling and elderly context led to Carcinoma of the tongue; image was classic for Sublingual epidermoid cyst.
20200305     HIV/immunosuppressed background and "B symptoms" led to DLBCL; imaging supported Disseminated Mycobacterium avium-intracellulare.
20210218     Elderly, weight loss, and abdominal mass misled to Abdominal aortic aneurysm; image showed Urachal mucinous cystic tumor.
20210304     Subacute cough/dyspnea led to Diffuse alveolar hemorrhage; "sandstorm" X-ray supported Pulmonary alveolar microlithiasis.
20210401     Text focus on tick bite, fever, and lymphadenopathy favored RMSF; eschar and lymph nodes fit Tularemia.
20220324     Text mentioned cholesterol emboli, which led to Livedo reticularis; skin pattern was more consistent with Livedo racemosa.
20200220     Clinical context of tongue swelling in elderly led the model to choose Carcinoma; image pointed to Beckwith-Wiedemann syndrome (macroglossia).
20210121     Young adult, fever, and mass symptoms in text led the model to pick Lymphoma; image showed Castleman disease.
20210311     Clinical symptoms (pain/swelling) suggested Abscess; imaging classic for Cysticercosis.
20210506     Textual clues (weight loss, GI symptoms) misled to Colon cancer; image was consistent with GIST (gastrointestinal stromal tumor).
20210520     Middle-aged patient with chronic symptoms led the model to choose Sarcoidosis; imaging was classic for Pulmonary Langerhans cell histiocytosis.
Discussion Our study encompassed a diverse patient population spanning infancy to advanced age with balanced gender distribution (58.5% male, 41.5% female) and broad disease spectrum including infectious, immune-mediated, neoplastic, and genetic conditions. This comprehensive approach offers several clinical advantages over specialized evaluations. Unlike domain-specific studies focusing on single specialties such as neuroradiology or rheumatology, our broad case selection better reflects the diagnostic challenges encountered in general clinical practice where physicians must differentiate among diverse conditions with overlapping presentations. The consistent AI performance across age groups, particularly the perfect accuracy in infants under one year where physicians achieved only 49.6%, suggests robust generalizability across patient demographics—a critical consideration for real-world implementation. The wide disease spectrum evaluation demonstrates that multimodal AI capabilities extend beyond specialty-specific pattern recognition to general diagnostic reasoning, supporting potential applications in primary care and emergency medicine settings where diagnostic breadth rather than depth is often required. Our findings diverge significantly from four recent evaluations using NEJM Image Challenge datasets. Han et al. reported GPT-4V achieving 88.7% accuracy on 348 NEJM cases versus 51.4% for human readers 16 , while Kaczmarczyk et al. found Claude 3 models reaching only 58.8-59.8% accuracy compared to 90.8% collective intelligence 19 , and Suh et al. demonstrated GPT-4o accuracy of 59.6% versus 80.9% for junior faculty radiologists 20 . A rheumatology-focused evaluation showed Claude Sonnet 3.5 achieving 81.2% accuracy in multimodal tasks versus online participants' 51.6% 21 . Our Claude 3.7 achieved 89.0% accuracy against 46.7% physician majority vote. 
These disparities reflect critical methodological differences: human performance benchmarks varied from individual physician responses (Han, our study) to collective intelligence aggregation (Kaczmarczyk) or expert radiologist panels (Suh), representing fundamentally different clinical scenarios. Model generation advances likely contributed, as newer versions (Claude 3.7, GPT-4V) consistently outperformed earlier iterations, while dataset composition and evaluation periods differed across studies (Han: 2017-2023; Kaczmarczyk: 2005-2023; Suh: 2005-2024; ours: 2009-2025). The Kaczmarczyk collective intelligence benchmark at 90.8%, though statistically robust, represents an idealized scenario unattainable in clinical practice where individual physicians make diagnostic decisions, explaining apparent AI superiority in studies using realistic individual physician baselines. Our findings contrast with Le Guellec et al.'s neuroradiology evaluation, where radiologists outperformed GPT-4o and Gemini 1.5 Pro in complete cases (48.0% vs 34.0%) with AI models showing minimal multimodal benefit, unlike our substantial 28-42 percentage point improvements and overall, AI superiority (Claude 3.7: 89.0% vs physicians: 46.7%) 22 . These disparities likely reflect domain-specific challenges, as neuroradiology requires specialized expertise in subtle imaging findings—an area where AI models failed in 81-94% of cases—and different human benchmarks (expert radiologists vs general physician majority vote), suggesting AI diagnostic capabilities vary significantly across medical specialties. In approximately 2%–3% of cases, adding clinical text caused models such as GPT-4o and Claude 3.7 to shift from a correct image-based diagnosis to an incorrect one. Our conclusions are based on qualitative review and remain subjective; the specific reasons are unclear. Notably, this pattern was not observed with the Doubao model, and the errors rarely overlapped between GPT-4o and Claude 3.7. 
Most of these cases involved images with highly characteristic findings, where non-specific text may have misled the models, especially those more reliant on textual cues. These observations highlight differences between models and underscore the need for further study.

Several limitations require acknowledgment. Selection bias in educational case collections means that the cases may not reflect the complexity of typical clinical practice. Evaluation using static clinical vignettes differs from dynamic clinical encounters, in which physicians gather additional information and order sequential investigations. We could not assess the potential of AI-physician collaboration or account for real-world time pressures and resource constraints. Additionally, without access to proprietary model training data, possible data contamination cannot be excluded.

Declarations

Ethics
This study used only publicly available Internet data and did not involve human subjects. Therefore, no specific ethical review was required.

Conflict of interest
The authors declare no competing financial or non-financial interests. None of the authors has any financial, professional, or personal relationship with OpenAI (GPT-4o), Anthropic (Claude 3.7 Sonnet), or ByteDance (Doubao), or any involvement in the development, marketing, or commercial activities of the multimodal LLM platforms evaluated in this study.

Funding
This study was funded by the Health Research Project of the Anhui Province (grant no. AHWJ2023A20456). The funder had no role in the study design, data collection, analysis, interpretation of data, or writing of this manuscript.

Declaration of generative AI in scientific writing
The authors used Claude to translate the paper and Grammarly to correct English grammar when preparing this work. The authors reviewed and edited the content as needed and take full responsibility for the publication.
Data availability
The datasets generated or analysed during the current study are available from the corresponding author on reasonable request.

CRediT authorship contribution statement
Chiyu Sheng: Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing.
Shumin Shen: Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing.
Lin Wang: Methodology, Software, Formal analysis, Data curation, Validation, Writing - review & editing.
Jie Chen: Methodology, Software, Formal analysis, Data curation, Validation, Writing - review & editing.
Wei Chen: Investigation, Validation, Resources, Writing - review & editing.
Nianfei Wang: Conceptualization, Methodology, Validation, Writing - review & editing, Supervision, Project administration, Funding acquisition.
Shanghu Wang: Conceptualization, Resources, Writing - original draft, Supervision, Project administration, Funding acquisition.

References
1. Gunderson CG, Bilan VP, Holleck JL, et al. Prevalence of harmful diagnostic errors in hospitalised adults: a systematic review and meta-analysis. BMJ Qual Saf 2020;29(12):1008–18.
2. Singh H, Meyer AND, Thomas EJ. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual Saf 2014;23(9):727–31.
3. Faye F, Crocione C, Anido de Peña R, et al. Time to diagnosis and determinants of diagnostic delays of people living with a rare disease: Results of a rare barometer retrospective patient survey. Eur J Hum Genet 2024;32(9):1116–26.
4. Nguengang Wakap S, Lambert DM, Olry A, et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet 2020;28(2):165–73.
5. Schubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board–Style Examination. JAMA Netw Open 2023;6(12):e2346721.
6. Beam K, Sharma P, Kumar B, et al. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA Pediatr 2023;177(9):977.
7. Longwell JB, Hirsch I, Binder F, et al. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024;7(6):e2417641.
8. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems [Internet]. 2023 [cited 2025 May 28]; Available from: http://arxiv.org/abs/2303.13375
9. Bicknell BT, Butler D, Whalen S, et al. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR Med Educ 2024;10:e63430.
10. Takita H, Kabata D, Walston SL, et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit Med 2025;8(1):175.
11. Ferber D, Wölflein G, Wiest IC, et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat Commun 2024;15(1):10104.
12. Zhang D, Yu Y, Dong J, et al. MM-LLMs: Recent Advances in MultiModal Large Language Models [Internet]. 2024 [cited 2025 May 28]; Available from: http://arxiv.org/abs/2401.13601
13. Nishino M, Ballard DH. Multimodal Large Language Models to Solve Image-based Diagnostic Challenges: The Next Big Wave is Already Here. Radiology 2024;312(1):e241379.
14. Bradshaw TJ, Tie X, Warner J, et al. Large Language Models and Large Multimodal Models in Medical Imaging: A Primer for Physicians. J Nucl Med 2025;66(2):173–82.
15. Zhou J, He X, Sun L, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat Commun 2024;15(1):5649.
16. Han T, Adams LC, Bressem KK, et al. Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions. JAMA 2024;331(15):1320.
17. Huppertz MS, Siepmann R, Topp D, et al. Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur Radiol 2025;35(3):1111–21.
18. Liu F, Zhu T, Wu X, et al. A medical multimodal large language model for future pandemics. npj Digit Med 2023;6(1):1–15.
19. Kaczmarczyk R, Wilhelm TI, Martin R, Roos J. Evaluating multimodal AI in medical diagnostics. npj Digit Med 2024;7(1):205.
20. Suh PS, Shim WH, Suh CH, et al. Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs. Radiology 2024;313(3):e241668.
21. Omar M, Agbareia R, Klang E, et al. Large Language Models in Rheumatologic Diagnosis: A Multimodal Performance Analysis. J Rheumatol 2025;52(2):jrheum.2024-0975.
22. Le Guellec B, Bruge C, Chalhoub N, et al. Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images. Diagnostic and Interventional Imaging 2025;S2211-5684(25)96-8.

Additional Declarations
No competing interests reported.

Version 1 editorial timeline
Editorial decision: Revision requested: 19 Dec, 2025
Reviews received at journal: 12 Dec, 2025
Reviews received at journal: 11 Dec, 2025
Reviewers agreed at journal: 11 Dec, 2025
Reviewers agreed at journal: 04 Dec, 2025
Reviewers invited by journal: 04 Dec, 2025
Editor invited by journal: 28 Nov, 2025
Editor assigned by journal: 08 Nov, 2025
Submission checks completed at journal: 08 Nov, 2025
First submitted to journal: 04 Nov, 2025
16:08:04","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2373406,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8028355/v1/0f3c65b2-ffa9-47ad-bd14-755bad51830b.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Multimodal Large Language Models Challenge NEJM Image Challenge","fulltext":[{"header":"Introduction","content":"\u003cp\u003eAccurate diagnosis is fundamental to effective medical treatment, yet diagnostic errors affect millions of patients annually. A meta-analysis of 22 studies involving 80,026 hospitalized patients found a harmful diagnostic error rate of 0.7% (95% CI, 0.5%-1.1%), translating to approximately 249,900 harmful diagnostic errors annually in the United States\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. The burden is substantially higher in outpatient settings, where diagnostic errors affect 5.08% of adults\u0026mdash;approximately 12\u0026nbsp;million Americans each year\u0026mdash;with at least 1 in 20 adults experiencing a diagnostic error and half of these errors potentially causing harm\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. This challenge is magnified for rare diseases, where patients endure a median diagnostic delay of 4.7 years and 40% receive multiple incorrect diagnoses before the correct one is identified\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. 
For the estimated 300\u0026nbsp;million individuals worldwide affected by rare diseases, the scarcity of specialized expertise and the tendency of rare conditions to mimic common diseases create particularly formidable diagnostic obstacles\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eRecent advances in large language models (LLMs) have demonstrated remarkable performance on standardized medical examinations, suggesting potential diagnostic capabilities\u003csup\u003e\u003cspan additionalcitationids=\"CR6\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. GPT-4 exceeded the United States Medical Licensing Examination passing threshold by more than 20 points, achieving 86.5% accuracy on complex clinical questions\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Med-PaLM 2 reached 86.5% accuracy on medical question-answering datasets, with physician evaluators preferring its responses over human physicians' on eight of nine clinical utility metrics\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. These achievements indicate that current LLMs possess substantial medical knowledge that could theoretically be applied to diagnostic challenges. However, translating examination performance into clinical utility reveals critical limitations. 
A 2024 systematic review found that while AI models achieved diagnostic accuracy comparable to non-expert physicians (52.1%), they remained significantly inferior to specialists by 15.8% (P\u0026thinsp;=\u0026thinsp;0.007)\u003csup\u003e10\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eThe recent development of multimodal LLMs capable of processing both images and text simultaneously now enables direct comparison of AI and physician diagnostic performance in a format that mirrors clinical practice\u003csup\u003e\u003cspan additionalcitationids=\"CR12 CR13\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. In a clinical evaluation of 150 dermatological cases, SkinGPT-4, a multimodal diagnostic system, achieved 80.63% accuracy validated by board-certified dermatologists\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Han et al. demonstrated that GPT-4V achieved superior diagnostic accuracy compared with unimodal predecessors (GPT-4, GPT-3.5) and contemporary models (Gemini Pro, Llama 2, Med42) across both JAMA Clinical Challenge and NEJM Image Challenge datasets, establishing that multimodal capabilities enable medical image interpretation without specialized fine-tuning\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. However, GPT-4V demonstrated poor diagnostic performance in radiological contexts, achieving accuracy rates of only 8% without clinical context and 29% with contextualization when requiring the most likely diagnosis\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. Fenglin Liu et al. 
developed Med-MLLM, a medical multimodal large language model evaluated across five COVID-19 datasets, demonstrating superior performance in COVID-19 reporting, diagnosis, and prognosis tasks even with minimal labeled data (1%)\u003csup\u003e18\u003c/sup\u003e. In COVID-19 diagnostic image-text classification tasks, the model achieved 90.3% diagnostic accuracy (AUC) when trained with complete datasets, indicating high efficiency and accuracy for rare disease diagnosis.\u003c/p\u003e\u003cp\u003eGiven these limitations in image-only interpretation, we sought to evaluate the diagnostic performance of three state-of-the-art multimodal LLMs\u0026mdash;GPT-4o, Claude 3.7, and Doubao\u0026mdash;using both image-alone and image-plus-text modalities across NEJM Image Challenge cases, comparing their accuracy against global physician performance to determine the clinical potential of AI-assisted rare disease diagnosis.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Data Sources\u003c/h2\u003e\u003cp\u003eWe analyzed 272 consecutive cases from the NEJM Image Challenge published between June 27, 2009, and March 27, 2025. Cases were included if they contained both clinical images and text descriptions. Cases with images only or those flagged for content violations during testing were excluded.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Model Selection and Testing\u003c/h2\u003e\u003cp\u003eThree publicly available multimodal large language models were evaluated: GPT-4o (OpenAI), Claude 3.7 (Anthropic), and Doubao (ByteDance). Each model was tested through its official web interface using standardized prompts instructing selection of the correct answer from five options with supporting rationale.\u003c/p\u003e\u003cp\u003eModels underwent two-phase testing for each case: image-only followed by multimodal (image plus text). 
We recorded the selected answer, diagnostic choice, and complete model response. No fine-tuning or repeat querying was performed. For physician-model comparisons, we used multimodal LLM results, as NEJM respondents had access to both images and text.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3 Physician Benchmark\u003c/h2\u003e\u003cp\u003ePhysician performance data were obtained from NEJM's published results, comprising 16,401,888 responses (mean, 60,301 physicians per case; range, 12,066\u0026ndash;185,210). Physician accuracy was defined as the proportion selecting the correct diagnosis.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.4 Statistical Analysis\u003c/h2\u003e\u003cp\u003eThe primary outcome was diagnostic accuracy. We compared model and physician performance using McNemar's test and assessed concordance with Cohen's kappa. Secondary analyses included performance stratification by physician consensus (\u0026lt;\u0026thinsp;40%, 40\u0026ndash;69%, \u0026ge;\u0026thinsp;70%), disease category, imaging modality, age group (\u0026lt;\u0026thinsp;1, 1\u0026ndash;12, \u0026gt;\u0026thinsp;12 years), and sex.\u003c/p\u003e\u003cp\u003eModel sensitivity was calculated for cases where physician accuracy was \u0026lt;\u0026thinsp;50% or \u0026lt;\u0026thinsp;33%. Ensemble performance was evaluated through majority vote. Subgroups with fewer than 5 cases were excluded. Sex equity was defined as accuracy differences\u0026thinsp;\u0026lt;\u0026thinsp;5 percentage points.\u003c/p\u003e\u003cp\u003eStatistical analyses were performed with R version 4.3.0. Two-sided P values\u0026thinsp;\u0026lt;\u0026thinsp;0.05 were considered significant. 
Confidence intervals were calculated using the Wilson method.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e2.5 Ethics\u003c/h2\u003e\u003cp\u003eThis study used publicly available NEJM cases without patient identifiers. Institutional review board approval was not required.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Patient and Case Characteristics\u003c/h2\u003e\u003cp\u003eThe study comprised 272 diagnostic cases from the NEJM Image Challenge. The cohort included 159 male patients (58.5%) and 113 female patients (41.5%). Age distribution spanned from infancy to advanced age: 155 patients (56.9%) were aged 13\u0026ndash;60 years, 81 (29.8%) were older than 60 years, and 36 (13.3%) were younger than 13 years.\u003c/p\u003e\u003cp\u003eInfectious diseases accounted for 70 cases (25.7%), immune-mediated diseases for 47 (17.3%), and neoplastic diseases for 38 (14.0%). The remaining cases were distributed among genetic disorders, vascular diseases, metabolic conditions, trauma-related pathology, drug-induced diseases, and degenerative disorders. Physical examination findings constituted 142 images (52.2%), radiologic studies 65 (23.9%), and the remainder included combination images, pathologic specimens, endoscopic findings, and electrocardiographic tracings.\u003c/p\u003e\u003cp\u003ePhysician participation averaged 60,301 per case (range, 12,066 to 185,210), totaling 16,401,888 individual responses. Mean physician diagnostic accuracy was 50.1% (SD, 11.8%; range, 26% to 88%). 
This variation in physician performance across cases provided a robust benchmark for evaluating model performance across different levels of diagnostic complexity.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eCharacteristics of Patients, Cases, and Physician Performance.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePatient and Case Characteristics\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003e(N\u0026thinsp;=\u0026thinsp;272)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDemographic Characteristics\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cem\u003eSex \u0026mdash; no. 
(%)\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFemale\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e113 (41.5)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMale\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e159 (58.5)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cem\u003eAge Distribution \u0026mdash; no. (%)\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026lt;1 yr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e16 (5.9)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e1\u0026ndash;12 yr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e20 (7.4)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e13\u0026ndash;40 yr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e82 (30.1)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e41\u0026ndash;60 yr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e73 (26.8)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026gt;60 yr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e81 (29.8)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eClinical Characteristics\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cem\u003eDisease Classification \u0026mdash; no. (%)\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eInfectious diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e70 (25.7)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eImmune-mediated diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e47 (17.3)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNeoplastic diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e38 (14.0)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGenetic/congenital diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e23 (8.5)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eVascular diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e23 (8.5)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMetabolic/nutritional diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e21 (7.7)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTraumatic/physical diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003e19 (7.0)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDrug/toxin-related diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e18 (6.6)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDegenerative/functional diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e12 (4.4)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEctopic diseases\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1 (0.4)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cem\u003eImage Type \u0026mdash; no. (%)\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePhysical signs\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e142 (52.2)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRadiological\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e65 (23.9)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCombination\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e29 (10.7)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePathological\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e16 (5.9)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eOther\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e9 (3.3)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEndoscopic\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e8 (2.9)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eElectrocardiographic\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3 (1.1)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ePhysician Assessment Performance\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cem\u003eMean accuracy\u0026thinsp;\u0026plusmn;\u0026thinsp;SD (%)\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cem\u003e50.1\u0026thinsp;\u0026plusmn;\u0026thinsp;11.8\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAccuracy range (%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e26\u0026ndash;88\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePhysician participants per case, mean\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e60,301\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePhysician participants range\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003e12,066\u0026ndash;185,210\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Diagnostic Accuracy of LLMs versus Physicians\u003c/h2\u003e\u003cdiv id=\"Sec11\" class=\"Section3\"\u003e\u003ch2\u003e3.2.1 Diagnostic Performance in Multimodal Testing\u003c/h2\u003e\u003cp\u003eIn the multimodal evaluation of 272 clinical cases, all three large language models significantly outperformed physicians (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Claude 3.7 and GPT-4o achieved comparable diagnostic accuracy rates of approximately 90%. The absolute difference in accuracy between these models and physician majority vote exceeded 40 percentage points (P\u0026thinsp;\u0026lt;\u0026thinsp;0.001 for all comparisons). Doubao, though less accurate than Claude 3.7 and GPT-4o, also significantly outperformed the physician benchmark (P\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. Diagnostic Accuracy of Large Language Models versus Physicians in Multimodal Testing.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eBar graph shows the diagnostic accuracy of three large language models (Claude 3.7, GPT-4o, and Doubao) compared with physician majority vote for 272 multimodal clinical cases from the NEJM Image Challenge. Error bars indicate 95% confidence intervals calculated with the Wilson method. The dashed horizontal line represents chance performance (50%). All models significantly outperformed physicians (P\u0026thinsp;\u0026lt;\u0026thinsp;0.001 for all comparisons, McNemar's test). 
*** P\u0026thinsp;\u0026lt;\u0026thinsp;0.001.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section3\"\u003e\u003ch2\u003e3.2.2 Performance Stratified by Case Difficulty\u003c/h2\u003e\u003cp\u003eLarge language model performance remained superior across all levels of diagnostic difficulty, as stratified by physician consensus (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). All models achieved high accuracy in cases with strong physician agreement (\u0026ge;\u0026thinsp;70% consensus). In cases with low physician consensus (\u0026lt;\u0026thinsp;40% correct), where diagnostic uncertainty was greatest, Claude 3.7 maintained 86.5% accuracy, compared with mean physician accuracy of 33.4%.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eLine graph displays diagnostic accuracy across three levels of physician consensus. Low consensus indicates cases where fewer than 40% of physicians selected the correct diagnosis (n\u0026thinsp;=\u0026thinsp;52); moderate consensus, 40% to 69% correct (n\u0026thinsp;=\u0026thinsp;201); and high consensus, 70% or more correct (n\u0026thinsp;=\u0026thinsp;19). Background shading corresponds to consensus levels (red, low; yellow, moderate; green, high). All language models maintained high accuracy (\u0026gt;\u0026thinsp;78%) even in low-consensus cases, whereas physician accuracy increased from 33.4% in low-consensus to 77.3% in high-consensus cases. The dashed horizontal line indicates chance performance.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section3\"\u003e\u003ch2\u003e3.2.3 Robustness of Findings\u003c/h2\u003e\u003cp\u003eThe analysis included 16,401,888 physician responses, with participation ranging from 12,066 to 185,210 physicians per case. Weighted analyses accounting for differential participation rates yielded results identical to those of unweighted analyses. 
Effect sizes were large for both Claude 3.7 (Cohen's h\u0026thinsp;=\u0026thinsp;0.96) and GPT-4o (Cohen's h\u0026thinsp;=\u0026thinsp;0.95). These findings remained consistent across all analytical approaches.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Diagnostic Concordance and Complementarity\u003c/h2\u003e\u003cdiv id=\"Sec15\" class=\"Section3\"\u003e\u003ch2\u003e3.3.1 Performance Independence from Case Difficulty\u003c/h2\u003e\u003cp\u003eLarge language model performance showed minimal correlation with physician consensus levels (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Physician accuracy ranged from 26% to 88% across cases. In contrast, Claude 3.7 and GPT-4o maintained high accuracy regardless of case difficulty. In cases where fewer than 40% of physicians were correct, Claude 3.7 achieved 86.5% accuracy and GPT-4o achieved 78.8% accuracy. Doubao showed greater variation (46.2% accuracy in low-consensus cases vs. 100% in high-consensus cases).\u003c/p\u003e\u003cp\u003eBubble plot analysis (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e) confirmed these patterns. More than 98% of cases for Claude 3.7 and GPT-4o fell above the line of equal performance with physicians, compared with 90.8% for Doubao. The concentration of large bubbles in the upper regions indicated consistent model superiority across all difficulty levels.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eIndividual diagnostic outcomes for three large language models are shown for 272 cases. Points represent correct (1) or incorrect (0) diagnoses as a function of the proportion of physicians selecting the correct answer. Vertical jittering prevents overlap. Background shading indicates physician consensus levels: red (\u0026lt;\u0026thinsp;40% correct), yellow (40\u0026ndash;69% correct), and green (\u0026ge;\u0026thinsp;70% correct). 
Smooth curves were fitted with locally weighted regression (LOESS) with 95% confidence intervals (shaded areas). GPT-4o and Claude 3.7 maintained consistent performance across all difficulty levels, whereas Doubao showed greater sensitivity to case difficulty. Numbers of cases: low consensus, 52; moderate consensus, 201; and high consensus, 19.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eDiagnostic accuracy of three large language models is plotted against physician accuracy for 272 cases. Bubble size is proportional to the number of cases at each accuracy level. The diagonal line represents equal performance. Smooth curves show locally weighted regression weighted by case frequency. Cases above the diagonal line indicate superior model performance: GPT-4o, 269 of 272 (98.9%); Claude 3.7, 267 of 272 (98.2%); and Doubao, 247 of 272 (90.8%).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section3\"\u003e\u003ch2\u003e3.3.2 \u003cb\u003eDiagnostic Agreement and Complementarity\u003c/b\u003e\u003c/h2\u003e\u003cp\u003eAgreement between large language models and physicians was low despite high model accuracy (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Cohen's kappa values were 0.08 (95% CI, -0.04 to 0.19) for GPT-4o, 0.08 (95% CI, -0.03 to 0.20) for Claude 3.7, and 0.24 (95% CI, 0.13 to 0.35) for Doubao. The combination of low kappa values and high model accuracy suggested different diagnostic reasoning pathways.\u003c/p\u003e\u003cp\u003eWhen physician majority vote was incorrect (\u0026lt;\u0026thinsp;50% accuracy), GPT-4o and Claude 3.7 correctly diagnosed 84.8% of cases, and Doubao diagnosed 59.3% correctly. 
Among the 8 cases with physician accuracy below 33%, GPT-4o and Claude 3.7 maintained 62.5% accuracy.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAgreement and Complementarity Between Large Language Models and Physicians in Clinical Diagnosis.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eKappa\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003e95% CI\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eSensitivity (\u0026lt;\u0026thinsp;50%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSensitivity (\u0026lt;\u0026thinsp;33%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eSpecificity\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eGPT-4o\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e\u003cp\u003e0.08\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e(-0.04 to 0.19)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e84.8%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e62.5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e89.5%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eClaude 3.7\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.08\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e(-0.03 to 0.20)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e84.8%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e62.5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e94.7%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eDoubao\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.24\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e(0.13 to 0.35)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e59.3%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e37.5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e100.0%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eCohen's κ values measure agreement between each model and physician majority vote 
(\u0026ge;\u0026thinsp;50% of physicians correct) beyond chance; values near 0 indicate no better than random agreement. Sensitivity indicates model accuracy when physician accuracy was \u0026lt;\u0026thinsp;50% (145 cases) or \u0026lt;\u0026thinsp;33% (8 cases). Specificity indicates model accuracy when physician accuracy was \u0026ge;\u0026thinsp;70% (19 cases). CI denotes confidence interval.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section3\"\u003e\u003ch2\u003e3.3.3 \u003cb\u003ePatterns of Concordance and Discordance\u003c/b\u003e\u003c/h2\u003e\u003cp\u003eConfusion matrices revealed asymmetric agreement patterns (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). For Claude 3.7, model success with physician failure occurred in 123 cases (45.2%), mutual success in 119 cases (43.8%), mutual failure in 22 cases (8.1%), and physician success with model failure in 8 cases (2.9%). This yielded a 15.4:1 ratio of model-advantage to physician-advantage cases. GPT-4o showed similar patterns. Models excelled particularly in cases with low physician consensus.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eConfusion matrices compare diagnostic outcomes between each model and physician majority vote (\u0026ge;\u0026thinsp;50% correct) for 272 cases. Values show the number of cases with percentages in parentheses. Shading intensity corresponds to percentage. The ratio of model-correct/physician-incorrect to physician-correct/model-incorrect cases was 11:1 for GPT-4o, 15.4:1 for Claude 3.7, and 4:1 for Doubao.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section3\"\u003e\u003ch2\u003e3.3.4 \u003cb\u003eEnsemble Performance\u003c/b\u003e\u003c/h2\u003e\u003cp\u003eAll three models agreed on the correct diagnosis in 171 cases (62.9%). At least one model was correct in 262 cases (96.3%), and all models were incorrect in 10 cases (3.7%). 
When physician majority vote was included, complete diagnostic failure (all models and physicians incorrect) occurred in 9 cases (3.3%).\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Performance Across Clinical Contexts\u003c/h2\u003e\u003cdiv id=\"Sec20\" class=\"Section3\"\u003e\u003ch2\u003e3.4.1 Disease Category Analysis\u003c/h2\u003e\u003cp\u003eModel accuracy varied by disease category (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Claude 3.7 achieved 100% accuracy in drug- and toxin-related diseases (18 of 18 cases), 95.7% in immune-mediated diseases (45 of 47 cases), and 95.7% in genetic disorders (22 of 23 cases). GPT-4o achieved 97.9% accuracy in immune-mediated diseases (46 of 47 cases). All models had lower accuracy for traumatic diseases (Claude 3.7, 73.7% [14 of 19 cases]; GPT-4o, 78.9% [15 of 19 cases]; Doubao, 63.2% [12 of 19 cases]).\u003c/p\u003e\u003cp\u003eThe largest performance gaps between models and physicians occurred in drug-related diseases (Claude 3.7, 100% vs. physicians, 49.3%; difference, 50.7 percentage points) and genetic disorders (Claude 3.7, 95.7% vs. physicians, 45.5%; difference, 50.2 percentage points). The smallest gap occurred in vascular diseases (physicians, 52.0%; Doubao and Claude 3.7, 78.3%).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eHeat map showing the diagnostic accuracy (percentage of correct diagnoses) of three large language models (Claude 3.7, GPT-4o, and Doubao) and physicians across nine disease categories. Values in each cell represent the percentage accuracy for that model-disease combination. Darker blue shading indicates higher accuracy. Disease categories are ordered by overall performance. 
Analysis includes 272 clinical cases with text and images.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec21\" class=\"Section3\"\u003e\u003ch2\u003e3.4.2 Performance by Image Type\u003c/h2\u003e\u003cp\u003eModel accuracy varied by imaging modality (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). Claude 3.7 and GPT-4o achieved 100% accuracy with endoscopic images (8 of 8 cases), 96.6% with combination images (28 of 29 cases), and high accuracy with pathological specimens (Claude 3.7, 93.8% [15 of 16 cases]; GPT-4o, 100% [16 of 16 cases]).\u003c/p\u003e\u003cp\u003eFor physical signs (142 cases), Claude 3.7 achieved 91.5% accuracy and GPT-4o achieved 89.4% accuracy, compared with 49.1% for physicians. With radiological images (65 cases), accuracy was 81.5% for Claude 3.7 and 84.6% for GPT-4o, compared with 53.5% for physicians.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eBar graph comparing the diagnostic accuracy of three large language models and physicians across different types of medical images. Error bars represent 95% confidence intervals. Analysis includes 272 clinical cases with both images and text.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec22\" class=\"Section3\"\u003e\u003ch2\u003e3.4.3 Performance by Age and Sex\u003c/h2\u003e\u003cp\u003eClaude 3.7 achieved 100% accuracy in infants younger than 1 year (16 of 16 cases), compared with 49.6% for physicians. In children 1 to 12 years of age, GPT-4o achieved 95.0% accuracy (19 of 20 cases) and Claude 3.7 achieved 80.0% accuracy (16 of 20 cases). Among patients older than 12 years (86.8% of the cohort), model performance approximated overall averages (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eSex-based differences in accuracy were minimal. The largest difference was 8.8 percentage points for Doubao (females, 76.1%; males, 67.3%). 
Differences were 1.3 percentage points for GPT-4o and 0.7 percentage points for both Claude 3.7 and physicians.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAge and Sex-Stratified Diagnostic Performance.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCategory\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSubgroup\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eN\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eClaude 3.7\u003c/p\u003e\u003cp\u003e(%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eGPT-4o (%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eDoubao\u003c/p\u003e\u003cp\u003e(%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003ePhysician 
(%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u003cb\u003eAge Group\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eInfant (\u0026lt;\u0026thinsp;1 year)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e100.0 (100.0-100.0)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e87.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e75.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e49.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePediatric (1\u0026ndash;12 years)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e80.0 (62.5\u0026ndash;97.5)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e95.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e75.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e50.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAdult (\u0026gt;\u0026thinsp;12 years)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e236\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e89.0 (85.0\u0026ndash;93.0)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e\u003cp\u003e88.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e70.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e50.1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e\u003cb\u003eSex\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFemale\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e113\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e89.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e89.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e76.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e49.7\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMale\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e159\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e88.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e88.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e67.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e50.4\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eValues are diagnostic accuracy percentages. Confidence intervals (95%) for Claude 3.7 were calculated with the Wilson method. Physician values represent the mean proportion selecting the correct diagnosis. 
Age groups: infant (\u0026lt;\u0026thinsp;1 year), pediatric (1\u0026ndash;12 years), and adult (\u0026gt;\u0026thinsp;12 years).\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section2\"\u003e\u003ch2\u003e3.5 Multimodal Performance Enhancement\u003c/h2\u003e\u003cdiv id=\"Sec24\" class=\"Section3\"\u003e\u003ch2\u003e3.5.1 Overall Performance\u003c/h2\u003e\u003cp\u003eAdding clinical text to images improved diagnostic accuracy for all models (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e). Accuracy increased from 47.1% to 89.0% for Claude 3.7 (difference, 41.9 percentage points), from 58.8% to 88.6% for GPT-4o (difference, 29.8 percentage points), and from 42.6% to 71.0% for Doubao (difference, 28.3 percentage points). All differences were significant (P\u0026thinsp;\u0026lt;\u0026thinsp;0.001 by McNemar's test).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eDiagnostic accuracy of three large language models with images alone (gray bars) and with images plus clinical text (green bars) for 272 cases. Horizontal brackets indicate pairwise comparisons (McNemar's test). Absolute improvements in accuracy: Claude 3.7, 41.9 percentage points (from 47.1% to 89.0%); GPT-4o, 29.8 percentage points (from 58.8% to 88.6%); and Doubao, 28.3 percentage points (from 42.6% to 71.0%). Error bars represent 95% confidence intervals. *** P\u0026thinsp;\u0026lt;\u0026thinsp;0.001.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec25\" class=\"Section3\"\u003e\u003ch2\u003e3.5.2 Individual Case Patterns\u003c/h2\u003e\u003cp\u003eAmong 272 cases, diagnostic outcomes after adding clinical text were as follows (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e): For Claude 3.7, accuracy improved in 120 cases (44.1%) and remained unchanged in 146 cases (53.7%). For GPT-4o, accuracy improved in 89 cases (32.7%) and remained unchanged in 175 cases (64.3%). 
Doubao showed improvement in 77 cases (28.3%) and unchanged accuracy in 195 cases (71.7%).\u003c/p\u003e\u003cp\u003eUnexpectedly, the addition of clinical text led to diagnostic errors in previously correct cases for GPT-4o (8 cases, 2.9%) and Claude 3.7 (6 cases, 2.2%), whereas Doubao showed no such deterioration. Case 20211007 was the only case in which both GPT-4o and Claude 3.7 changed from a correct to an incorrect diagnosis after the text was added (Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eWe extracted the response content from both iterations for all 13 cases in which a model produced an erroneous diagnosis after the clinical text was added. By analyzing the models' \"reasoning\" processes, we identified potential causes underlying this phenomenon (Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eWaterfall plots show individual patient outcomes for 272 cases when clinical text was added to image-based diagnosis. Each vertical bar represents one patient, ordered by identification number. Bar height indicates change in diagnostic outcome: +100% (green), improvement from incorrect to correct; 0% (gray), unchanged; -100% (red), deterioration from correct to incorrect. Panels show results for GPT-4o (top), Claude 3.7 (middle), and Doubao (bottom). Numbers below each panel indicate cases in each category. Claude 3.7: 120 improved (44.1%), 146 unchanged (53.7%), 6 deteriorated (2.2%). GPT-4o: 89 improved (32.7%), 175 unchanged (64.3%), 8 deteriorated (2.9%). Doubao: 77 improved (28.3%), 195 unchanged (71.7%), 0 deteriorated.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003ePatient identifiers indicate NEJM publication dates (YYYYMMDD). Numbers in parentheses show cases in each category. 
Red indicates the single case (20211007) where both GPT-4o and Claude 3.7 deteriorated.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eMRI showing multifocal ring-enhancing brain lesions (left) and microscopy demonstrating filamentous branching Gram-positive rods (right), suggestive of nocardiosis. Both GPT-4o and Claude 3.7 correctly diagnosed nocardiosis using imaging alone but incorrectly revised the diagnosis to listeriosis when clinical text emphasizing elderly age, immunosuppression, and Gram-positive bacilli was added. Images reproduced from NEJM Image Challenge, Case ID 20211007 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.nejm.org/image-challenge?ci=20211007\u003c/span\u003e\u003cspan address=\"https://www.nejm.org/image-challenge?ci=20211007\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), \u0026copy; Massachusetts Medical Society. Used under fair use for educational purposes.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePossible Explanations for Diagnostic Errors After Adding Clinical Text in 13 Image Challenge Cases\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePatient ID\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eURL\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eLikely Cause of 
Misdiagnosis with Text\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20211007\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eClinical context (elderly, immunocompromised, fever/confusion) caused overemphasis on \u003cb\u003eListeriosis\u003c/b\u003e instead of imaging-characteristic \u003cb\u003eNocardiosis\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20191205\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDescription of \u0026ldquo;soft mass increases with crying\u0026rdquo; in a neonate misled to \u003cb\u003eProlapsed Uterus\u003c/b\u003e; imaging favored \u003cb\u003eHydrocolpos\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20200206\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNon-specific swelling and elderly context led to \u003cb\u003eCarcinoma of the tongue\u003c/b\u003e; image was classic for \u003cb\u003eSublingual epidermoid cyst\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20200305\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eHIV/immunosuppressed background and \u0026ldquo;B symptoms\u0026rdquo; led to \u003cb\u003eDLBCL\u003c/b\u003e; imaging supported \u003cb\u003eDisseminated Mycobacterium 
avium-intracellulare\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20210218\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eElderly, weight loss, and abdominal mass misled to \u003cb\u003eAbdominal aortic aneurysm\u003c/b\u003e; image showed \u003cb\u003eUrachal mucinous cystic tumor\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20210304\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSubacute cough/dyspnea led to \u003cb\u003eDiffuse alveolar hemorrhage\u003c/b\u003e; \u0026ldquo;sandstorm\u0026rdquo; X-ray supported \u003cb\u003ePulmonary alveolar microlithiasis\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20210401\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eText focus on tick bite, fever, and lymphadenopathy favored \u003cb\u003eRMSF\u003c/b\u003e; eschar and lymph nodes fit \u003cb\u003eTularemia\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20220324\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eText mentioned cholesterol emboli\u0026mdash;led to \u003cb\u003eLivedo reticularis\u003c/b\u003e; skin pattern was more consistent with \u003cb\u003eLivedo 
racemosa\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20200220\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eClinical context of tongue swelling in elderly\u0026mdash;model chose \u003cb\u003eCarcinoma\u003c/b\u003e; image pointed to \u003cb\u003eBeckwith-Wiedemann syndrome\u003c/b\u003e (macroglossia).\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20210121\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eYoung adult, fever, and mass symptoms in text\u0026mdash;model picked \u003cb\u003eLymphoma\u003c/b\u003e; image showed \u003cb\u003eCastleman disease\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20210311\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eClinical symptoms (pain/swelling) suggested \u003cb\u003eAbscess\u003c/b\u003e; imaging classic for \u003cb\u003eCysticercosis\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20210506\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTextual clues (weight loss, GI symptoms) misled to \u003cb\u003eColon cancer\u003c/b\u003e; image was consistent with \u003cb\u003eGIST (gastrointestinal stromal tumor)\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003e20210520\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLink\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMiddle-aged patient, chronic symptoms\u0026mdash;model chose \u003cb\u003eSarcoidosis\u003c/b\u003e; imaging was classic for \u003cb\u003ePulmonary Langerhans cell histiocytosis\u003c/b\u003e.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eOur study encompassed a diverse patient population spanning infancy to advanced age, with a balanced gender distribution (58.5% male, 41.5% female) and a broad disease spectrum including infectious, immune-mediated, neoplastic, and genetic conditions. This comprehensive approach offers several clinical advantages over specialized evaluations. Unlike domain-specific studies focusing on single specialties such as neuroradiology or rheumatology, our broad case selection better reflects the diagnostic challenges encountered in general clinical practice, where physicians must differentiate among diverse conditions with overlapping presentations. The consistent AI performance across age groups, particularly the perfect accuracy of Claude 3.7 in infants under one year where physicians achieved only 49.6%, suggests robust generalizability across patient demographics, a critical consideration for real-world implementation. The wide disease spectrum evaluation demonstrates that multimodal AI capabilities extend beyond specialty-specific pattern recognition to general diagnostic reasoning, supporting potential applications in primary care and emergency medicine settings where diagnostic breadth rather than depth is often required.\u003c/p\u003e\n\u003cp\u003eOur findings diverge significantly from four recent evaluations using NEJM Image Challenge datasets. Han et al. 
reported GPT-4V achieving 88.7% accuracy on 348 NEJM cases versus 51.4% for human readers\u003csup\u003e16\u003c/sup\u003e, while Kaczmarczyk et al. found Claude 3 models reaching only 58.8-59.8% accuracy compared with 90.8% for collective intelligence\u003csup\u003e19\u003c/sup\u003e, and Suh et al. demonstrated GPT-4o accuracy of 59.6% versus 80.9% for junior faculty radiologists\u003csup\u003e20\u003c/sup\u003e. A rheumatology-focused evaluation showed Claude Sonnet 3.5 achieving 81.2% accuracy in multimodal tasks versus online participants' 51.6%\u003csup\u003e21\u003c/sup\u003e. In our study, Claude 3.7 achieved 89.0% accuracy versus 46.7% for the physician majority vote. These disparities reflect critical methodological differences: human performance benchmarks varied from individual physician responses (Han, our study) to collective intelligence aggregation (Kaczmarczyk) or expert radiologist panels (Suh), representing fundamentally different clinical scenarios. Model generation advances likely contributed, as newer versions (Claude 3.7, GPT-4V) consistently outperformed earlier iterations, while dataset composition and evaluation periods differed across studies (Han: 2017-2023; Kaczmarczyk: 2005-2023; Suh: 2005-2024; ours: 2009-2025). The Kaczmarczyk collective intelligence benchmark at 90.8%, though statistically robust, represents an idealized scenario unattainable in clinical practice where individual physicians make diagnostic decisions, explaining apparent AI superiority in studies using realistic individual physician baselines.\u003c/p\u003e\n\u003cp\u003eOur findings contrast with Le Guellec et al.'s neuroradiology evaluation, in which radiologists outperformed GPT-4o and Gemini 1.5 Pro in complete cases (48.0% vs. 34.0%) and the AI models showed minimal multimodal benefit, unlike the 28-42 percentage point improvements and overall AI superiority in our study (Claude 3.7, 89.0% vs. physicians, 46.7%)\u003csup\u003e22\u003c/sup\u003e. 
These disparities likely reflect domain-specific challenges, as neuroradiology requires specialized expertise in subtle imaging findings (an area where AI models failed in 81-94% of cases), as well as different human benchmarks (expert radiologists vs. general physician majority vote), suggesting that AI diagnostic capabilities vary significantly across medical specialties.\u003c/p\u003e\n\u003cp\u003eIn approximately 2%–3% of cases, adding clinical text caused GPT-4o and Claude 3.7 to shift from a correct image-based diagnosis to an incorrect one. Our explanations for these errors are based on qualitative review and remain subjective; the specific reasons are unclear. Notably, this pattern was not observed with the Doubao model, and the errors rarely overlapped between GPT-4o and Claude 3.7. Most cases involved images with highly characteristic findings, where non-specific text may have misled models, especially those more reliant on textual cues. These observations highlight model differences and underscore the need for further study.\u003c/p\u003e\n\u003cp\u003eSeveral study limitations require acknowledgment. Educational case collections are subject to selection bias and may not reflect the complexity of typical clinical practice. Evaluation using static clinical vignettes differs from dynamic clinical encounters where physicians gather additional information and order sequential investigations. We cannot assess optimal AI-physician collaboration potential or account for real-world time pressures and resource constraints. Additionally, without access to proprietary model training data, possible data contamination cannot be excluded.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study used only publicly available Internet data and did not involve human subjects. 
Therefore, no specific ethical considerations were required in this study.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing financial or non-financial interests. None of the authors has any financial, professional, or personal relationships with OpenAI (GPT-4o), Anthropic (Claude 3.7 Sonnet), ByteDance (Doubao), or any involvement in the development, marketing, or commercial activities of the multimodal LLM platforms evaluated in this study.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was funded by the Health Research Project of Anhui Province (grant no. AHWJ2023A20456). The funder had no role in the study design, data collection, analysis, interpretation of data, or writing of this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of Generative AI in Scientific Writing\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe author used Claude to translate the paper and Grammarly to correct English grammar when preparing this work. 
The author reviewed and edited the content as needed and took full responsibility for the publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated or analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCRediT Authorship Contribution Statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eChiyu Sheng:\u003c/strong\u003e Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review \u0026amp; editing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eShumin Shen:\u003c/strong\u003e Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review \u0026amp; editing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLin Wang:\u003c/strong\u003e Methodology, Software, Formal analysis, Data curation, Validation, Writing - review \u0026amp; editing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJie Chen:\u003c/strong\u003e Methodology, Software, Formal analysis, Data curation, Validation, Writing - review \u0026amp; editing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWei Chen:\u003c/strong\u003e Investigation, Validation, Resources, Writing - review \u0026amp; editing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNianfei Wang:\u003c/strong\u003e Conceptualization, Methodology, Validation, Writing - review \u0026amp; editing, Supervision, Project administration, Funding acquisition.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eShanghu Wang:\u003c/strong\u003e Conceptualization, Resources, Writing - original draft, Supervision, Project administration, Funding acquisition.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eGunderson CG, Bilan VP, Holleck JL, et al. 
Prevalence of harmful diagnostic errors in hospitalised adults: a systematic review and meta-analysis. BMJ Qual Saf 2020;29(12):1008\u0026ndash;18. \u003c/li\u003e\n\u003cli\u003eSingh H, Meyer AND, Thomas EJ. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual Saf 2014;23(9):727\u0026ndash;31. \u003c/li\u003e\n\u003cli\u003eFaye F, Crocione C, Anido de Pe\u0026ntilde;a R, et al. Time to diagnosis and determinants of diagnostic delays of people living with a rare disease: Results of a rare barometer retrospective patient survey. Eur J Hum Genet 2024;32(9):1116\u0026ndash;26. \u003c/li\u003e\n\u003cli\u003eNguengang Wakap S, Lambert DM, Olry A, et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet 2020;28(2):165\u0026ndash;73. \u003c/li\u003e\n\u003cli\u003eSchubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board\u0026ndash;Style Examination. JAMA Netw Open 2023;6(12):e2346721. \u003c/li\u003e\n\u003cli\u003eBeam K, Sharma P, Kumar B, et al. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA Pediatr 2023;177(9):977. \u003c/li\u003e\n\u003cli\u003eLongwell JB, Hirsch I, Binder F, et al. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024;7(6):e2417641. \u003c/li\u003e\n\u003cli\u003eNori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems [Internet]. 2023 [cited 2025 May 28];Available from: http://arxiv.org/abs/2303.13375\u003c/li\u003e\n\u003cli\u003eBicknell BT, Butler D, Whalen S, et al. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR Med Educ 2024;10:e63430. \u003c/li\u003e\n\u003cli\u003eTakita H, Kabata D, Walston SL, et al. 
A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit Med 2025;8(1):175. \u003c/li\u003e\n\u003cli\u003eFerber D, W\u0026ouml;lflein G, Wiest IC, et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat Commun 2024;15(1):10104. \u003c/li\u003e\n\u003cli\u003eZhang D, Yu Y, Dong J, et al. MM-LLMs: Recent Advances in MultiModal Large Language Models [Internet]. 2024 [cited 2025 May 28];Available from: http://arxiv.org/abs/2401.13601\u003c/li\u003e\n\u003cli\u003eNishino M, Ballard DH. Multimodal Large Language Models to Solve Image-based Diagnostic Challenges: The Next Big Wave is Already Here. Radiology 2024;312(1):e241379. \u003c/li\u003e\n\u003cli\u003eBradshaw TJ, Tie X, Warner J, et al. Large Language Models and Large Multimodal Models in Medical Imaging: A Primer for Physicians. J Nucl Med 2025;66(2):173\u0026ndash;82. \u003c/li\u003e\n\u003cli\u003eZhou J, He X, Sun L, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat Commun 2024;15(1):5649. \u003c/li\u003e\n\u003cli\u003eHan T, Adams LC, Bressem KK, et al. Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions. JAMA 2024;331(15):1320. \u003c/li\u003e\n\u003cli\u003eHuppertz MS, Siepmann R, Topp D, et al. Revolution or risk?\u0026mdash;Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur Radiol 2025;35(3):1111\u0026ndash;21. \u003c/li\u003e\n\u003cli\u003eLiu F, Zhu T, Wu X, et al. A medical multimodal large language model for future pandemics. npj Digit Med 2023;6(1):1\u0026ndash;15. \u003c/li\u003e\n\u003cli\u003eKaczmarczyk R, Wilhelm TI, Martin R, Roos J. Evaluating multimodal AI in medical diagnostics. npj Digit Med 2024;7(1):205. \u003c/li\u003e\n\u003cli\u003eSuh PS, Shim WH, Suh CH, et al. 
Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs. Radiology 2024;313(3):e241668. \u003c/li\u003e\n\u003cli\u003eOmar M, Agbareia R, Klang E, et al. Large Language Models in Rheumatologic Diagnosis: A Multimodal Performance Analysis. J Rheumatol 2025;52(2):jrheum.2024-0975. \u003c/li\u003e\n\u003cli\u003eLe Guellec B, Bruge C, Chalhoub N, et al. Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images. Diagnostic and Interventional Imaging 2025;S2211-5684(25)96-8. \u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence, Multimodal Large Language Models, Diagnostic Accuracy, Rare Disease, Medical Imaging","lastPublishedDoi":"10.21203/rs.3.rs-8028355/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8028355/v1","license":{"name":"CC BY 
4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e\u003cp\u003eTheoretically, multimodal large language models better reflect real-world clinical scenarios in disease diagnosis compared to text-only large language models. The New England Journal of Medicine Image Challenge contains real clinical cases with images and textual materials, making it the optimal resource for testing multimodal LLM diagnostic accuracy.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e\u003cp\u003eWe analyzed 272 Image Challenge cases (June 2009 to March 2025) containing both images and clinical text. Three LLMs\u0026mdash;GPT-4o, Claude 3.7, and Doubao\u0026mdash;were evaluated against responses from 16,401,888 physicians worldwide (mean, 60,301 per case). Models were tested with images alone and with combined image-text inputs. The primary outcome was diagnostic accuracy in the multimodal condition.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e\u003cp\u003eAll LLMs significantly outperformed physicians (P\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Diagnostic accuracy in multimodal testing was 89.0% (95% CI, 84.9 to 92.3) with Claude 3.7, 88.6% (95% CI, 84.5 to 92.0) with GPT-4o, and 71.0% (95% CI, 65.3 to 76.2) with Doubao, compared with 46.7% (95% CI, 40.7 to 52.7) for physician majority vote\u0026mdash;an absolute difference exceeding 40 percentage points for top-performing models. In diagnostically challenging cases where fewer than 40% of physicians were correct, Claude 3.7 maintained 86.5% accuracy versus 33.4% for physicians. Despite high accuracy, model-physician concordance was low (Cohen's κ, 0.08 to 0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7. Adding clinical text to images improved accuracy by 28 to 42 percentage points across models. 
At least one model was correct in 96.3% of cases.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e\u003cp\u003eMultimodal testing achieved significantly higher diagnostic accuracy than image-only evaluation and substantially exceeded physician diagnostic performance. High AI accuracy coupled with low physician-AI concordance indicates that multimodal large language models utilize fundamentally different diagnostic reasoning processes. These findings suggest multimodal LLMs may function as valuable diagnostic assistants, augmenting rather than replacing physician clinical decision-making.\u003c/p\u003e","manuscriptTitle":"Multimodal Large Language Models Challenge NEJM Image Challenge","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-08 11:23:34","doi":"10.21203/rs.3.rs-8028355/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-12-19T14:02:08+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-12T13:08:34+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-11T17:34:57+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"279823821037624811700490575262180000207","date":"2025-12-11T16:15:18+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"256539798903293269089643235721441932640","date":"2025-12-04T15:36:34+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-12-04T08:25:34+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-11-28T10:38:29+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-11-08T11:52:40+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-11-08T11:52:05+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific 
Reports","date":"2025-11-04T11:38:15+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"d191b541-66a9-42e9-9add-15be4aa5ddf6","owner":[],"postedDate":"December 8th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":59196236,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":59196237,"name":"Health sciences/Diseases"},{"id":59196238,"name":"Health sciences/Health care"},{"id":59196239,"name":"Physical sciences/Mathematics and computing"},{"id":59196240,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2026-02-16T16:05:07+00:00","versionOfRecord":{"articleIdentity":"rs-8028355","link":"https://doi.org/10.1038/s41598-026-39201-3","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2026-02-10 15:57:19","publishedOnDateReadable":"February 10th, 2026"},"versionCreatedAt":"2025-12-08 
11:23:34","video":"","vorDoi":"10.1038/s41598-026-39201-3","vorDoiUrl":"https://doi.org/10.1038/s41598-026-39201-3","workflowStages":[]},"version":"v1","identity":"rs-8028355","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8028355","identity":"rs-8028355","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}




License: CC-BY-4.0