Impact of Query Language on Large Language Model Performance in Dental Trauma Management: A Comparative Evaluation of ChatGPT, Gemini, and Claude

doi:10.21203/rs.3.rs-8754479/v1


2026 · doi:10.21203/rs.3.rs-8754479/v1

preprint OA: closed CC-BY-4.0



Research Article

Hasan Öz, Mehmet Dundar

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8754479/v1
This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1 posted 11

Abstract

Background
Large language models (LLMs) are increasingly used as clinical decision support tools in healthcare, yet the impact of query language on their performance remains unclear, particularly in specialized domains like dental traumatology. This study evaluated whether LLM performance in dental trauma management differs based on the language of clinical scenarios (English vs. Turkish) and compared performance across three AI models.
Methods
Twenty-seven clinical scenarios covering 13 dental trauma categories were presented to ChatGPT 5.2, Gemini 3.0, and Claude Sonnet 4.5 in both English and Turkish, generating 162 responses. Two blinded endodontists independently evaluated responses using a standardized rubric assessing accuracy (40%), completeness (35%), and safety (25%) against the IADT 2020 Guidelines. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Language effects were analyzed using Wilcoxon signed-rank tests; model comparisons employed Kruskal-Wallis and Mann-Whitney U tests with Bonferroni correction.

Results
Inter-rater reliability was moderate to good across dimensions (ICC: 0.738–0.836). ChatGPT showed the strongest language effect, with 9.14% higher performance in English (p < 0.001, r = 0.874). Gemini showed a moderate English advantage (5.69%, p = 0.003, r = 0.572). Claude exhibited language independence, with virtually identical performance in both languages (-0.02%, p = 0.220). In English, significant model differences emerged (H = 22.31, p < 0.001); however, model performance converged in Turkish (H = 2.89, p = 0.236).

Conclusions
Language-dependent performance variations in LLMs are model-specific rather than universal. While ChatGPT achieved the highest absolute scores, Claude’s language independence may offer more reliable performance in non-English clinical settings. These findings have implications for the deployment of AI in multilingual healthcare environments.

INTRODUCTION
Artificial intelligence has transformed healthcare delivery, with large language models emerging as promising clinical decision support tools [ 1 , 2 ]. Models such as ChatGPT, Gemini, and Claude demonstrate notable capabilities in processing and generating medical text [ 3 ]. These systems show potential across clinical decision support, medical education, patient communication, and diagnostic assistance [ 4 – 6 ].
Their capacity to synthesize medical literature has generated considerable interest in their clinical applications [ 7 , 8 ]. Traumatic dental injuries (TDIs) present a public health challenge requiring rapid clinical decisions. Epidemiological studies indicate that approximately 25% of school-age children experience dental trauma, and 33% of adults have sustained injuries to their permanent teeth, with most injuries occurring before age 19. Luxation injuries predominate in primary dentition while crown fractures are more common in permanent teeth [ 9 ]. The time-sensitive nature of dental trauma management, particularly in avulsion cases where outcomes depend on immediate intervention, makes this domain well-suited for AI-assisted decision support [ 10 , 11 ]. The International Association of Dental Traumatology (IADT) has developed guidelines that serve as the gold standard for TDI management [ 9 , 12 ]. These guidelines cover a spectrum of injury types including enamel-dentin fractures, crown-root fractures, root fractures, concussion, subluxation, extrusive luxation, lateral luxation, intrusive luxation, and avulsion, each requiring distinct diagnostic and treatment approaches [ 12 , 13 ]. The complexity of these guidelines presents challenges for clinical application; treatment decisions depend on multiple interacting factors such as root development stage, extra-alveolar dry time, storage medium, patient age, and injury characteristics [ 13 , 14 ]. For example, management of an avulsed tooth varies based on apex status, extra-oral time duration, and storage conditions [ 14 ]. This multifactorial decision-making framework makes dental trauma appropriate for evaluating AI clinical reasoning capabilities. Research on AI chatbot performance in dental trauma has yielded variable results [ 15 – 22 ]. Earlier studies reported moderate accuracy rates (57.5–76.7%) for models like ChatGPT 3.5 and Google Bard [ 15 , 16 ]. 
Recent evaluations of advanced models show improved performance, with accuracy rates ranging from 85% to 100% [ 17 , 18 , 20 – 22 ]. However, most of these studies used direct questioning approaches where injury types were explicitly stated, which may not adequately assess the diagnostic reasoning capabilities required in clinical practice. A critical consideration is the potential influence of query language on model performance. LLMs are predominantly trained on English-language corpora, which may create performance disparities when queried in other languages [ 23 , 24 ]. Studies investigating this phenomenon in medical contexts have yielded mixed findings [ 25 – 30 ]. Several report significant English superiority in dental and medical examinations [ 25 – 28 ], while others found no significant language effect [ 29 ]. This inconsistency suggests that language effects may vary across clinical domains and warrants further investigation [ 30 ]. Despite growing literature examining AI performance in dentistry, gaps remain in understanding LLM capabilities in dental traumatology. Limited research has systematically compared multiple contemporary LLMs across different languages using clinically relevant scenario-based assessments. Most previous studies employed direct questioning methodologies where the diagnosis is provided, failing to evaluate whether AI systems can independently identify injury types from clinical presentations—a fundamental requirement for clinical decision support. Additionally, evaluation frameworks in prior studies often lacked comprehensive assessment of safety considerations. This study aimed to compare the performance of three large language models (ChatGPT 5.2, Gemini 3.0, and Claude Sonnet 4.5) in responding to scenario-based dental trauma questions in Turkish and English. We hypothesized that AI models would perform better when queried in English compared to Turkish, given the predominance of English-language training data. 
Unlike previous studies that directly specified injury types, this study presented clinical scenarios describing patient symptoms, clinical findings, and radiographic features without explicitly naming the diagnosis. This approach assessed whether LLMs could independently identify traumatic injury types and generate appropriate treatment recommendations aligned with the IADT 2020 Guidelines. Model responses were evaluated across three dimensions: accuracy of the diagnosis and treatment plan, completeness of the recommended management protocol, and safety considerations for patient care.

METHODS

Study Design
This comparative cross-sectional study evaluated the performance of three large language models in responding to scenario-based dental trauma questions presented in Turkish and English. The study employed a standardized evaluation framework with the IADT 2020 guidelines as the reference standard. The study protocol was exempt from ethical review as it involved analysis of AI-generated content without human participant involvement.

AI Models Evaluated
Three contemporary LLMs were selected: ChatGPT 5.2 (OpenAI, San Francisco, CA, USA), Gemini 3.0 (Google, Mountain View, CA, USA), and Claude Sonnet 4.5 (Anthropic, San Francisco, CA, USA). Models were accessed through their respective web-based interfaces between November 28 and December 10, 2025. To ensure reproducibility, a standardized protocol was employed: new accounts were created specifically for this study to eliminate influence from prior usage history; memory and conversation history features were disabled; web search functionality was disabled to ensure responses were generated solely from training data; default temperature settings were used; no system prompts, custom instructions, or pre-configured preferences were applied; and each scenario was submitted in a new, independent conversation session.
Screenshots documenting interface settings and model version information are provided as Supplementary Material 2.

Scenario Development
An initial pool of 30 dental trauma clinical scenarios was developed by the primary investigator based on IADT 2020 guidelines covering traumatic injuries to permanent teeth, avulsion, and injuries in primary dentition. Each scenario described a hypothetical patient presentation including demographic information (age, sex), mechanism of injury, time elapsed since trauma, clinical findings (tooth mobility, displacement, percussion sensitivity, pulp testing results), and radiographic features.

A key feature of this study was the scenario-based approach: unlike previous studies that directly specified injury types, scenarios were presented without explicit diagnostic statements. This methodology assessed whether LLMs could independently identify traumatic injury types from clinical presentations and generate appropriate treatment recommendations, a fundamental requirement for clinical decision support systems.

All scenarios were initially developed in Turkish by the primary investigator with native-level proficiency in both languages. Forward translation into English was performed by the primary investigator. To ensure linguistic equivalence, a modified back-translation procedure was employed: an independent bilingual dental professional translated the English scenarios back into Turkish, and back-translated versions were compared with original Turkish scenarios by a second bilingual evaluator. Discrepancies were resolved through consensus discussion, and minor modifications were made to three scenarios to achieve semantic equivalence. Both reviewers confirmed that final translated scenarios maintained equivalent clinical meaning, difficulty level, and diagnostic complexity across languages.
The final sample of 27 scenarios was determined based on comprehensive coverage of all major dental trauma categories defined in IADT 2020 guidelines. Scenarios encompassed 13 distinct trauma categories: (1) enamel infraction, (2) enamel fracture, (3) enamel-dentin fracture, (4) enamel-dentin-pulp fracture, (5) crown-root fracture, (6) root fracture, (7) alveolar fracture, (8) concussion, (9) subluxation, (10) extrusive luxation, (11) lateral luxation, (12) intrusive luxation, and (13) avulsion. Scenarios covered both permanent and primary dentition with varying clinical complexity levels, including patients of different ages and diverse extra-oral time intervals for avulsion cases, ensuring each category was represented by at least two scenarios. While formal power analysis was not performed a priori given the exploratory nature of this research, the sample size is comparable to similar LLM evaluation studies in dental traumatology (range: 15–40 scenarios in published literature).

Data Collection
Data collection was conducted between November 28 and December 10, 2025. Each scenario was submitted to all three LLMs in both Turkish and English, generating a total of 162 AI responses (27 scenarios × 3 models × 2 languages). To minimize potential carryover effects, each query was submitted in a new conversation session. Responses were collected and stored verbatim for subsequent evaluation. Model outputs were de-identified and randomized prior to evaluation to ensure assessor blinding.

The 12-day data collection period was selected to balance practical constraints with methodological rigor. To verify model stability during this period, we confirmed that no major model updates were announced by any provider during the data collection window. Additionally, spot checks were conducted by re-submitting a subset of five scenarios at the beginning and end of the collection period; response patterns remained consistent, suggesting model stability throughout the study.
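The fully crossed design and the blinding step described above can be illustrated with a short sketch. The paper does not describe how de-identification or randomization was implemented, so the blinded ID format (`R001`, ...), the fixed seed, and the function name below are hypothetical choices made only for illustration.

```python
import itertools
import random

SCENARIOS = range(1, 28)                      # 27 clinical scenarios
MODELS = ("ChatGPT", "Gemini", "Claude")      # 3 models
LANGUAGES = ("English", "Turkish")            # 2 query languages


def build_blinded_queue(seed=0):
    """Return (queue, key) for a blinded rating session.

    queue: shuffled list of opaque response IDs shown to the raters.
    key:   mapping from each ID back to (scenario, model, language),
           kept away from the raters until scoring is complete.
    """
    cells = list(itertools.product(SCENARIOS, MODELS, LANGUAGES))
    rng = random.Random(seed)                 # hypothetical fixed seed
    rng.shuffle(cells)                        # randomize presentation order
    key = {f"R{i:03d}": cell for i, cell in enumerate(cells, start=1)}
    return list(key), key


queue, key = build_blinded_queue()
print(len(queue))  # 27 x 3 x 2 = 162 responses
```

Keeping the unblinding key separate from the rating queue is one simple way to ensure that neither model identity nor a scenario label travels with the response text handed to the raters.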
Evaluation Framework
A comprehensive evaluation rubric was developed based on IADT 2020 guidelines and refined through iterative expert discussion. Model responses were evaluated across three dimensions:

Accuracy (Weight: 40%): Assessed concordance between AI responses and IADT 2020 guidelines. Evaluation criteria included correctness of diagnosis, appropriateness of treatment approach, splint type and duration, antibiotic selection and dosing (when indicated), endodontic timing recommendations, and follow-up protocols.

Completeness (Weight: 35%): Evaluated whether responses included all essential components specified in guidelines for each injury type. Required elements varied by trauma category and included immediate management steps, repositioning techniques (for luxation injuries), splinting protocols, pharmacological recommendations, pulp vitality monitoring protocols, and follow-up schedules.

Safety (Weight: 25%): Assessed absence of potentially harmful recommendations and presence of critical safety considerations. Critical violations included recommending replantation of primary teeth, prescribing tetracycline-class antibiotics to patients under 12 years of age, recommending rigid splints where flexible splints are indicated, or suggesting unnecessary endodontic treatment in immature teeth with revascularization potential.

Each dimension was scored on a 0–5 point scale (0 = unacceptable, 1 = poor, 2 = inadequate, 3 = acceptable, 4 = good, 5 = excellent). Specific point deductions were established for protocol violations categorized as critical (−2 points), serious (−1 point), moderate (−0.5 points), or minor (−0.25 points) based on clinical significance. The total weighted score was calculated using the formula:

Total Weighted Score = (Accuracy × 0.40) + (Completeness × 0.35) + (Safety × 0.25)

This weighting reflects clinical priorities in dental trauma management, where diagnostic accuracy is a prerequisite for appropriate care.
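As a worked illustration of the weighting formula, the following minimal sketch applies it to the mean dimension scores reported for Gemini in English in Table 3. Because the formula is linear, applying it to mean dimension scores should reproduce the mean total score, up to rounding. The function and constant names are illustrative and do not come from the study.

```python
WEIGHTS = {"accuracy": 0.40, "completeness": 0.35, "safety": 0.25}


def total_weighted_score(accuracy, completeness, safety):
    """Combine three 0-5 dimension scores into the total weighted score."""
    for value in (accuracy, completeness, safety):
        if not 0.0 <= value <= 5.0:
            raise ValueError("dimension scores must lie on the 0-5 scale")
    return (accuracy * WEIGHTS["accuracy"]
            + completeness * WEIGHTS["completeness"]
            + safety * WEIGHTS["safety"])


# Mean dimension scores for Gemini in English, as reported in Table 3.
gemini_english = total_weighted_score(4.657, 4.759, 4.861)
print(round(gemini_english, 3))  # 4.744, the English mean reported for Gemini
```

The same call with ChatGPT's English means (4.833, 4.778, 5.000) gives approximately 4.856, again in line with the reported total.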
Completeness was weighted second, recognizing that comprehensive management protocols including splinting, pharmacotherapy, and follow-up affect treatment outcomes. Safety, while assigned the lowest weight, was considered a critical dimension where any significant violation would substantially reduce the total score. The point deduction system was adapted from established clinical error classification frameworks and refined through pilot testing. The complete evaluation rubric with detailed scoring criteria is provided as Supplementary Material 1.

Rater Selection and Training
Two endodontists, each with more than 5 years of clinical specialization experience, served as independent evaluators. Prior to the assessment phase, both raters participated in a calibration session where they independently evaluated a subset of five responses not included in the final analysis. Discrepancies were discussed, and consensus was reached on scoring criteria interpretation. Raters were blinded to model identity and query language throughout evaluation.

Consensus Process
Following independent evaluation, responses where rater scores differed by ≥ 0.5 points on any dimension were identified for consensus review. In these cases, both raters reviewed the response and relevant IADT guidelines together and reached a mutually agreed-upon consensus score. Consensus scores were used for all subsequent analyses.

Statistical Analysis
Statistical analyses were performed using JASP software (Version 0.95.4). Statistical significance was set at α = 0.05 for all tests. Inter-rater reliability was assessed using the intraclass correlation coefficient calculated with a two-way random effects model for absolute agreement, ICC(2,1). Descriptive statistics including means, standard deviations, medians, and interquartile ranges were calculated for all performance metrics stratified by model and language.
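For readers unfamiliar with ICC(2,1), the sketch below computes it from the two-way ANOVA mean squares (the Shrout and Fleiss formulation for a two-way random effects model, absolute agreement, single rater). The study itself used JASP; this pure-Python version and the toy ratings are illustrative assumptions, not the authors' code or data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated response, one score per rater.
    """
    n = len(ratings)        # number of rated responses
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into responses, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores from two raters on six responses (not study data).
toy_ratings = [[4.5, 4.0], [5.0, 5.0], [3.5, 4.0],
               [4.0, 4.0], [5.0, 4.5], [3.0, 3.5]]
print(round(icc_2_1(toy_ratings), 3))
```

Unlike a consistency-type ICC, the absolute-agreement form includes the between-rater mean square in the denominator, so it penalizes a rater who is systematically stricter or more lenient.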
Given the ordinal nature of scoring data and non-normal distribution confirmed by Shapiro-Wilk tests, non-parametric methods were employed. Kruskal-Wallis H tests compared performance across models, with post-hoc pairwise comparisons using Mann-Whitney U tests with Bonferroni correction (adjusted α = 0.0167). Wilcoxon signed-rank tests compared paired English and Turkish responses within each model. Effect sizes (r) were classified as small, medium, or large, with r > 0.5 indicating a large effect. ICC values were interpreted according to Koo and Li criteria: < 0.50 poor, 0.50–0.75 moderate, 0.75–0.90 good, and > 0.90 excellent reliability.

RESULTS

Inter-rater Reliability
Analysis demonstrated adequate inter-rater reliability across all evaluation dimensions (Table 1 ). Per Koo and Li criteria, ICC values indicated good reliability for completeness (ICC = 0.836, 95% CI: 0.783–0.877) and accuracy (ICC = 0.783, 95% CI: 0.714–0.838), with moderate reliability for safety (ICC = 0.738, 95% CI: 0.659–0.802). While safety fell within the moderate range (0.50–0.75), this value remained above acceptable thresholds for clinical research and was consistent with inherent subjectivity in evaluating safety considerations in clinical recommendations. These reliability coefficients support the validity of the evaluation framework.

Table 1 Inter-rater Reliability Assessment

Dimension      ICC     95% CI         Interpretation
Accuracy       0.783   0.714–0.838    Good
Completeness   0.836   0.783–0.877    Good
Safety         0.738   0.659–0.802    Moderate

ICC = Intraclass Correlation Coefficient; n = 162 evaluations. Interpretation based on Koo and Li criteria.

Language Effect on Model Performance
The primary hypothesis examined whether LLM performance varies based on the language of clinical scenarios. Wilcoxon signed-rank tests revealed distinct language-dependent patterns across the three models (Fig. 1 , Table 2 ).
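The paper does not state how the effect size r was derived from the Wilcoxon signed-rank test. A common convention is r = Z/√N, with Z taken from the normal approximation to the signed-rank statistic W. The sketch below applies that convention to the W values reported for each model in this section, assuming N = 27 scenario pairs, no tie or continuity correction, and zero differences dropped before ranking. The count of 26 non-zero differences for Gemini is an inference that happens to reproduce the reported value; it is not stated in the source.

```python
import math


def wilcoxon_effect_size(w, n_nonzero, n_pairs):
    """Return (z, r) using the normal approximation to the signed-rank test.

    w:         reported signed-rank statistic W
    n_nonzero: number of pairs with a non-zero difference (assumed)
    n_pairs:   total number of pairs, used in r = z / sqrt(n_pairs)
    No tie or continuity correction is applied in this sketch.
    """
    mu = n_nonzero * (n_nonzero + 1) / 4
    sigma = math.sqrt(n_nonzero * (n_nonzero + 1) * (2 * n_nonzero + 1) / 24)
    z = abs(mu - w) / sigma
    return z, z / math.sqrt(n_pairs)


# (W, assumed non-zero pairs) per model; W values are those reported in the text.
reported_w = {"ChatGPT": (0.0, 27), "Gemini": (58.5, 26), "Claude": (138.0, 27)}
effect_sizes = {}
for model, (w, n_nonzero) in reported_w.items():
    z, r = wilcoxon_effect_size(w, n_nonzero, n_pairs=27)
    effect_sizes[model] = round(r, 3)
    print(f"{model}: z = {z:.3f}, r = {r:.3f}")
```

Under these assumptions the sketch yields r values of 0.874, 0.572, and 0.236, matching Table 2. The agreement suggests, but does not prove, that this is the convention the authors used.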
ChatGPT demonstrated the most pronounced language effect, with higher performance in English (4.856 ± 0.144) compared to Turkish (4.431 ± 0.191), representing a 9.14% performance differential (W = 0.0, p < 0.001, r = 0.874). This large effect size indicates substantial language-dependent variation, with English responses consistently outperforming Turkish counterparts across all 27 scenarios.

Gemini exhibited a moderate language effect, scoring 4.744 ± 0.364 in English versus 4.481 ± 0.420 in Turkish, a 5.69% difference that reached statistical significance (W = 58.5, p = 0.003, r = 0.572). While less pronounced than ChatGPT, this pattern indicates meaningful performance variation attributable to language.

Claude demonstrated language independence, achieving virtually identical scores in English (4.408 ± 0.459) and Turkish (4.409 ± 0.407). The negligible difference of -0.02% was not statistically significant (W = 138.0, p = 0.220, r = 0.236), indicating that Claude’s clinical reasoning capabilities were unaffected by language of presentation (Fig. 2 ). This finding suggests that the evaluated models differ in how they process multilingual input.

Table 2 Language Effect Analysis by Model

Model     English          Turkish          Diff (%)   p-value      Effect (r)
ChatGPT   4.856 ± 0.144    4.431 ± 0.191    +9.14      < 0.001***   0.874
Gemini    4.744 ± 0.364    4.481 ± 0.420    +5.69      0.003**      0.572
Claude    4.408 ± 0.459    4.409 ± 0.407    -0.02      0.220        0.236

Values presented as mean ± SD. Wilcoxon signed-rank test for paired comparisons. *** p < 0.001, ** p < 0.01. Effect size (r): small, medium, or large (r > 0.5).

Overall Model Performance Comparison
Kruskal-Wallis analysis revealed significant differences among models when combining both language conditions (H = 11.45, p = 0.003). ChatGPT achieved the highest overall mean score (4.643 ± 0.270), followed by Gemini (4.613 ± 0.407) and Claude (4.409 ± 0.426).
Post-hoc pairwise comparisons with Bonferroni correction showed no significant difference between ChatGPT and Gemini (U = 1388.0, p = 0.667), while both outperformed Claude (ChatGPT vs Claude: U = 1908.0, p = 0.006; Gemini vs Claude: U = 1951.5, p = 0.002).

Language-stratified analysis revealed divergent patterns. In English, significant model differences were observed (H = 22.31, p < 0.001), whereas Turkish responses showed no significant differences among models (H = 2.89, p = 0.236). This interaction suggests that model superiority is contingent upon language context, with performance hierarchies converging when operating in non-English languages.

Dimension-Specific Performance
Analysis of individual evaluation dimensions revealed consistent patterns across accuracy, completeness, and safety metrics (Fig. 3 , Table 3 ). The language effect was most pronounced in completeness, where ChatGPT showed a 0.500-point decrease from English to Turkish, followed by Gemini (0.426-point decrease). Claude maintained stable completeness scores across languages (difference: 0.019 points).

Safety performance demonstrated a ceiling effect in English, particularly for ChatGPT, which achieved perfect safety scores (5.000) across all 27 English scenarios. However, Turkish responses revealed more variability, with ChatGPT’s safety scores declining to 4.639 (only 25.9% of scenarios achieving perfect safety scores, compared to 100% in English). Claude exhibited the most consistent safety performance across languages (English: 4.796, Turkish: 4.796), with comparable rates of perfect safety scores (66.7% vs 59.3%).
Table 3 Performance by Evaluation Dimension

Dimension      Model     English   Turkish   Difference
Accuracy       ChatGPT   4.833     4.435     +0.398
Accuracy       Gemini    4.657     4.426     +0.231
Accuracy       Claude    4.296     4.315     -0.019
Completeness   ChatGPT   4.778     4.278     +0.500
Completeness   Gemini    4.759     4.333     +0.426
Completeness   Claude    4.259     4.241     +0.019
Safety         ChatGPT   5.000     4.639     +0.361
Safety         Gemini    4.861     4.778     +0.083
Safety         Claude    4.796     4.796     0.000

All values represent consensus scores. Positive differences indicate higher English performance.

Response Consistency and Clinical Acceptability
Analysis of response variability, measured by standard deviation of total weighted scores, revealed notable differences in consistency among models. ChatGPT demonstrated the highest consistency with the lowest variability (combined SD = 0.270), though this uniformity was partially attributable to a ceiling effect in English responses (SD = 0.144). Gemini and Claude showed comparable variability (combined SD = 0.407 and 0.426, respectively), indicating more heterogeneous performance across clinical scenarios.

Clinical acceptability, defined as responses achieving total weighted scores ≥ 4.5, further illuminated performance patterns. In English, ChatGPT achieved 100% acceptability (27/27), compared to 81.5% for Gemini (22/27) and 51.9% for Claude (14/27). Claude’s acceptability rate improved in Turkish (59.3%, 16/27) while ChatGPT’s declined substantially (44.4%, 12/27), suggesting that language-related performance degradation disproportionately affects clinical utility for some models.

DISCUSSION
This study provides evidence regarding the differential impact of query language on large language model performance in dental trauma management. Our findings demonstrate that language-dependent performance variations are model-specific rather than universal, with ChatGPT 5.2 showing strong English superiority (9.14%, p < 0.001), Gemini 3.0 exhibiting moderate English advantage (5.69%, p = 0.003), and Claude Sonnet 4.5 demonstrating complete language independence (-0.02%, p = 0.220).
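The paper does not say which denominator underlies the reported percentage differences. The reported figures appear consistent with a symmetric percentage difference, in which the English minus Turkish gap is divided by the mean of the two scores rather than by either score alone. The short sketch below checks this reading against the means in Table 2; the function name is illustrative, and the interpretation is an assumption rather than something the source confirms.

```python
def percent_difference(english, turkish):
    """Symmetric percentage difference relative to the mean of both scores."""
    return (english - turkish) / ((english + turkish) / 2) * 100


# (English mean, Turkish mean, percentage reported in the paper)
table_2 = {
    "ChatGPT": (4.856, 4.431, 9.14),
    "Gemini": (4.744, 4.481, 5.69),
    "Claude": (4.408, 4.409, -0.02),
}
for model, (english, turkish, reported) in table_2.items():
    computed = percent_difference(english, turkish)
    print(f"{model}: computed {computed:+.2f}%, reported {reported:+.2f}%")
```

Computed from the rounded means, each value lands within about 0.02 percentage points of the reported figure. Dividing by the English mean alone would give roughly 8.75% for ChatGPT, and dividing by the Turkish mean roughly 9.59%, so the choice of denominator matters when comparing these numbers with other studies.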
Language Effect and Model Performance
Our findings regarding ChatGPT’s English superiority align with recent literature. Studies have reported significant English advantage across dental and medical examinations [ 25 – 27 ]. Wójcik et al. found that Claude demonstrated the most consistent cross-lingual performance, paralleling our observation of Claude’s language independence [ 28 ]. However, Sozen Yanik et al. reported no significant language effect in maxillofacial prosthodontics, suggesting that language effects may vary across clinical domains [ 29 ].

The language-stratified analysis revealed that significant model differences observed in English (H = 22.31, p < 0.001) disappeared in Turkish (H = 2.89, p = 0.236). This convergence suggests that language-related degradation disproportionately affects high-performing models, effectively equalizing capabilities in non-English contexts.

ChatGPT achieved the highest overall score (4.643 ± 0.270), consistent with recent benchmarks showing continued advancement in LLM capabilities from earlier reports of 57.5% accuracy [ 15 ] to current levels exceeding 85% [ 22 , 31 ]. Our lower diagnostic accuracy rates compared to studies reporting 100% [ 21 ] likely reflect our scenario-based approach requiring independent injury identification.

Safety Considerations
While ChatGPT achieved the highest overall performance, Claude demonstrated the most consistent safety scores across languages. ChatGPT achieved perfect safety scores in all English scenarios but showed marked variability in Turkish (only 25.9% perfect scores versus 100% in English). This pattern has important clinical implications, as Wang et al. noted that different models show advantages in different task types [ 32 ].

Clinical Implications
Our findings suggest that language considerations should factor into model selection for multilingual healthcare settings. In Turkish-speaking contexts, Claude’s language independence may compensate for its lower absolute accuracy.
The substantial decline in ChatGPT’s safety scores from English to Turkish represents a clinically significant concern warranting careful implementation planning.

The clinical significance of observed performance differentials warrants careful interpretation. ChatGPT’s 9.14% performance gap between English and Turkish translates to approximately 0.425 points on our 5-point scale. In practical terms, this difference could represent the distinction between a “good” (4.0) and “excellent” (4.5+) clinical recommendation, potentially affecting the comprehensiveness of follow-up protocols or specificity of treatment timing. More critically, the substantial decline in ChatGPT’s safety scores (100% to 25.9% perfect scores) suggests that language-related performance degradation may disproportionately affect safety-critical aspects. For individual patient care, this could manifest as incomplete safety warnings regarding contraindicated medications or suboptimal splinting recommendations.

The high overall performance of all models (4.41–4.64 on a 5-point scale) suggests contemporary LLMs can serve as clinical decision support tools in dental trauma management with appropriate oversight. Haupt et al. demonstrated that ChatGPT significantly improved student accuracy in dental trauma assessments, supporting educational utility [ 33 ].

Strengths
This study offers several methodological strengths. The scenario-based approach requiring independent injury identification more closely approximates real-world clinical decision-making compared to studies using direct questioning where the diagnosis is provided. The blinded, dual-rater evaluation with consensus process enhances reliability and reduces individual assessor bias. The comprehensive evaluation framework incorporating accuracy, completeness, and safety dimensions provides a more holistic assessment of clinical utility than accuracy metrics alone.
The direct language comparison within identical scenarios controls for content-related confounders that could affect cross-study comparisons. This is among the first studies to systematically evaluate Claude’s multilingual performance in dental traumatology, revealing its unique language-independent characteristics.

Limitations
Several limitations should be acknowledged. The study evaluated only two languages; findings may not generalize to other language pairs. Evaluation was conducted at a single time point, and model performance may change with updates. The scenarios may not fully capture real-world complexity. Multimodal capabilities and domain-specific models such as Dental Trauma Evo, which achieved 85.43% accuracy, were not evaluated [ 20 ]. The sample of 27 scenarios may limit statistical power for subgroup analyses. Future research should expand language comparisons, conduct longitudinal tracking of model updates, integrate multimodal inputs, and perform prospective clinical validation studies.

Conclusions
This study demonstrates that large language models exhibit differential language effects in dental trauma management. ChatGPT 5.2 showed the highest overall performance but significant English superiority, Gemini 3.0 exhibited moderate English advantage, and Claude Sonnet 4.5 demonstrated complete language independence. Model selection should consider both absolute performance and language-specific characteristics, particularly in non-English-speaking healthcare contexts.

Declarations

Ethics approval and consent to participate
Not applicable. This study analyzed AI-generated responses to fictional clinical scenarios without human participant involvement or use of patient data.

Consent for publication
Not applicable.

Availability of data and materials
The datasets generated and analyzed during the current study, including all AI-generated responses and evaluation scores, are available from the corresponding author upon reasonable request.
The complete evaluation rubric and clinical scenarios are provided as Supplementary Materials.

Competing interests
The authors declare no competing interests. This research received no specific funding.

Authors’ contributions
H.Ö. conceived and designed the study, developed the clinical scenarios, performed the translations, collected the data, conducted the statistical analysis, interpreted the results, and drafted the manuscript. M.D. contributed to data collection and served as an independent rater for response evaluation. Both authors read and approved the final manuscript.

Acknowledgements
NotebookLM (Google) was used to assist with literature review and synthesis. Grammarly was used for language editing and proofreading. After using these AI tools, the authors carefully reviewed, edited, and revised all content to ensure accuracy and quality, and took full responsibility for the final content of the published article.

References
1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.
2. Knitza J, Kuhn S. Large language models–hype or hope? Die Dermatologie. 2025;76(10):672–4.
3. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.
4. Hamada M, Kikuchi S, Akitomo T, Kusaka S, Iwamoto Y, Nomura R. Applications and potential of ChatGPT in dentistry: Scoping review of research perspectives. J Dent Sci 2025.
5. Jaleel A, Aziz U, Farid G, Bashir MZ, Mirza TR, Abbas SMK, Aslam S, Sikander RMH. Evaluating the potential and accuracy of ChatGPT-3.5 and 4.0 in medical licensing and in-training examinations: systematic review and meta-analysis. JMIR Med Educ. 2025;11(1):e68070.
6. Şan İ, AKKAN ÖZ M, Yortanli M, Genç M, Bulut B, Gür A, YAZICI R, MUTLU H, Gönen MÖ. AI performance in emergency medicine fellowship examination: comparative analysis of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 models. Turk J Med Sci. 2025;55(5):1292–9.
Urda-Cîmpean AE, Leucuța D-C, Drugan C, Duțu A-G, Călinici T, Drugan T. Assessing the Accuracy of Diagnostic Capabilities of Large Language Models. Diagnostics. 2025;15(13):1657. Workum JD, Van De Sande D, Gommers D, Van Genderen ME. Bridging the gap: a practical step-by-step approach to warrant safe implementation of large language models in healthcare. Front Artif Intell. 2025;8:1504805. Levin L, Day PF, Hicks L, O'Connell A, Fouad AF, Bourguignon C, Abbott PV. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: General introduction. Dent Traumatol. 2020;36(4):309–13. Anil S, Joseph B. Contemporary Advances in Diagnosis, Management, and Prevention of Traumatic Dental Injuries. 2025. Cantao AB, Levin L. Translating Knowledge Into Practice in Dental Trauma: From Education to Emergency Response and Prevention. In. Volume 41. Wiley Online Library; 2025. pp. 619–24. Bourguignon C, Cohenca N, Lauridsen E, Flores MT, O'Connell AC, Day PF, Tsilingaridis G, Abbott PV, Fouad AF, Hicks L. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 1. Fractures and luxations. Dent Traumatol. 2020;36(4):314–30. Fouad AF, Abbott PV, Tsilingaridis G, Cohenca N, Lauridsen E, Bourguignon C, O'Connell A, Flores MT, Day PF, Hicks L. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 2. Avulsion of permanent teeth. Dent Traumatol. 2020;36(4):331–42. Day PF, Flores MT, O'Connell AC, Abbott PV, Tsilingaridis G, Fouad AF, Cohenca N, Lauridsen E, Bourguignon C, Hicks L. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 3. Injuries in the primary dentition. Dent Traumatol. 2020;36(4):343–59. Ozden I, Gokyar M, Ozden ME, Sazak Ovecoglu H. Assessment of artificial intelligence applications in responding to dental trauma. Dent Traumatol. 2024;40(6):722–9. 
Kuru HE, Aşık A, Demir DM. Can artificial intelligence language models effectively address dental trauma questions? Dental Traumatology 2025. Taraç MG. Evaluation of Artificial Intelligence Chatbots in the Management of Primary Tooth Traumas: A Comparative Analysis. J Int Dent Sci (Uluslararası Diş Hekimliği Bilimleri Dergisi). 2025;11(1):22–31. Sezer B, Aydoğdu T. Performance of advanced artificial intelligence models in traumatic dental injuries in primary dentition: a comparative evaluation of ChatGPT-4 Omni, DeepSeek, gemini advanced, and Claude 3.7 in terms of accuracy, completeness, response time, and readability. Appl Sci. 2025;15(14):7778. Çege EE, Cömert H, Akal N, Ölmez A. Evaluation of the Performance of Artificial Intelligence Based Chatbots in Providing First Aid Information on Dental Trauma According to the ToothSOS Application. Dent Traumatol 2025. Kumar V, Sachdeva A, Sharma S, Chawla A, Kumar V, Pandey S, Logani A. Performance Comparison of a Domain-Specific Chatbot and General‐Purpose Chatbots in Dental Traumatology. Dental Traumatology; 2025. Keleş ÖK, Arslan ZB. Performance of artificial intelligence chatbots in the diagnosis and management of simulated dental trauma cases: an evaluation based on IADT guidelines. Clin Oral Invest. 2025;30(1):26. Termteerapornpimol K, Kulvitit S, Prommanee S, Khurshid Z, Porntaveetus T. Comparative Benchmark of Seven Large Language Models for Traumatic Dental Injury Knowledge. Eur J Dentistry 2025. Novikova J, Anderson C, Blili-Hamelin B, Rosati D, Majumdar S. Consistency in language models: Current landscape, challenges, and future directions. arXiv preprint arXiv:250500268 2025. Ghosh A, Dutta D, Saha S, Agarwal C. A survey of multilingual reasoning in language models. Find Association Comput Linguistics: EMNLP. 2025;2025:8920–36. Sarı MBD, Sezer B. ChatGPT-4 Omni’s accuracy in multiple-choice dentistry questions: a multidisciplinary and bilingual assessment. Essentials Dentistry. 2025;4(1):1–9. 
Büyüközer Özkan H, Doğan Çankaya T, Kölüş T. The Impact of Language Variability on Artificial Intelligence Performance in Regenerative Endodontics. Healthcare: 2025. MDPI; 2025. p. 1190. Atılan AU, Çetin N. Benchmarking Large Language Models on the Turkish Dermatology Board Exam: A Comparative Multilingual Analysis. Turkish J Dermatology 2025. Wójcik D, Adamiak O, Czerepak G, Tokarczuk O, Szalewski L. Comparing the performance of ChatGPT, Gemini, and Claude in English and Polish on medical examinations. Sci Rep. 2025;15:33083. Sozen Yanik I, Sahin Hazir D, Bilgin Avsar D. Cross-lingual performance of large language models in maxillofacial prosthodontics: a comparative evaluation. BMC Oral Health. 2025;25(1):1630. Agarwal A, Meghwani H, Patel HL, Sheng T, Ravi S, Roth D. Aligning llms for multilingual consistency in enterprise applications. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track: 2025 ; 2025: 117–137. Lisboa RM, Braido A, de-Jesus-Soares A, Tewari N, Soares CJ, Paranhos LR, Vieira WA. Performance of five free large language models in dental trauma: a 30-day longitudinal benchmark study. Front Oral Health. 2025;6:1737114. Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of large language models when answering clinical research questions: Systematic review and network meta-analysis. J Med Internet Res. 2025;27:e64486. Haupt F, Rödig T, Liersch P. Evaluating ChatGPT-4o as an Educational Support Tool for the Emergency Management of Dental Trauma: Randomized Controlled Study Among Students. JMIR Med Educ. 2025;11(1):e80576. Additional Declarations No competing interests reported. 
Supplementary Files

SupplementaryMaterials.docx
SupplementaryRawData.xlsx
SupplementarySampleResponsesFull.docx

Status: Under Review

Version 1 posted
Editorial decision: Revision requested, 16 Mar, 2026
Reviews received at journal, 02 Mar, 2026
Reviews received at journal, 27 Feb, 2026
Reviewers agreed at journal, 23 Feb, 2026
Reviewers agreed at journal, 20 Feb, 2026
Reviews received at journal, 18 Feb, 2026
Reviewers agreed at journal, 18 Feb, 2026
Reviewers invited by journal, 18 Feb, 2026
Editor assigned by journal, 11 Feb, 2026
Submission checks completed at journal, 11 Feb, 2026
First submitted to journal, 11 Feb, 2026
Wilcoxon signed-rank tests revealed distinct language-dependent patterns across the three models (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eChatGPT demonstrated the most pronounced language effect, with higher performance in English (4.856\u0026thinsp;\u0026plusmn;\u0026thinsp;0.144) compared to Turkish (4.431\u0026thinsp;\u0026plusmn;\u0026thinsp;0.191), representing a 9.14% performance differential (W\u0026thinsp;=\u0026thinsp;0.0, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001, r\u0026thinsp;=\u0026thinsp;0.874). This large effect size indicates substantial language-dependent variation, with English responses consistently outperforming Turkish counterparts across all 27 scenarios.\u003c/p\u003e \u003cp\u003eGemini exhibited a moderate language effect, scoring 4.744\u0026thinsp;\u0026plusmn;\u0026thinsp;0.364 in English versus 4.481\u0026thinsp;\u0026plusmn;\u0026thinsp;0.420 in Turkish, a 5.69% difference that reached statistical significance (W\u0026thinsp;=\u0026thinsp;58.5, p\u0026thinsp;=\u0026thinsp;0.003, r\u0026thinsp;=\u0026thinsp;0.572). While less pronounced than ChatGPT, this pattern indicates meaningful performance variation attributable to language.\u003c/p\u003e \u003cp\u003eClaude demonstrated language independence, achieving virtually identical scores in English (4.408\u0026thinsp;\u0026plusmn;\u0026thinsp;0.459) and Turkish (4.409\u0026thinsp;\u0026plusmn;\u0026thinsp;0.407). The negligible difference of -0.02% was not statistically significant (W\u0026thinsp;=\u0026thinsp;138.0, p\u0026thinsp;=\u0026thinsp;0.220, r\u0026thinsp;=\u0026thinsp;0.236), indicating that Claude\u0026rsquo;s clinical reasoning capabilities were unaffected by language of presentation (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
This finding suggests fundamentally different multilingual processing architectures among the evaluated models.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eLanguage Effect Analysis by Model\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEnglish\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTurkish\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eDiff (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ep-value\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eEffect (r)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChatGPT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c2\"\u003e \u003cp\u003e4.856\u0026thinsp;\u0026plusmn;\u0026thinsp;0.144\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e4.431\u0026thinsp;\u0026plusmn;\u0026thinsp;0.191\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;9.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001***\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.874\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e4.744\u0026thinsp;\u0026plusmn;\u0026thinsp;0.364\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e4.481\u0026thinsp;\u0026plusmn;\u0026thinsp;0.420\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;5.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.003**\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.572\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e4.408\u0026thinsp;\u0026plusmn;\u0026thinsp;0.459\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e4.409\u0026thinsp;\u0026plusmn;\u0026thinsp;0.407\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e \u003cp\u003e0.220\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.236\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eValues presented as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD. Wilcoxon signed-rank test for paired comparisons. ***p\u0026thinsp;\u0026lt;\u0026thinsp;0.001, **p\u0026thinsp;\u0026lt;\u0026thinsp;0.01. Effect size: small (r\u0026thinsp;\u0026lt;\u0026thinsp;0.3), medium (r\u0026thinsp;=\u0026thinsp;0.3\u0026ndash;0.5), large (r\u0026thinsp;\u0026gt;\u0026thinsp;0.5).\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eOverall Model Performance Comparison\u003c/h2\u003e \u003cp\u003eKruskal-Wallis analysis revealed significant differences among models when combining both language conditions (H\u0026thinsp;=\u0026thinsp;11.45, p\u0026thinsp;=\u0026thinsp;0.003). ChatGPT achieved the highest overall mean score (4.643\u0026thinsp;\u0026plusmn;\u0026thinsp;0.270), followed by Gemini (4.613\u0026thinsp;\u0026plusmn;\u0026thinsp;0.407) and Claude (4.409\u0026thinsp;\u0026plusmn;\u0026thinsp;0.426). Post-hoc pairwise comparisons with Bonferroni correction showed no significant difference between ChatGPT and Gemini (U\u0026thinsp;=\u0026thinsp;1388.0, p\u0026thinsp;=\u0026thinsp;0.667), while both outperformed Claude (ChatGPT vs Claude: U\u0026thinsp;=\u0026thinsp;1908.0, p\u0026thinsp;=\u0026thinsp;0.006; Gemini vs Claude: U\u0026thinsp;=\u0026thinsp;1951.5, p\u0026thinsp;=\u0026thinsp;0.002).\u003c/p\u003e \u003cp\u003eLanguage-stratified analysis revealed divergent patterns. 
In English, significant model differences were observed (H\u0026thinsp;=\u0026thinsp;22.31, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), whereas Turkish responses showed no significant differences among models (H\u0026thinsp;=\u0026thinsp;2.89, p\u0026thinsp;=\u0026thinsp;0.236). This interaction suggests that model superiority is contingent upon language context, with performance hierarchies converging when operating in non-English languages.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eDimension-Specific Performance\u003c/h2\u003e \u003cp\u003eAnalysis of individual evaluation dimensions revealed consistent patterns across accuracy, completeness, and safety metrics (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). The language effect was most pronounced in completeness, where ChatGPT showed a 0.500-point decrease from English to Turkish, followed by Gemini (0.426-point decrease). Claude maintained stable completeness scores across languages (difference: 0.019 points).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eSafety performance demonstrated a ceiling effect in English, particularly for ChatGPT which achieved perfect safety scores (5.000) across all 27 English scenarios. However, Turkish responses revealed more variability, with ChatGPT\u0026rsquo;s safety scores declining to 4.639 (only 25.9% of scenarios achieving perfect safety scores, compared to 100% in English). 
Claude exhibited the most consistent safety performance across languages (English: 4.796, Turkish: 4.796), with comparable rates of perfect safety scores (66.7% vs 59.3%).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance by Evaluation Dimension\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDimension\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnglish\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTurkish\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eDifference\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.833\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.435\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.398\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.657\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.426\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.231\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClaude\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.296\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.315\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.019\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eCompleteness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.778\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.278\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.500\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.759\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e4.333\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.426\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClaude\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.259\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.241\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.019\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eSafety\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.639\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.361\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.861\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.778\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;0.083\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClaude\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.796\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.796\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eAll values represent consensus scores. Positive differences indicate higher English performance.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eResponse Consistency and Clinical Acceptability\u003c/h2\u003e \u003cp\u003eAnalysis of response variability, measured by the standard deviation of total weighted scores, revealed notable differences in consistency among models. ChatGPT demonstrated the highest consistency, with the lowest variability (combined SD\u0026thinsp;=\u0026thinsp;0.270), though this uniformity was partially attributable to a ceiling effect in English responses (SD\u0026thinsp;=\u0026thinsp;0.144). Gemini and Claude showed comparable variability (combined SD\u0026thinsp;=\u0026thinsp;0.407 and 0.426, respectively), indicating more heterogeneous performance across clinical scenarios.\u003c/p\u003e \u003cp\u003eClinical acceptability, defined as responses achieving total weighted scores\u0026thinsp;\u0026ge;\u0026thinsp;4.5, further illuminated performance patterns. In English, ChatGPT achieved 100% acceptability (27/27), compared to 81.5% for Gemini (22/27) and 51.9% for Claude (14/27). 
Claude\u0026rsquo;s acceptability rate improved in Turkish (59.3%, 16/27) while ChatGPT\u0026rsquo;s declined substantially (44.4%, 12/27), suggesting that language-related performance degradation disproportionately affects clinical utility for some models.\u003c/p\u003e \u003c/div\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eThis study provides evidence regarding the differential impact of query language on large language model performance in dental trauma management. 
Our findings demonstrate that language-dependent performance variations are model-specific rather than universal, with ChatGPT 5.2 showing strong English superiority (9.14%, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), Gemini 3.0 exhibiting moderate English advantage (5.69%, p\u0026thinsp;=\u0026thinsp;0.003), and Claude 4.5 Sonnet demonstrating complete language independence (-0.02%, p\u0026thinsp;=\u0026thinsp;0.220).\u003c/p\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eLanguage Effect and Model Performance\u003c/h2\u003e \u003cp\u003eOur findings regarding ChatGPT\u0026rsquo;s English superiority align with recent literature. Studies have reported significant English advantage across dental and medical examinations [\u003cspan additionalcitationids=\"CR26\" citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. W\u0026oacute;jcik et al. found that Claude demonstrated the most consistent cross-lingual performance, paralleling our observation of Claude\u0026rsquo;s language independence [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. However, Sozen Yanik et al. reported no significant language effect in maxillofacial prosthodontics, suggesting that language effects may vary across clinical domains [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe language-stratified analysis revealed that significant model differences observed in English (H\u0026thinsp;=\u0026thinsp;22.31, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) disappeared in Turkish (H\u0026thinsp;=\u0026thinsp;2.89, p\u0026thinsp;=\u0026thinsp;0.236). 
This convergence suggests that language-related degradation disproportionately affects high-performing models, effectively equalizing capabilities in non-English contexts.\u003c/p\u003e \u003cp\u003eChatGPT achieved the highest overall score (4.643\u0026thinsp;\u0026plusmn;\u0026thinsp;0.270), consistent with recent benchmarks showing continued advancement in LLM capabilities from earlier reports of 57.5% accuracy [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] to current levels exceeding 85% [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Our lower diagnostic accuracy rates compared to studies reporting 100% [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] likely reflect our scenario-based approach requiring independent injury identification.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eSafety Considerations\u003c/h2\u003e \u003cp\u003eWhile ChatGPT achieved the highest overall performance, Claude demonstrated the most consistent safety scores across languages. ChatGPT achieved perfect safety scores in all English scenarios but showed marked variability in Turkish (only 25.9% perfect scores versus 100% in English). This pattern has important clinical implications, as Wang et al. noted that different models show advantages in different task types [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eClinical Implications\u003c/h2\u003e \u003cp\u003eOur findings suggest that language considerations should factor into model selection for multilingual healthcare settings. In Turkish-speaking contexts, Claude\u0026rsquo;s language independence may compensate for its lower absolute accuracy. 
The substantial decline in ChatGPT\u0026rsquo;s safety scores from English to Turkish represents a clinically significant concern warranting careful implementation planning.\u003c/p\u003e \u003cp\u003eThe clinical significance of observed performance differentials warrants careful interpretation. ChatGPT\u0026rsquo;s 9.14% performance gap between English and Turkish translates to approximately 0.425 points on our 5-point scale (4.856 vs. 4.431). In practical terms, this difference could represent the distinction between a \u0026ldquo;good\u0026rdquo; (4.0) and \u0026ldquo;excellent\u0026rdquo; (4.5+) clinical recommendation, potentially affecting the comprehensiveness of follow-up protocols or the specificity of treatment timing. More critically, the substantial decline in ChatGPT\u0026rsquo;s safety scores (from 100% to 25.9% of scenarios with perfect scores) suggests that language-related performance degradation may disproportionately affect safety-critical aspects. For individual patient care, this could manifest as incomplete safety warnings regarding contraindicated medications or suboptimal splinting recommendations.\u003c/p\u003e \u003cp\u003eThe high overall performance of all models (4.41\u0026ndash;4.64 on a 5-point scale) suggests contemporary LLMs can serve as clinical decision support tools in dental trauma management with appropriate oversight. Haupt et al. demonstrated that ChatGPT significantly improved student accuracy in dental trauma assessments, supporting educational utility [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eStrengths\u003c/h2\u003e \u003cp\u003eThis study offers several methodological strengths. The scenario-based approach requiring independent injury identification more closely approximates real-world clinical decision-making compared to studies using direct questioning where the diagnosis is provided. 
The blinded, dual-rater evaluation with consensus process enhances reliability and reduces individual assessor bias. The comprehensive evaluation framework incorporating accuracy, completeness, and safety dimensions provides a more holistic assessment of clinical utility than accuracy metrics alone. The direct language comparison within identical scenarios controls for content-related confounders that could affect cross-study comparisons. This is among the first studies to systematically evaluate Claude\u0026rsquo;s multilingual performance in dental traumatology, revealing its unique language-independent characteristics.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003eSeveral limitations should be acknowledged. The study evaluated only two languages; findings may not generalize to other language pairs. Evaluation was conducted at a single time point, and model performance may change with updates. The scenarios may not fully capture real-world complexity. Multimodal capabilities and domain-specific models such as Dental Trauma Evo, which achieved 85.43% accuracy, were not evaluated [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. The sample of 27 scenarios may limit statistical power for subgroup analyses.\u003c/p\u003e \u003cp\u003eFuture research should expand language comparisons, conduct longitudinal tracking of model updates, integrate multimodal inputs, and perform prospective clinical validation studies.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThis study demonstrates that large language models exhibit differential language effects in dental trauma management. ChatGPT 5.2 showed highest overall performance but significant English superiority, Gemini 3.0 exhibited moderate English advantage, and Claude 4.5 Sonnet demonstrated complete language independence. 
Model selection should consider both absolute performance and language-specific characteristics, particularly in non-English-speaking healthcare contexts.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable. This study analyzed AI-generated responses to fictional clinical scenarios without human participant involvement or use of patient data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated and analyzed during the current study, including all AI-generated responses and evaluation scores, are available from the corresponding author upon reasonable request. The complete evaluation rubric and clinical scenarios are provided as Supplementary Materials.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research received no specific funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eH.\u0026Ouml;. conceived and designed the study, developed the clinical scenarios, performed the translations, collected the data, conducted the statistical analysis, interpreted the results, and drafted the manuscript. M.D.\u0026nbsp;contributed to data collection and served as an independent rater for response evaluation. Both authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNotebookLM (Google) was used to assist with literature review and synthesis. 
Grammarly was used for language editing and proofreading. After using these AI tools, the authors carefully reviewed, edited, and revised all content to ensure accuracy and quality, and took full responsibility for the final content of the published article.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eTopol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44\u0026ndash;56.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKnitza J, Kuhn S. Large language models\u0026ndash;hype or hope? Die Dermatologie. 2025;76(10):672\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930\u0026ndash;40.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHamada M, Kikuchi S, Akitomo T, Kusaka S, Iwamoto Y, Nomura R. Applications and potential of ChatGPT in dentistry: Scoping review of research perspectives. J Dent Sci 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaleel A, Aziz U, Farid G, Bashir MZ, Mirza TR, Abbas SMK, Aslam S, Sikander RMH. Evaluating the potential and accuracy of ChatGPT-3.5 and 4.0 in medical licensing and in-training examinations: systematic review and meta-analysis. JMIR Med Educ. 2025;11(1):e68070.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eŞan İ, AKKAN \u0026Ouml;Z M, Yortanli M, Gen\u0026ccedil; M, Bulut B, G\u0026uuml;r A, YAZICI R, MUTLU H, G\u0026ouml;nen M\u0026Ouml;. AI performance in emergency medicine fellowship examination: comparative analysis of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 models. Turk J Med Sci. 2025;55(5):1292\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUrda-C\u0026icirc;mpean AE, Leucuța D-C, Drugan C, Duțu A-G, Călinici T, Drugan T. 
Assessing the Accuracy of Diagnostic Capabilities of Large Language Models. Diagnostics. 2025;15(13):1657.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWorkum JD, Van De Sande D, Gommers D, Van Genderen ME. Bridging the gap: a practical step-by-step approach to warrant safe implementation of large language models in healthcare. Front Artif Intell. 2025;8:1504805.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLevin L, Day PF, Hicks L, O'Connell A, Fouad AF, Bourguignon C, Abbott PV. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: General introduction. Dent Traumatol. 2020;36(4):309\u0026ndash;13.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAnil S, Joseph B. Contemporary Advances in Diagnosis, Management, and Prevention of Traumatic Dental Injuries. 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCantao AB, Levin L. Translating Knowledge Into Practice in Dental Trauma: From Education to Emergency Response and Prevention. In. Volume 41. Wiley Online Library; 2025. pp. 619\u0026ndash;24.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBourguignon C, Cohenca N, Lauridsen E, Flores MT, O'Connell AC, Day PF, Tsilingaridis G, Abbott PV, Fouad AF, Hicks L. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 1. Fractures and luxations. Dent Traumatol. 2020;36(4):314\u0026ndash;30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFouad AF, Abbott PV, Tsilingaridis G, Cohenca N, Lauridsen E, Bourguignon C, O'Connell A, Flores MT, Day PF, Hicks L. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 2. Avulsion of permanent teeth. Dent Traumatol. 
2020;36(4):331\u0026ndash;42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDay PF, Flores MT, O'Connell AC, Abbott PV, Tsilingaridis G, Fouad AF, Cohenca N, Lauridsen E, Bourguignon C, Hicks L. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 3. Injuries in the primary dentition. Dent Traumatol. 2020;36(4):343\u0026ndash;59.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOzden I, Gokyar M, Ozden ME, Sazak Ovecoglu H. Assessment of artificial intelligence applications in responding to dental trauma. Dent Traumatol. 2024;40(6):722\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKuru HE, Aşık A, Demir DM. Can artificial intelligence language models effectively address dental trauma questions? \u003cem\u003eDental Traumatology\u003c/em\u003e 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTara\u0026ccedil; MG. Evaluation of Artificial Intelligence Chatbots in the Management of Primary Tooth Traumas: A Comparative Analysis. J Int Dent Sci (Uluslararası Diş Hekimliği Bilimleri Dergisi). 2025;11(1):22\u0026ndash;31.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSezer B, Aydoğdu T. Performance of advanced artificial intelligence models in traumatic dental injuries in primary dentition: a comparative evaluation of ChatGPT-4 Omni, DeepSeek, gemini advanced, and Claude 3.7 in terms of accuracy, completeness, response time, and readability. Appl Sci. 2025;15(14):7778.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003e\u0026Ccedil;ege EE, C\u0026ouml;mert H, Akal N, \u0026Ouml;lmez A. Evaluation of the Performance of Artificial Intelligence Based Chatbots in Providing First Aid Information on Dental Trauma According to the ToothSOS Application. Dent Traumatol 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar V, Sachdeva A, Sharma S, Chawla A, Kumar V, Pandey S, Logani A. 
Performance Comparison of a Domain-Specific Chatbot and General‐Purpose Chatbots in Dental Traumatology. Dental Traumatology; 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKeleş \u0026Ouml;K, Arslan ZB. Performance of artificial intelligence chatbots in the diagnosis and management of simulated dental trauma cases: an evaluation based on IADT guidelines. Clin Oral Invest. 2025;30(1):26.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTermteerapornpimol K, Kulvitit S, Prommanee S, Khurshid Z, Porntaveetus T. Comparative Benchmark of Seven Large Language Models for Traumatic Dental Injury Knowledge. Eur J Dentistry 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNovikova J, Anderson C, Blili-Hamelin B, Rosati D, Majumdar S. Consistency in language models: Current landscape, challenges, and future directions. arXiv preprint arXiv:250500268 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhosh A, Dutta D, Saha S, Agarwal C. A survey of multilingual reasoning in language models. Find Association Comput Linguistics: EMNLP. 2025;2025:8920\u0026ndash;36.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSarı MBD, Sezer B. ChatGPT-4 Omni\u0026rsquo;s accuracy in multiple-choice dentistry questions: a multidisciplinary and bilingual assessment. Essentials Dentistry. 2025;4(1):1\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eB\u0026uuml;y\u0026uuml;k\u0026ouml;zer \u0026Ouml;zkan H, Doğan \u0026Ccedil;ankaya T, K\u0026ouml;l\u0026uuml;ş T. The Impact of Language Variability on Artificial Intelligence Performance in Regenerative Endodontics. Healthcare: 2025. MDPI; 2025. p. 1190.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAtılan AU, \u0026Ccedil;etin N. Benchmarking Large Language Models on the Turkish Dermatology Board Exam: A Comparative Multilingual Analysis. 
Turkish J Dermatology 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eW\u0026oacute;jcik D, Adamiak O, Czerepak G, Tokarczuk O, Szalewski L. Comparing the performance of ChatGPT, Gemini, and Claude in English and Polish on medical examinations. Sci Rep. 2025;15:33083.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSozen Yanik I, Sahin Hazir D, Bilgin Avsar D. Cross-lingual performance of large language models in maxillofacial prosthodontics: a comparative evaluation. BMC Oral Health. 2025;25(1):1630.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAgarwal A, Meghwani H, Patel HL, Sheng T, Ravi S, Roth D. Aligning llms for multilingual consistency in enterprise applications. In: \u003cem\u003eProceedings of the\u003c/em\u003e 2025 \u003cem\u003eConference on Empirical Methods in Natural Language Processing: Industry Track: 2025\u003c/em\u003e; 2025: 117\u0026ndash;137.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLisboa RM, Braido A, de-Jesus-Soares A, Tewari N, Soares CJ, Paranhos LR, Vieira WA. Performance of five free large language models in dental trauma: a 30-day longitudinal benchmark study. Front Oral Health. 2025;6:1737114.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of large language models when answering clinical research questions: Systematic review and network meta-analysis. J Med Internet Res. 2025;27:e64486.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHaupt F, R\u0026ouml;dig T, Liersch P. Evaluating ChatGPT-4o as an Educational Support Tool for the Emergency Management of Dental Trauma: Randomized Controlled Study Among Students. JMIR Med Educ. 
2025;11(1):e80576.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-oral-health","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ohea","sideBox":"Learn more about [BMC Oral Health](http://bmcoralhealth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/ohea/default.aspx","title":"BMC Oral Health","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8754479/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8754479/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eLarge language models (LLMs) are increasingly used as clinical decision support tools in healthcare, yet the impact of query language on their performance remains unclear, particularly in specialized domains like dental traumatology. This study evaluated whether LLM performance in dental trauma management differs based on the language of clinical scenarios (English vs. 
Turkish) and compared performance across three AI models.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eTwenty-seven clinical scenarios covering 13 dental trauma categories were presented to ChatGPT 5.2, Gemini 3.0, and Claude 4.5 Sonnet in both English and Turkish, generating 162 responses. Two blinded endodontists independently evaluated responses using a standardized rubric assessing accuracy (40%), completeness (35%), and safety (25%) against IADT 2020 Guidelines. Inter-rater reliability was assessed using intraclass correlation coefficient (ICC). Language effects were analyzed using Wilcoxon signed-rank tests; model comparisons employed Kruskal-Wallis and Mann-Whitney U tests with Bonferroni correction.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eInter-rater reliability was good across all dimensions (ICC: 0.738\u0026ndash;0.836). ChatGPT showed the strongest language effect with 9.14% higher performance in English (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001, r\u0026thinsp;=\u0026thinsp;0.874). Gemini showed moderate English advantage (5.69%, p\u0026thinsp;=\u0026thinsp;0.003, r\u0026thinsp;=\u0026thinsp;0.572). Claude exhibited language independence with virtually identical performance in both languages (-0.02%, p\u0026thinsp;=\u0026thinsp;0.220). In English, significant model differences emerged (H\u0026thinsp;=\u0026thinsp;22.31, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001); however, model performance converged in Turkish (H\u0026thinsp;=\u0026thinsp;2.89, p\u0026thinsp;=\u0026thinsp;0.236).\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eLanguage-dependent performance variations in LLMs are model-specific rather than universal. While ChatGPT achieved highest absolute scores, Claude\u0026rsquo;s language independence may offer more reliable performance in non-English clinical settings. 
These findings have implications for deployment of AI in multilingual healthcare environments.\u003c/p\u003e","manuscriptTitle":"Impact of Query Language on Large Language Model Performance in Dental Trauma Management: A Comparative Evaluation of ChatGPT, Gemini, and Claude","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-20 17:36:26","doi":"10.21203/rs.3.rs-8754479/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-03-16T08:16:05+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-02T05:47:36+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-28T02:48:16+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"12996171333286100309632647339517645488","date":"2026-02-23T20:31:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"121892532324661703471296467038833565924","date":"2026-02-20T15:09:42+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-18T18:04:56+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"306477281403308399346945570161427828869","date":"2026-02-18T13:15:45+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-18T13:05:31+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-11T10:12:18+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-11T08:08:51+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Oral Health","date":"2026-02-11T07:51:29+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-oral-health","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ohea","sideBox":"Learn more about [BMC Oral 
Health](http://bmcoralhealth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/ohea/default.aspx","title":"BMC Oral Health","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c528e1db-5917-4497-8526-9ee93451d121","owner":[],"postedDate":"February 20th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-04-26T07:53:27+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-20 17:36:26","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8754479","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8754479","identity":"rs-8754479","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}




License: CC-BY-4.0