Psychometric Alignment Between Human and Artificial Intelligence Performance in Cardiology Residency In-Service Examinations

Research Square
Article
Aykan CELIK, Tuncay KIRIS, Ugur KOCABAS, Emre OZDEMIR, Mustafa KARACA
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9247601/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 11

Abstract

Background
Large language models (LLMs) have demonstrated rapidly expanding capabilities across medical knowledge tasks, including professional examinations. However, most existing evaluations focus primarily on overall accuracy and provide limited insight into how AI performance relates to the psychometric structure of examination items.

Methods
We evaluated the performance of five large language models on a dataset of 199 cardiology residency in-service examination questions.
The models included three frontier general-purpose systems (Claude 4.6 Opus, Gemini 3.1 Flash-Lite, and GPT-5.4) and two medically oriented open-source models (MedQwen-2.5 and Qwen-3.5). Item-level analyses were conducted to examine the associations between AI accuracy and psychometric characteristics of exam questions, including human-defined item difficulty and item discrimination. Multivariable logistic regression was used to identify independent predictors of AI performance. Alignment between human and AI performance was assessed using Spearman correlation and distractor overlap analysis.

Results
Frontier models substantially outperformed medically oriented open-source models, achieving accuracies of 86.4% for Claude Opus, 82.9% for Gemini Flash-Lite, and 82.4% for GPT-5.4, compared with 53.3% for MedQwen and 18.6% for Qwen-3.5-35B. AI performance followed a clear gradient across human-defined difficulty levels, with frontier models answering 65–74% of hard questions and 92–96% of easy questions correctly. In multivariable analyses, item difficulty was the only psychometric factor consistently associated with AI success across frontier models (OR range 0.37–0.47, all p < 0.01). Human and AI performance were significantly correlated across items (Spearman ρ ≈ 0.25–0.30, p < 0.001). When AI models answered incorrectly, they frequently selected the same distractors as human examinees, with error overlap ranging from 31% to 53%.

Conclusions
Large language models demonstrate strong performance on cardiology residency examination questions and exhibit meaningful alignment with human-defined item difficulty and performance patterns. These findings suggest that AI performance on medical examinations is structured by the same psychometric characteristics that shape human assessment outcomes. Integrating AI benchmarking with psychometric analysis may provide a more informative framework for evaluating future AI systems in medical education and knowledge assessment.
Subject areas: Health sciences/Cardiology; Biological sciences/Computational biology and bioinformatics; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research
Keywords: Artificial Intelligence; Natural Language Processing; Cardiology; Educational Measurement; Psychometrics; Internship and Residency

INTRODUCTION

Large language models (LLMs) have rapidly advanced in their ability to perform knowledge-intensive tasks across multiple domains 1 . In medicine, these models have demonstrated promising capabilities in clinical reasoning, diagnostic support, and medical question answering 2 . Several recent studies have reported that frontier LLMs can approach or even exceed human-level performance on standardized medical examinations. Consequently, professional examination datasets have increasingly been used as benchmarks for evaluating the capabilities of artificial intelligence systems in medicine 3,4 .

Despite these developments, most existing evaluations focus primarily on overall accuracy or comparisons with passing thresholds. Although these metrics provide a general measure of model performance, they offer limited insight into how AI systems interact with the psychometric structure of examination questions 5 . Medical examinations are carefully constructed educational instruments grounded in measurement theory, in which characteristics such as item difficulty and item discrimination play a central role in determining test performance 6,7 .

Understanding whether AI models respond to these psychometric characteristics in ways similar to human examinees is important for several reasons. First, it provides a more nuanced interpretation of AI performance beyond simple accuracy metrics. Second, it allows examination datasets to serve not only as benchmarks but also as tools for studying the interaction between AI systems and structured knowledge assessments.
Finally, exploring human–AI alignment in examination performance may offer insights into whether AI models encounter conceptual challenges similar to those faced by medical residents 8 . Cardiology residency in-service examinations provide a particularly suitable setting for this investigation 9 . These assessments are designed to evaluate the breadth of cardiology knowledge among residents and typically include questions spanning pathophysiology, diagnostics, and clinical management. Because these examinations are constructed using established psychometric principles, they provide a structured framework for evaluating the relationship between AI performance and the difficulty and discriminative properties of exam items. In this study, we evaluated the performance of multiple large language models on a dataset of cardiology residency in-service examination questions and examined how AI responses relate to the psychometric characteristics of these items. Specifically, we investigated three key questions: 1. how frontier LLM performance compares with medically oriented open-source models; 2. whether AI accuracy follows human-defined item difficulty gradients; and 3. whether AI errors align with distractor patterns observed among human examinees. The examination items used in this study were derived from a validated institutional cardiology in-service examination program, in which all questions had previously undergone formal psychometric evaluation (including item difficulty and discrimination analyses) and were administered within an accredited cardiology residency training program. This study aims to provide a deeper understanding of how large language models interact with structured medical knowledge assessments by integrating AI benchmarking with psychometric analysis of examination items. 
METHODS Study design and dataset This study evaluated the performance of large language models on a cardiology knowledge assessment derived from cardiology residency in-service examinations. These examinations are routinely administered within residency training programs to assess knowledge acquisition and identify educational gaps among residents. The study dataset consisted of 199 multiple-choice cardiology examination items, each containing five answer options (A–E). For each item, the correct answer and aggregated response distributions from cardiology residents were available. All items originated from an institutional cardiology in-service examination program conducted within an accredited cardiology residency training program recognized by the national cardiology training authority. As part of the institutional examination quality assurance process, all questions had previously undergone formal psychometric evaluation based on classical test theory, including item difficulty indices, point-biserial discrimination coefficients, and distractor functionality analysis, as detailed in our prior longitudinal analysis of resident performance 10 . The examination dataset was organized into two predefined subsets. The primary set consisted of the core items used for the main psychometric analysis of the cardiology residency in-service examination. The supplementary set comprised additional items that met the inclusion criteria and that were analyzed separately to assess the robustness and generalizability of model performance. The text-only dataset included 199 items, consisting of 130 primary and 69 supplementary items. 
These data enabled the characterization of each item using classical psychometric parameters, including item difficulty and item discrimination, and allowed the examination of the relationship between AI performance and human-derived psychometric item properties, as well as distractor selection patterns among human examinees who had previously completed the same examination.

Human examination performance and psychometric metrics
Item difficulty was defined as the proportion of residents answering the question correctly. Based on this metric, examination items were categorized into three human-defined difficulty levels (terciles based on the distribution of resident correct response rates). Item discrimination was assessed using the point-biserial correlation coefficient, a standard measure in classical test theory that reflects how well a question differentiates between high- and low-performing examinees. In addition, the most frequently selected incorrect option among residents was identified for each question to enable comparison of distractor selection between human examinees and AI models.

AI models evaluated
Five large language models were evaluated:
Claude 4.6 Opus (Anthropic, San Francisco, CA, USA; API version March 2026)
Gemini 3.1 Flash-Lite (Google LLC, Mountain View, CA, USA; API version March 2026)
GPT-5.4 (OpenAI, San Francisco, CA, USA; API version March 2026)
Qwen-3.5-35b-a3b (Alibaba Cloud, Hangzhou, China)
MedQwen-2.5-32B-i1 (Alibaba Cloud, Hangzhou, China)
The first three represent frontier general-purpose large language models, whereas the latter two are open-source models with a medical orientation. All models were accessed programmatically via their respective Application Programming Interfaces (APIs) between March 9 and March 13, 2026.

AI prompting protocol and data processing
All questions were presented to each model using a standardized zero-shot prompt format. The models were instructed to select a single answer from the five available options using the following system prompt: “You are answering a cardiology multiple choice question. Reply with only A, B, C, D, or E. No explanation.” Model generation parameters were standardized across all API calls with a temperature of 0.2 and a maximum output length of 32 tokens to restrict responses to the predefined options.

To reduce stochastic variability and assess response consistency, each question was queried independently five times per model (runs = 5). The final answer of each model for a given question was determined by majority voting across these runs: the option selected most frequently (at least three of five runs) was designated as the consensus response. In the rare event of a tie or failure to reach a majority, the response for that question was classified as indeterminate. The overall accuracy of each model was subsequently calculated by comparing the consensus answers with the established gold-standard answer key.

Evaluation metrics
Model performance was evaluated using several complementary approaches:
Overall accuracy, defined as the proportion of correctly answered questions.
Accuracy across human-defined difficulty levels, examining how AI performance varied across easy, moderate, and hard questions.
Human–AI performance alignment across items, assessed using Spearman’s rank correlation between human correct rates and AI accuracy across items.
Human–AI error overlap, defined as the proportion of incorrect AI responses that matched the most common distractor selected by residents.

Equivalent resident performance estimation
To provide an interpretable clinical benchmark, AI model accuracy was mapped to the empirical distribution of resident examination scores. The distribution of correct response rates among cardiology residents who had previously completed the examination was used as the reference population, a cohort whose learning trajectories and assessment reliability have been previously established 10 . For each AI model, the overall accuracy across the 199 examination items was calculated and positioned within this distribution to estimate the equivalent resident percentile. This analysis provides an approximate comparison of AI model performance relative to human examinees but does not imply that AI models replicate the full cognitive processes of clinical residents.

Statistical analysis
Categorical variables, including the frequency of correct, incorrect, and indeterminate responses, were expressed as counts and percentages. To compare accuracy differences across the evaluated models, the Chi-square test or Fisher's exact test was used, as appropriate. Associations between the psychometric characteristics of exam items and AI correctness were evaluated using logistic regression models, with AI correctness (correct vs incorrect) as the dependent variable. The independent variables included item difficulty (ordinal variable), item discrimination (point-biserial correlation), and question category (primary versus supplementary item). Spearman correlation analysis was used to assess the alignment between human and AI performance across examination items.

To investigate the structure of AI errors, we examined the distribution of incorrect answer choices selected by AI models across all examination items.
For each item, the most frequently selected incorrect option among residents was identified as the top human distractor. AI responses were then compared with this reference to determine whether models selected the same distractor as residents when answering incorrectly. Two complementary analyses were performed. First, the overall distribution of distractor selections across models was summarized to characterize the AI error topology. Second, a cross-tabulated topology matrix was constructed comparing the AI-selected distractor with the most common human distractor for each item. Heatmaps were used to visualize these relationships, allowing the identification of alignment patterns between human and AI error behavior.

All analyses were performed using the R programming language (R Foundation for Statistical Computing, Vienna, Austria, version 4.5.2) within the RStudio environment version 2025.09.2 (Posit PBC, Boston, MA, USA) and Microsoft Excel for Mac, version 16.107 (Microsoft Corporation, Redmond, WA, USA). A two-sided p < 0.05 was considered statistically significant. This cross-sectional in silico observational study was designed and reported in accordance with the STROBE guidelines for observational studies and adhered to the TRIPOD-LLM reporting guideline for the rigorous evaluation of large language models 11-13 .

RESULTS

Overall performance of AI models on cardiology residency in-service examination items
The final analysis dataset consisted of 199 text-based examination items, including 130 primary and 69 supplementary items, which were evaluated across five large language models. Frontier general-purpose models substantially outperformed medically oriented open-source models. Claude Opus achieved the highest majority-vote accuracy (86.4%), followed by Gemini Flash-Lite (82.9%) and GPT-5.4 (82.4%). In contrast, MedQwen achieved an accuracy of 53.3% and Qwen-3.5-35B achieved an accuracy of 18.6% (Table 1).
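The majority-vote accuracy reported above follows the consensus rule described in the Methods (five runs per item, at least three matching votes, otherwise indeterminate). The following is a minimal Python sketch of that rule, not the authors' code (their analyses were run in R and Excel). The function names, the toy responses, the choice to ignore replies outside A–E, and the choice to count indeterminate items as not correct are all assumptions made for illustration.

```python
from collections import Counter

VALID_OPTIONS = {"A", "B", "C", "D", "E"}


def consensus_answer(runs, min_votes=3):
    """Return the option chosen in at least `min_votes` runs, else 'indeterminate'.

    Assumption: replies that are not a single letter A-E are ignored.
    """
    cleaned = (reply.strip().upper() for reply in runs)
    votes = Counter(reply for reply in cleaned if reply in VALID_OPTIONS)
    if not votes:
        return "indeterminate"
    option, count = votes.most_common(1)[0]
    return option if count >= min_votes else "indeterminate"


def majority_vote_accuracy(responses, answer_key):
    """Share of items whose consensus answer matches the answer key.

    Assumption: indeterminate items stay in the denominator and count as not correct.
    """
    correct = sum(
        consensus_answer(responses[item]) == key for item, key in answer_key.items()
    )
    return correct / len(answer_key)


# Hypothetical toy data: three items, five runs each.
responses = {
    "q1": ["A", "A", "B", "A", "A"],  # clear majority for A
    "q2": ["C", "D", "C", "D", "E"],  # no option reaches three votes
    "q3": ["E", "E", "E", "B", "B"],  # majority for E, which is not the keyed answer
}
answer_key = {"q1": "A", "q2": "C", "q3": "B"}

accuracy = majority_vote_accuracy(responses, answer_key)
print(f"Majority-vote accuracy: {accuracy:.3f}")
```

With five runs and a threshold of three, two options can never both reach the threshold, so the consensus is unambiguous whenever it exists.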
Performance was consistently higher on supplementary items than on primary items across all frontier models. Claude Opus achieved 84.6% accuracy on primary items and 89.9% on supplementary items; Gemini Flash-Lite achieved 80.8% and 87.0%, respectively; and GPT-5.4 achieved 80.8% and 85.5%, respectively. MedQwen also showed higher performance on supplementary items (63.8%) than on primary items (47.7%), whereas Qwen-3.5-35B remained poor across both sets (20.0% vs. 15.9%).

Equivalent resident performance comparison
To contextualize AI performance relative to human examinees, model accuracies were mapped to the empirical distribution of cardiology resident examination scores. Based on this comparison, Claude Opus performance corresponded approximately to the 91.7th percentile of resident performance, whereas Gemini Flash-Lite and GPT-5.4 corresponded to the 87th percentile range. In contrast, MedQwen performance corresponded approximately to the 28th percentile, while Qwen-3.5-35B fell below the 1st percentile of resident performance (Table 2).

AI performance across human-defined item difficulty levels
To assess whether AI performance followed the same item difficulty structure observed in human examinees, questions were categorized into human-defined difficulty groups based on resident correct rates. This analysis demonstrated a clear and progressive accuracy gradient across all frontier models (Figure 1). Claude Opus achieved accuracies of 70.18% on hard items, 87.93% on moderately difficult items, and 96.43% on easy items. Gemini Flash-Lite showed a similar pattern, with accuracies of 68.42%, 79.31%, and 95.24%, respectively. GPT-5.4 also followed this gradient, increasing from 61.40% on hard items to 87.93% on moderate items and 92.86% on easy items (Table 3). This pattern indicates that AI models are sensitive to the same difficulty gradients encountered by cardiology residents.
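The difficulty-stratified comparison above can be illustrated with a short Python sketch that groups items into terciles of resident correct rate and then computes AI accuracy within each group. This is an illustration under stated assumptions rather than the authors' procedure: the cut-point method (the default of `statistics.quantiles`), the handling of values that fall exactly on a cut point, the function names, and the toy data are all hypothetical.

```python
from statistics import quantiles


def difficulty_groups(human_correct_rate):
    """Label each item hard / moderate / easy by terciles of resident correct rate.

    Assumption: items at or below a cut point fall into the harder group.
    """
    low_cut, high_cut = quantiles(human_correct_rate.values(), n=3)
    groups = {}
    for item, rate in human_correct_rate.items():
        if rate <= low_cut:
            groups[item] = "hard"
        elif rate <= high_cut:
            groups[item] = "moderate"
        else:
            groups[item] = "easy"
    return groups


def accuracy_by_group(groups, ai_correct):
    """Proportion of items answered correctly by the model within each group."""
    totals, hits = {}, {}
    for item, group in groups.items():
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(ai_correct[item])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy data: resident correct rates and one model's consensus correctness.
human_correct_rate = {"q1": 0.2, "q2": 0.3, "q3": 0.5, "q4": 0.6, "q5": 0.8, "q6": 0.9}
ai_correct = {"q1": False, "q2": True, "q3": True, "q4": True, "q5": True, "q6": True}

groups = difficulty_groups(human_correct_rate)
for group, acc in accuracy_by_group(groups, ai_correct).items():
    print(f"{group}: {acc:.2f}")
```

Because many items can share the same resident correct rate, tercile groups built this way need not be equal in size.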
Psychometric predictors of AI performance
Multivariable logistic regression was performed to identify independent psychometric predictors of correct AI responses among the frontier models (Table 4). Across all three models, item difficulty emerged as the only consistent and statistically significant determinant of performance. For Claude Opus, increasing difficulty was associated with lower odds of a correct response (OR 0.35, 95% CI 0.18–0.63, p < 0.001). Similar associations were observed for Gemini Flash-Lite (OR 0.41, 95% CI 0.24–0.68, p < 0.001) and GPT-5.4 (OR 0.35, 95% CI 0.21–0.59, p < 0.001). Item discrimination showed a positive but non-significant association with AI accuracy in all three models. Likewise, primary versus supplementary item status was not independently associated with performance after adjustment.

Alignment between human and AI performance across items
At the item level, human and AI performance showed significant positive correlations across all frontier models (Figure 2). The Spearman correlation coefficients were ρ = 0.296 for Claude Opus (p < 0.001), ρ = 0.249 for Gemini Flash-Lite (p < 0.001), and ρ = 0.295 for GPT-5.4 (p < 0.001) (Table 5). The receiver operating characteristic analysis further demonstrated that human-defined item difficulty moderately predicted AI correctness. The area under the curve was 0.749 (95% CI 0.657–0.842) for Claude Opus, 0.691 (95% CI 0.602–0.780) for Gemini Flash-Lite, and 0.723 (95% CI 0.631–0.816) for GPT-5.4 (Supplementary Table S1, Figure S1). Pairwise comparisons of AUC values using DeLong tests did not demonstrate statistically significant differences between the frontier models (Supplementary Table S2). These findings indicate that questions that were easier for human residents also tended to be answered correctly more often by AI models, whereas questions that were difficult for residents also tended to challenge frontier LLMs.
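As an illustration of the item-level alignment measure used above, Spearman's ρ can be computed as the Pearson correlation of tie-averaged ranks. The sketch below is a plain-Python version with hypothetical function names and toy data; it is not the authors' R code, and it assumes that AI performance enters as one value per item (here a 0/1 correctness indicator, although a per-item proportion of correct runs would work the same way).

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: resident correct rate and AI correctness for four items.
human_rate = [0.2, 0.4, 0.6, 0.8]
ai_correct = [0, 0, 1, 1]
print(f"Spearman rho: {spearman_rho(human_rate, ai_correct):.3f}")
```

Binary correctness produces many tied ranks, which is why the tie-averaged form is used here instead of the shortcut formula based on squared rank differences.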
Consistent with this observation, ordinal trend models across human-defined difficulty groups were significant for all three frontier models. Increasing human-defined ease was associated with higher AI success for Claude Opus (β 1.02, p = 0.0007), Gemini Flash-Lite (β 0.88, p = 0.0008), and GPT-5.4 (β 1.06, p = 0.00009). Human–AI error overlap When models answered incorrectly, they often selected distractors similar to those chosen by human examinees. The proportion of AI errors that matched the most common human distractor was highest for Gemini Flash-Lite (52.94%) and Claude Opus (48.15%), followed by MedQwen (43.01%), GPT-5.4 (31.43%), and Qwen-3.5-35B (29.01%) (Supplementary Table S3, Figure S2). These findings suggest that at least some AI systems do not merely fail randomly but instead exhibit error patterns that partially overlap with human reasoning errors. Discrimination-stratified performance When items were stratified by point-biserial discrimination terciles, frontier models showed a numerically higher accuracy on highly discriminative items than on low-discrimination items. Claude Opus increased from 83.61% in the low-discrimination group to 93.42% in the high-discrimination group, while Gemini Flash-Lite increased from 86.89% to 88.16% and GPT-5.4 from 81.97% to 86.84% (Supplementary Table S4, Figure S3). In contrast, open-source models demonstrated lower and less consistent performance across discrimination strata. Topology of AI errors across distractor options An analysis of incorrect responses revealed structured patterns in AI error behavior across distractor options (Figure 3). Across models, incorrect answers were not randomly distributed but tended to cluster around specific distractors. Frontier models demonstrated relatively balanced error distributions across distractor options, whereas smaller models showed pronounced concentrations on specific incorrect choices. 
Notably, Qwen-3.5-35B exhibited strong clustering toward a limited subset of distractors, suggesting reduced discrimination between plausible incorrect alternatives. When comparing AI-selected distractors with the most frequently chosen incorrect options among residents, substantial alignment was observed (Figure 4). Diagonal dominance within the topology matrices indicated that the AI models frequently selected the same distractors that misled residents. This pattern was most pronounced for Gemini Flash-Lite and Claude Opus, whereas GPT-5.4 demonstrated moderate alignment. The mean distractor concordance across models is reported in Supplementary Table S5. DISCUSSION Principal findings In this study, we evaluated the performance of several large language models on cardiology residency in-service examination items and examined how AI responses relate to the psychometric characteristics of these questions. Three main findings emerged. First, frontier general-purpose models substantially outperformed medically oriented open-source models 3 . Claude Opus, Gemini Flash-Lite, and GPT-5.4 achieved accuracies exceeding 80%, whereas MedQwen and Qwen-3.5-35B demonstrated markedly lower performance. This finding suggests that scale, training diversity, and general reasoning capabilities of frontier models may currently play a more important role than domain-specific fine-tuning in determining performance on complex medical knowledge tasks. Second, AI performance closely followed human-defined item difficulty gradients. Across all frontier models, accuracy increased consistently from hard to easy questions. Multivariable analyses further confirmed that item difficulty was the only psychometric characteristic independently associated with AI success. These findings indicate that large language models are sensitive to the same knowledge difficulty structure encountered by human residents. 
Third, we observed a measurable alignment between human and AI performance at the item level. Questions that were easier for cardiology residents were also more likely to be answered correctly by AI systems. Moreover, when AI models answered incorrectly, they frequently selected the same distractor options chosen by residents. This partial overlap suggests that AI systems may encounter conceptual challenges similar to those faced by human learners when dealing with complex clinical knowledge questions 14 . Relationship to prior literature Most prior evaluations have focused on benchmark accuracy or passing thresholds rather than examining how AI performance interacts with the psychometric structure of examination items 15 . While these studies demonstrate the impressive capabilities of frontier models, they often provide limited insight into how AI performance relates to the psychometric structure of examination items 16,17 . Our findings extend this literature by integrating classical test theory metrics, including item difficulty and discrimination, into the evaluation of AI systems 18,19 . Thus, we demonstrate that AI performance is not randomly distributed across questions but instead follows the same difficulty gradients embedded in the examination structure 15,20 . This approach provides a more informative framework for interpreting AI performance in medical knowledge assessments 21 . Benchmarking AI performance against resident examination scores To provide an interpretable human benchmark, we mapped AI model accuracy onto the empirical distribution of resident examination scores 22 . Using this approach, the performance of frontier models corresponded approximately to the upper range of resident performance, with Claude Opus aligning with the 91st percentile and both Gemini Flash-Lite and GPT-5.4 aligning with approximately the 87th percentile of cardiology residents. 
In contrast, medically oriented open-source models corresponded to substantially lower positions within the resident performance distribution. This comparison should be interpreted as a benchmarking exercise rather than an indication that AI systems replicate the reasoning processes of clinical residents 23 . In our prior evaluation of this cohort, we demonstrated a progressive consolidation of clinical knowledge over time, with mean resident performance improving from approximately 41% in the early training period to 73% in the late training period 10 . The overall accuracies achieved by frontier LLMs (82.4%–86.4%) therefore place their performance at or above the level expected of senior cardiology residents completing their training. Nonetheless, these findings illustrate that contemporary frontier LLMs can achieve examination-level knowledge performance approaching the performance range of high-performing residents when evaluated on structured cardiology assessment items 21 . This comparison provides an intuitive benchmark for interpreting AI examination performance within the context of human training outcomes 24 . Implications for medical education and AI evaluation The results of this study have several implications for both medical education and AI benchmarking. First, they suggest that accuracy alone may be insufficient to characterize AI performance on professional examinations. Exam items vary in their psychometric characteristics, and these differences strongly influence both human and AI performance 25,26 . Consistent with this interpretation, the ROC analysis demonstrated that human-defined difficulty moderately predicted AI correctness across frontier models, suggesting that AI performance is structured by the same psychometric gradients that influence human examination outcomes 27 . 
Second, the observed alignment between human and AI performance indicates that large language models may reflect similar knowledge gradients present in medical curricula and examination frameworks. This alignment could make psychometrically structured examination datasets a useful tool for evaluating the educational relevance of AI systems 28 . Third, the overlap between human and AI error patterns raises interesting questions regarding how these models process clinical knowledge and whether they encounter conceptual pitfalls similar to those experienced by human residents 29 . Interpretation of human–AI error topology The examination of distractor topology provided additional insight into the cognitive behavior of AI models when answering complex clinical questions. Rather than producing random errors, AI models frequently selected the same distractors that were most commonly chosen by residents 30 . We have previously observed that human errors in these specific examinations tend to cluster around dominant distractors due to systematic conceptual misunderstandings 10 . This alignment suggests that errors produced by large language models may reflect similar cognitive traps embedded within question design, such as partially correct clinical reasoning or misleading contextual cues 31 . In contrast, smaller models demonstrated more irregular error distributions, indicating weaker discrimination among distractor options. These findings imply that advanced AI systems may process question semantics in ways that resemble human reasoning patterns, even when the final answer is incorrect. Limitations Several limitations should be considered when interpreting these findings. First, the analysis was based on a single specialty examination dataset derived from cardiology residency in-service assessments. 
Although cardiology questions often require the integration of pathophysiology, diagnostics, and management principles, the results may not generalize to other medical disciplines. Second, AI responses were evaluated using majority voting across multiple runs. Although this approach reduces stochastic variability in model outputs, it may not fully capture the range of possible responses generated by large language models. Third, although distractor analysis provides insight into AI error patterns, it does not directly reveal the internal reasoning processes used by the models. In addition, because large language models are trained on large-scale internet corpora, it is possible that exposure to examination-style questions or similar educational material may have influenced model performance. However, the specific examination items used in this study were institutional and not publicly available, which reduces but does not entirely eliminate this possibility. Finally, because the examination items were text-based multiple-choice questions, the results may not directly translate to clinical decision-making tasks in real-world settings.

Future directions
Future research should extend the psychometric evaluation of AI systems across multiple specialties and examination formats 32 . In addition, integrating more advanced psychometric frameworks such as item response theory may provide further insight into how AI models interact with structured knowledge assessments 17 . Understanding how AI systems respond to educational measurement frameworks may help guide the development of more robust benchmarks for evaluating medical AI systems 21 . Future studies should also investigate AI performance using longitudinal educational datasets to determine whether models capture evolving knowledge structures within medical training.
Conclusions

Large language models demonstrate strong performance on cardiology residency in-service examination questions and exhibit meaningful alignment with human-defined item difficulty and performance patterns. These findings suggest that AI performance on medical examinations is structured by the same psychometric characteristics that shape human assessment outcomes. Integrating AI benchmarking with psychometric analysis provides a promising framework for evaluating future AI systems in medical education and knowledge assessment.

Declarations

Data Availability Statement: All relevant data are included in the manuscript or are available from the corresponding author upon reasonable request.

Ethics approval and consent to participate: The study protocol was approved by the Institutional Review Board of Izmir Katip Celebi University (Approval Number: 0828, Date: 15.01.2026) and was conducted in accordance with the Declaration of Helsinki.

Patient Consent Statement: The study used anonymized, routinely collected educational assessment data and did not involve patient information; the requirement for individual informed consent was therefore waived by the ethics committee.

Author Contributions: Concept – A.Ç.; design – A.Ç.; supervision – T.K., E.Ö., M.K.; resources and materials – A.Ç., T.K., U.K., E.Ö., M.K.; data collection and processing – A.Ç., M.K.; analysis and interpretation – A.Ç.; literature search – A.Ç., U.K., M.K.; writing – A.Ç., T.K., U.K., E.Ö., M.K.; critical review – T.K., E.Ö., M.K.

Conflict of Interest Disclosure: The authors have no conflicts of interest to declare.

Acknowledgments: During the preparation of this work, the authors used ChatGPT-5.4 (OpenAI, San Francisco, CA, USA) to check grammar and spelling and to improve readability. After using this tool, the authors reviewed and edited the content as needed and took full responsibility for the content of the publication.

Funding Statement: The authors declare that they received no financial support.
References

1. Thirunavukarasu, A.J., et al. Large language models in medicine. Nat Med 29, 1930-1940 (2023).
2. Singhal, K., et al. Large language models encode clinical knowledge. Nature 620, 172-180 (2023).
3. Nori, H., King, N., McKinney, S.M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023).
4. Kung, T.H., et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2, e0000198 (2023).
5. Chang, Y., et al. A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology 15, 1-45 (2024).
6. Downing, S.M. Validity: on meaningful interpretation of assessment data. Med Educ 37, 830-837 (2003).
7. Epstein, R.M. Assessment in medical education. N Engl J Med 356, 387-396 (2007).
8. Lalor, J.P., Wu, H. & Yu, H. Building an Evaluation Scale using Item Response Theory. Proc Conf Empir Methods Nat Lang Process 2016, 648-657 (2016).
9. Halperin, J.L., Williams, E.S. & Fuster, V. COCATS 4 Introduction. J Am Coll Cardiol 65, 1724-1733 (2015).
10. Celik, A., Ozdemir, E. & Karaca, M. Longitudinal trajectories of clinical knowledge performance in cardiology residency: a mixed-effects analysis with psychometric adjustment of routine assessments. BMC Med Res Methodol (2026), doi: 10.1186/s12874-026-02841-0.
11. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, et al. The STROBE reporting checklist. In The EQUATOR network reporting guideline platform (ed. Harwood J, A.C., Beyer J de, Schlüssel M, Collins G) (2025).
12. von Elm, E., et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med 147, 573-577 (2007).
13. Gallifant, J., et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med 31, 60-69 (2025).
14. Lievin, V., Hother, C.E., Motzfeldt, A.G. & Winther, O.
Can large language models reason about medical questions? Patterns (N Y) 5, 100943 (2024).
15. Yaneva, V., Baldwin, P., Jurich, D.P., Swygert, K. & Clauser, B.E. Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment. Acad Med 99, 192-197 (2024).
16. Siam, M.K., et al. Benchmarking large language models on the United States medical licensing examination for clinical reasoning and medical licensing scenarios. Sci Rep 16, 1387 (2025).
17. Zhuang, Y., Liu, Q., Pardos, Z.A., Kyllonen, P.C., Zu, J., Huang, Z., Wang, S. & Chen, E. Position: AI Evaluation Should Learn from How We Test Humans. arXiv preprint arXiv:2303.13375 (2025).
18. Schuwirth, L.W. & Van der Vleuten, C.P. Programmatic assessment: From assessment of learning to assessment for learning. Med Teach 33, 478-485 (2011).
19. Law, A.K., et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ 25, 208 (2025).
20. Hingorjo, M.R. & Jaleel, F. Analysis of one-best MCQs: the difficulty index, discrimination index and distractor efficiency. J Pak Med Assoc 62, 142-147 (2012).
21. Sun, L., et al. Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics. J Med Internet Res 27, e70901 (2025).
22. Garibaldi, R.A., Subhiyah, R., Moore, M.E. & Waxman, H. The In-Training Examination in Internal Medicine: an analysis of resident performance over time. Ann Intern Med 137, 505-510 (2002).
23. Katz, U., et al. GPT versus Resident Physicians — A Benchmark Based on Official Board Scores. NEJM AI 1 (2024).
24. Iftikhar, H., Anjum, S., Bhutta, Z.A., Najam, M. & Bashir, K. Performance of ChatGPT in emergency medicine residency exams in Qatar: A comparative analysis with resident physicians. Qatar Med J 2024, 61 (2024).
25. Singh AK, N.N., Verma VK, Prasanth PG. Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021–2025).
Journal of Contemporary Clinical Practice 11(12), 783-789 (2025).
26. Kwong, J.C.C., et al. APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support. JAMA Netw Open 6, e2335377 (2023).
27. Linde, P., et al. Psychometric properties and detectability of GPT-4o-generated multiple-choice questions compared with human-authored items across imaging specialties. NPJ Digit Med 9, 132 (2026).
28. Shelmerdine, S.C., et al. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ 379, e072826 (2022).
29. Wies, C., Hauser, K. & Brinker, T.J. Reply to: False conflict and false confirmation errors are crucial components of AI accuracy in medical decision making. Nat Commun 15, 6897 (2024).
30. Athaluri, S.A., et al. Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus 15, e37432 (2023).
31. Tavakol, M. & Dennick, R. Post-examination analysis of objective tests. Med Teach 33, 447-458 (2011).
32. Murias Quintana, E., et al. Improving the ability to discriminate medical multiple-choice questions through the analysis of the competitive examination to assign residency positions in Spain. BMC Med Educ 24, 367 (2024).

Tables

Table 1. Overall performance of AI models on cardiology residency in-service examination items

| Model | Run-level accuracy | Majority-vote accuracy | Primary-set accuracy | Supplement-set accuracy |
| Claude Opus 4.6 | 86.4% | 86.4% | 84.6% | 89.9% |
| Gemini 3.1 Flash Lite | 83.0% | 82.9% | 80.8% | 87.0% |
| GPT-5.4 | 81.7% | 82.4% | 80.8% | 85.5% |
| MedQwen | 52.9% | 53.3% | 47.7% | 63.8% |
| Qwen-3.5-35B | 18.0% | 18.6% | 20.0% | 15.9% |

Footnote: Values represent the proportion of correctly answered items. Accuracy was calculated on the text-only examination item set (n = 199), consisting of 130 primary and 69 supplementary items as defined in the Methods section.
Majority-vote accuracy represents the final response obtained from the majority prediction across repeated model runs.

Table 2. AI model performance relative to the distribution of resident examination scores

| Model | Accuracy | Equivalent resident percentile |
| Claude Opus 4.6 | 86.4% | 91.7 |
| Gemini 3.1 Flash Lite | 82.9% | 87.2 |
| GPT-5.4 | 82.4% | 87.0 |
| MedQwen | 53.3% | 28.2 |
| Qwen-3.5-35B | 18.6% | <1 |

Footnote: Resident percentile estimates were derived from the empirical distribution of correct response rates across cardiology residents participating in the in-service examination.

Table 3. AI accuracy across human-defined examination item difficulty levels

| Human-defined item difficulty | Claude Opus 4.6 | Gemini 3.1 Flash Lite | GPT-5.4 | MedQwen | Qwen-3.5-35B |
| Hard | 70.18% | 68.42% | 61.40% | 33.33% | 21.05% |
| Moderate | 87.93% | 79.31% | 87.93% | 43.10% | 15.52% |
| Easy | 96.43% | 95.24% | 92.86% | 73.81% | 19.05% |

Footnote: Difficulty categories were defined using tertiles of the resident correct response rate across examination items. Easy items correspond to the highest tertile of resident performance, whereas hard items correspond to the lowest tertile.

Table 4. Multivariable logistic regression analysis predicting AI correctness

| Variable | Claude Opus 4.6 OR (95% CI) | p-value | Gemini 3.1 Flash Lite OR (95% CI) | p-value | GPT-5.4 OR (95% CI) | p-value |
| Difficulty | 0.35 (0.18–0.63) | <0.001 | 0.41 (0.24–0.68) | <0.001 | 0.35 (0.21–0.59) | <0.001 |
| Discrimination | 3.25 (0.53–20.58) | 0.20 | 1.79 (0.34–9.26) | 0.49 | 1.30 (0.25–6.73) | 0.75 |
| Primary item | 0.76 (0.23–2.21) | 0.63 | 0.88 (0.31–2.27) | 0.80 | 1.13 (0.42–2.91) | 0.80 |

Footnote: Odds ratios represent the multiplicative change in the odds of an AI model answering an item correctly for a one-unit increase in the predictor. Difficulty was coded as an ordinal variable (easy to hard). Discrimination refers to the point-biserial discrimination index.

Table 5.
Correlation between human performance and AI correctness across examination items

| Model | Spearman ρ | p-value |
| Claude Opus 4.6 | 0.296 | <0.001 |
| Gemini 3.1 Flash Lite | 0.249 | <0.001 |
| GPT-5.4 | 0.295 | <0.001 |

Footnote: Spearman correlation coefficients quantify the association between resident correct response rate and AI correctness across examination items.

Additional Declarations

No competing interests reported.

Supplementary Files

figsupplementS1.png
Supplementary Figure S1. Human difficulty predicting AI success
Receiver operating characteristic (ROC) analysis evaluating the ability of human-defined item difficulty to predict correct responses by AI models. Item difficulty was derived from the proportion of residents answering each item correctly. The curves represent the discriminative performance of item difficulty in predicting AI success for each model. Area under the curve (AUC) values quantify how strongly human-derived difficulty predicts AI performance across examination items.

figsupplementS2.png
Supplementary Figure S2. Human–AI error overlap across models
Proportion of AI incorrect responses that matched the most frequently selected incorrect option among residents. Bars represent the percentage of overlapping error patterns between AI systems and human examinees. A higher overlap indicates greater similarity between AI and human error distributions across distractor options.

figsupplementS3.png
Supplementary Figure S3. AI accuracy across item discrimination levels
Accuracy of large language models stratified by item discrimination levels derived from point-biserial discrimination coefficients. Items were categorized into low, moderate, and high discrimination groups. The figure demonstrates that models with higher overall performance maintain relatively stable accuracy across discrimination strata, whereas smaller models show consistently lower performance across all categories.
supplementtables.docx

Status: Under Review (Version 1)

Editorial decision: Revision requested (19 Apr, 2026)
Reviews received at journal (19 Apr, 2026)
Reviews received at journal (14 Apr, 2026)
Reviewers agreed at journal (11 Apr, 2026)
Reviewers agreed at journal (10 Apr, 2026)
Reviews received at journal (10 Apr, 2026)
Reviewers agreed at journal (01 Apr, 2026)
Reviewers invited by journal (31 Mar, 2026)
Editor assigned by journal (31 Mar, 2026)
Submission checks completed at journal (30 Mar, 2026)
First submitted to journal (27 Mar, 2026)
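The two item-level alignment measures reported in Table 5 and Supplementary Figure S2 (Spearman's rank correlation between resident correct rates and AI correctness, and the share of AI errors that match the residents' most common distractor) can be sketched as follows. This is a minimal standard-library illustration, not the authors' analysis code; the function names and the toy inputs are hypothetical and chosen only for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def error_overlap(ai_answers, answer_key, top_human_distractor):
    """Share of incorrect AI answers equal to the residents' most common distractor.

    Returns None when the model made no errors, because the share is undefined.
    """
    wrong = [
        (ai, distractor)
        for ai, key, distractor in zip(ai_answers, answer_key, top_human_distractor)
        if ai != key
    ]
    if not wrong:
        return None
    return sum(ai == distractor for ai, distractor in wrong) / len(wrong)


if __name__ == "__main__":
    # Toy, made-up values for six items (not data from the study).
    human_rate = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2]  # resident correct rate per item
    ai_correct = [1, 1, 0, 1, 1, 0]              # 1 = AI consensus answer correct
    print(round(spearman_rho(human_rate, ai_correct), 3))
```

Because AI correctness is binary, most of its ranks are tied, which is why the sketch averages tied ranks rather than using the shortcut formula that assumes distinct values.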
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-9247601","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":616838157,"identity":"04912cb4-8d50-4462-b660-e5672d9a88d7","order_by":0,"name":"Aykan CELIK","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABHElEQVRIie3QPUvDQBjA8ec4aZaQrHeL+QoJgWBB8atcCGSLCF1FCwfpUjsn3yJF6Cwctsuha0WHgND5ptJBiher20U6Otx/Cs/lx70A2Gz/sUcANO4+HMRBAfmeQBseQzDiqPol7DgCGLs/E2B/CO9NbGh18x74GPP4ojy78lbTWLFrcQqOeGoMhD7nCW2Wm6jmiGdFSUZUyoSwUMTg5vnaQEIJCW0HQv+DuCgWJG3WxULfJU/HxE3MxNnSdi/YZUeGB/KgNLnrJ25C56XeRb9Yhg6k0Qc7Z9BDqHRHw3omokrvEt3vSVrLl21HorLnLp505q/TrQj8yeSD7ORtOluVmVKfJPAdsTQR3QkxzwfmcRdW/Ws2m81m030BarZtRPNkmz8AAAAASUVORK5CYII=","orcid":"","institution":"Izmir Atatürk Eğitim ve Araştırma Hastanesi","correspondingAuthor":true,"prefix":"","firstName":"Aykan","middleName":"","lastName":"CELIK","suffix":""},{"id":616838158,"identity":"3f3306f2-bd85-4145-bc5a-d8d0c5c2376a","order_by":1,"name":"Tuncay KIRIS","email":"","orcid":"","institution":"Izmir Kâtip Çelebi University","correspondingAuthor":false,"prefix":"","firstName":"Tuncay","middleName":"","lastName":"KIRIS","suffix":""},{"id":616838159,"identity":"7157cdd9-b472-444c-ab44-cf798ab2cc96","order_by":2,"name":"Ugur KOCABAS","email":"","orcid":"","institution":"Izmir Atatürk Eğitim ve Araştırma Hastanesi","correspondingAuthor":false,"prefix":"","firstName":"Ugur","middleName":"","lastName":"KOCABAS","suffix":""},{"id":616838160,"identity":"d3ed84b0-dba5-47de-bcc8-6e67c81e632a","order_by":3,"name":"Emre
OZDEMIR","email":"","orcid":"","institution":"Izmir Kâtip Çelebi University","correspondingAuthor":false,"prefix":"","firstName":"Emre","middleName":"","lastName":"OZDEMIR","suffix":""},{"id":616838161,"identity":"965d31ff-92eb-46eb-a255-a50dfef60169","order_by":4,"name":"Mustafa KARACA","email":"","orcid":"","institution":"Izmir Kâtip Çelebi University","correspondingAuthor":false,"prefix":"","firstName":"Mustafa","middleName":"","lastName":"KARACA","suffix":""}],"badges":[],"createdAt":"2026-03-27 18:23:56","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9247601/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9247601/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":106242821,"identity":"b10f7a8f-a0ee-4ec5-a62b-5f43ddbb5a14","added_by":"auto","created_at":"2026-04-06 15:21:29","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":405992,"visible":true,"origin":"","legend":"\u003cp\u003eAI performance across human-defined item difficulty levels\u003c/p\u003e\n\u003cp\u003ePerformance of large language models across examination items stratified by human-defined item difficulty. Item difficulty was categorized based on the proportion of residents answering each question correctly and grouped into three categories: hard, moderate, and easy. Points represent the mean accuracy of each AI model within each difficulty category, and error bars indicate 95% confidence intervals. 
Across all models, accuracy increased progressively from hard to easy items, indicating that AI performance follows the same difficulty gradient observed in human examination performance.\u003c/p\u003e","description":"","filename":"fig1.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/75f6638dec8041a4781841da.png"},{"id":106242824,"identity":"ce9a74c6-6993-472d-afcd-81b51b40c539","added_by":"auto","created_at":"2026-04-06 15:21:29","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":412930,"visible":true,"origin":"","legend":"\u003cp\u003eAlignment between human and AI performance across exam items\u003c/p\u003e\n\u003cp\u003eItem-level relationship between human examination performance and AI accuracy. Each point represents a single examination item. The x-axis shows the proportion of residents who answered the item correctly (human correct rate), and the y-axis indicates the probability of a correct response by the AI model. The smoothed curves represent locally weighted regression fits illustrating the relationship between human and AI performance. Across models, items that were easier for residents were also more likely to be answered correctly by AI systems, demonstrating alignment between human and AI knowledge performance. 
Because AI responses were coded as binary outcomes (correct vs incorrect), the observations appear clustered at the extremes of the y-axis.\u003c/p\u003e","description":"","filename":"fig2.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/926db7fad946a32a91392b8d.png"},{"id":106242823,"identity":"b3c76266-8107-453f-ba8b-a30934ba8d4d","added_by":"auto","created_at":"2026-04-06 15:21:29","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":353065,"visible":true,"origin":"","legend":"\u003cp\u003eTopology of human–AI error alignment\u003c/p\u003e\n\u003cp\u003eHeatmap visualization of the alignment between human and AI error patterns across distractor options. Columns represent the most frequently selected incorrect option among residents (human distractor), and rows represent the distractor chosen by the AI model when it produced an incorrect answer. Cell values indicate the number of items for which the AI model selected a specific distractor when the corresponding human distractor was the most commonly selected incorrect option. Cells along the diagonal indicate cases in which the AI models selected the same distractor option most frequently chosen by residents, representing concordant error patterns.\u003c/p\u003e","description":"","filename":"fig3.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/89b0b79d99914c5024907599.png"},{"id":106242829,"identity":"8896ea62-2a8a-4b5c-962f-cd6f85dcb06b","added_by":"auto","created_at":"2026-04-06 15:21:29","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":289429,"visible":true,"origin":"","legend":"\u003cp\u003eAI error topology across distractor options\u003c/p\u003e\n\u003cp\u003eDistribution of incorrect answer selections across distractor options for each AI model. 
Each cell represents the number of times a specific distractor option (A–E) was selected when the model produced an incorrect response. Color intensity corresponds to the frequency of selection. More advanced models showed relatively balanced distractor distributions, whereas smaller models demonstrated stronger clustering toward specific distractor options, suggesting weaker discrimination among plausible alternatives.\u003c/p\u003e","description":"","filename":"fig4.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/bebf0148ed2284f18753ab0b.png"},{"id":106959519,"identity":"82ce17d4-67cb-49f2-bd42-8bc33f35ec96","added_by":"auto","created_at":"2026-04-15 09:10:55","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1811089,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/2df53788-cfed-471c-bd12-aec509e18012.pdf"},{"id":106403634,"identity":"6f76acd4-2422-4299-ac68-281623dd0e83","added_by":"auto","created_at":"2026-04-08 09:14:39","extension":"png","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":161548,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSupplementary Figure S1.\u003c/strong\u003e Human difficulty predicting AI success\u003c/p\u003e\n\u003cp\u003eReceiver operating characteristic (ROC) analysis evaluating the ability of human-defined item difficulty to predict correct responses by AI models. Item difficulty was derived from the proportion of residents answering each item correctly. The curves represent the discriminative performance of item difficulty in predicting AI success for each model. 
Area under the curve (AUC) values quantify how strongly human-derived difficulty predicts AI performance across examination items.\u003c/p\u003e","description":"","filename":"figsupplementS1.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/7ff637699ef3c7b89b68595b.png"},{"id":106242822,"identity":"e6d7d9c1-f274-46d5-9e9c-25331c5b711e","added_by":"auto","created_at":"2026-04-06 15:21:29","extension":"png","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":165498,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSupplementary Figure S2.\u003c/strong\u003e Human–AI error overlap across models\u003c/p\u003e\n\u003cp\u003eProportion of AI incorrect responses that matched the most frequently selected incorrect option among residents. Bars represent the percentage of overlapping error patterns between AI systems and human examinees. A higher overlap indicates greater similarity between AI and human error distributions across distractor options.\u003c/p\u003e","description":"","filename":"figsupplementS2.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/89c0e7339574ef9aa9b1316b.png"},{"id":106242827,"identity":"f9a18b00-1138-4a8e-89e1-e36505eadb68","added_by":"auto","created_at":"2026-04-06 15:21:29","extension":"png","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":98674,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSupplementary Figure S3.\u003c/strong\u003e AI accuracy across item discrimination levels\u003c/p\u003e\n\u003cp\u003eAccuracy of large language models stratified by item discrimination levels derived from point-biserial discrimination coefficients. Items were categorized into low, moderate, and high discrimination groups. 
The figure demonstrates that models with higher overall performance maintain relatively stable accuracy across discrimination strata, whereas smaller models show consistently lower performance across all categories.\u003c/p\u003e","description":"","filename":"figsupplementS3.png","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/366a488c57481f15879af917.png"},{"id":106403334,"identity":"106eb9d5-e27a-4d8b-becf-8ca5871c5fdb","added_by":"auto","created_at":"2026-04-08 09:14:05","extension":"docx","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":19870,"visible":true,"origin":"","legend":"","description":"","filename":"supplementtables.docx","url":"https://assets-eu.researchsquare.com/files/rs-9247601/v1/a61e9ed142d6c3a55bbe8c8e.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Psychometric Alignment Between Human and Artificial Intelligence Performance in Cardiology Residency In-Service Examinations","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eLarge language models (LLMs) have rapidly advanced in their ability to perform knowledge-intensive tasks across multiple domains\u003csup\u003e1\u003c/sup\u003e. In medicine, these models have demonstrated promising capabilities in clinical reasoning, diagnostic support, and medical question answering\u003csup\u003e2\u003c/sup\u003e. Several recent studies have reported that frontier LLMs can approach or even exceed human-level performance on standardized medical examinations. Consequently, professional examination datasets have increasingly been used as benchmarks for evaluating the capabilities of artificial intelligence systems in medicine\u003csup\u003e3,4\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eDespite these developments, most existing evaluations focus primarily on overall accuracy or comparisons with passing thresholds. 
Although these metrics provide a general measure of model performance, they offer limited insight into how AI systems interact with the psychometric structure of examination questions\u003csup\u003e5\u003c/sup\u003e. Medical examinations are carefully constructed educational instruments grounded in measurement theory, in which characteristics such as item difficulty and item discrimination play a central role in determining test performance\u003csup\u003e6,7\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eUnderstanding whether AI models respond to these psychometric characteristics in ways similar to human examinees is important for several reasons. First, it provides a more nuanced interpretation of AI performance beyond simple accuracy metrics. Second, it allows examination datasets to serve not only as benchmarks but also as tools for studying the interaction between AI systems and structured knowledge assessments. Finally, exploring human\u0026ndash;AI alignment in examination performance may offer insights into whether AI models encounter conceptual challenges similar to those faced by medical residents\u003csup\u003e8\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eCardiology residency in-service examinations provide a particularly suitable setting for this investigation\u003csup\u003e9\u003c/sup\u003e. These assessments are designed to evaluate the breadth of cardiology knowledge among residents and typically include questions spanning pathophysiology, diagnostics, and clinical management. 
Because these examinations are constructed using established psychometric principles, they provide a structured framework for evaluating the relationship between AI performance and the difficulty and discriminative properties of exam items.\u003c/p\u003e\n\u003cp\u003eIn this study, we evaluated the performance of multiple large language models on a dataset of cardiology residency in-service examination questions and examined how AI responses relate to the psychometric characteristics of these items. Specifically, we investigated three key questions:\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;1.\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;how frontier LLM performance compares with medically oriented open-source models;\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;2.\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;whether AI accuracy follows human-defined item difficulty gradients; and\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;3.\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;whether AI errors align with distractor patterns observed among human examinees.\u003c/p\u003e\n\u003cp\u003eThe examination items used in this study were derived from a validated institutional cardiology in-service examination program, in which all questions had previously undergone formal psychometric evaluation (including item difficulty and discrimination analyses) and were administered within an accredited cardiology residency training program.\u003c/p\u003e\n\u003cp\u003eThis study aims to provide a deeper understanding of how large language models interact with structured medical knowledge assessments by integrating AI benchmarking with psychometric analysis of examination 
items.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cp\u003e\u003cstrong\u003eStudy design and dataset\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study evaluated the performance of large language models on a cardiology knowledge assessment derived from cardiology residency in-service examinations. These examinations are routinely administered within residency training programs to assess knowledge acquisition and identify educational gaps among residents.\u003c/p\u003e\n\u003cp\u003eThe study dataset consisted of 199 multiple-choice cardiology examination items, each containing five answer options (A\u0026ndash;E). For each item, the correct answer and aggregated response distributions from cardiology residents were available.\u003c/p\u003e\n\u003cp\u003eAll items originated from an institutional cardiology in-service examination program conducted within an accredited cardiology residency training program recognized by the national cardiology training authority. As part of the institutional examination quality assurance process, all questions had previously undergone formal psychometric evaluation based on classical test theory, including item difficulty indices, point-biserial discrimination coefficients, and distractor functionality analysis, as detailed in our prior longitudinal analysis of resident performance\u003csup\u003e10\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eThe examination dataset was organized into two predefined subsets. The primary set consisted of the core items used for the main psychometric analysis of the cardiology residency in-service examination. The supplementary set comprised additional items that met the inclusion criteria and that were analyzed separately to assess the robustness and generalizability of model performance. 
The text-only dataset included 199 items, consisting of 130 primary and 69 supplementary items.\u003c/p\u003e\n\u003cp\u003eThese data enabled the characterization of each item using classical psychometric parameters, including item difficulty and item discrimination, and allowed the examination of the relationship between AI performance and human-derived psychometric item properties, as well as distractor selection patterns among human examinees who had previously completed the same examination.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHuman examination performance and psychometric metrics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eItem difficulty was defined as the proportion of residents answering the question correctly. Based on this metric, examination items were categorized into three human-defined difficulty levels (terciles based on the distribution of resident correct response rates).\u003c/p\u003e\n\u003cp\u003eItem discrimination was assessed using the point-biserial correlation coefficient, a standard measure in classical test theory that reflects how well a question differentiates between high- and low-performing examinees.\u003c/p\u003e\n\u003cp\u003eIn addition, the most frequently selected incorrect option among residents was identified for each question to enable comparison of distractor selection between human examinees and AI models.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI models evaluated\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFive large language models were evaluated:\u003c/p\u003e\n\u003cp\u003eClaude 4.6 Opus (Anthropic, San Francisco, CA, USA; API version March 2026)\u003c/p\u003e\n\u003cp\u003eGemini 3.1 Flash-Lite (Google LLC, Mountain View, CA, USA; API version March 2026)\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eGPT-5.4 (OpenAI, San Francisco, CA, USA; API version March 2026)\u003c/p\u003e\n\u003cp\u003eQwen-3.5-35b-a3b (Alibaba Cloud, Hangzhou, China)\u003c/p\u003e\n\u003cp\u003eMedQwen-2.5-32B-i1 (Alibaba Cloud, 
Hangzhou, China)\u003c/p\u003e\n\u003cp\u003eThe first three represent frontier general-purpose large language models, whereas the latter two are open-source models with a medical orientation.\u003c/p\u003e\n\u003cp\u003eAll models were accessed programmatically via their respective Application Programming Interfaces (APIs) between March 9 and March 13, 2026.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI prompting protocol and data processing\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll questions were presented to each model using a standardized zero-shot prompt format. The models were instructed to select a single answer from the five available options using the following system prompt:\u003c/p\u003e\n\u003cp\u003e\u0026ldquo;You are answering a cardiology multiple choice question. Reply with only A, B, C, D, or E. No explanation.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eModel generation parameters were standardized across all API calls with a temperature of 0.2 and a maximum output length of 32 tokens to restrict responses to the predefined options.\u003c/p\u003e\n\u003cp\u003eTo reduce stochastic variability and assess response consistency, each question was queried independently five times per model (runs = 5). 
The final model answer was determined using a majority voting strategy across runs. The option selected most frequently (\u0026ge;3 of 5 runs) was designated as the consensus response. In the rare event of a tie or failure to reach a majority, the response for that question was classified as indeterminate.\u003c/p\u003e\n\u003cp\u003eThe overall accuracy of each model was subsequently calculated by comparing the consensus answers with the established gold-standard answer key.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEvaluation metrics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eModel performance was evaluated using several complementary approaches:\u003c/p\u003e\n\u003cp\u003eOverall accuracy, defined as the proportion of correctly answered questions.\u003c/p\u003e\n\u003cp\u003eAccuracy across human-defined difficulty levels, examining how AI performance varied across easy, moderate, and hard questions.\u003c/p\u003e\n\u003cp\u003eHuman\u0026ndash;AI performance alignment across items, assessed using Spearman\u0026rsquo;s rank correlation between human correct rates and AI accuracy.\u003c/p\u003e\n\u003cp\u003eHuman\u0026ndash;AI error overlap, defined as the proportion of incorrect AI responses that matched the most common distractor selected by residents.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEquivalent resident performance estimation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo provide an interpretable clinical benchmark, AI model accuracy was mapped to the empirical distribution of resident examination scores. The distribution of correct response rates among cardiology residents who had previously completed the examination was used as the reference population, a cohort whose learning trajectories and assessment reliability have been previously established\u003csup\u003e10\u003c/sup\u003e. 
For each AI model, the overall accuracy across the 199 examination items was calculated and positioned within this distribution to estimate the equivalent resident percentile.\u003c/p\u003e\n\u003cp\u003eThis analysis provides an approximate comparison of AI model performance relative to human examinees but does not imply that AI models replicate the full cognitive processes of clinical residents.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStatistical analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCategorical variables, including the frequency of correct, incorrect, and indeterminate responses, were expressed as counts and percentages. To compare accuracy differences across the evaluated models, the Chi-square test or Fisher\u0026apos;s exact test was used, as appropriate.\u003c/p\u003e\n\u003cp\u003eAssociations between the psychometric characteristics of exam items and AI correctness were evaluated using logistic regression models with AI correctness (correct vs. incorrect) as the dependent variable. The independent variables included item difficulty (ordinal variable), item discrimination (point-biserial correlation), and question category (primary versus supplementary item).\u003c/p\u003e\n\u003cp\u003eSpearman correlation analysis was used to assess the alignment between human and AI performance across examination items.\u003c/p\u003e\n\u003cp\u003eTo investigate the structure of AI errors, we examined the distribution of incorrect answer choices selected by AI models across all examination items. For each item, the most frequently selected incorrect option among residents was identified as the \u003cem\u003etop human distractor\u003c/em\u003e. AI responses were then compared with this reference to determine whether models selected the same distractor as residents when answering incorrectly. Two complementary analyses were performed. 
First, the overall distribution of distractor selections across models was summarized to characterize the AI error topology. Second, a cross-tabulated topology matrix was constructed comparing the AI-selected distractor with the most common human distractor for each item. Heatmaps were used to visualize these relationships, allowing the identification of alignment patterns between human and AI error behavior.\u003c/p\u003e\n\u003cp\u003eAll analyses were performed using the R programming language (R Foundation for Statistical Computing, Vienna, Austria, version 4.5.2) within the RStudio environment version 2025.09.2 (Posit PBC, Boston, MA, USA) and Microsoft Excel for Mac, version 16.107 (Microsoft Corporation, Redmond, WA, USA). A two-sided p \u0026lt; 0.05 was considered statistically significant.\u003c/p\u003e\n\u003cp\u003eThis cross-sectional in silico observational study was designed and reported in accordance with the STROBE guidelines for observational studies and adhered to the TRIPOD-LLM reporting guideline for the rigorous evaluation of large language models\u003csup\u003e11-13\u003c/sup\u003e.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003e\u003cstrong\u003eOverall performance of AI models on cardiology residency in-service examination items\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe final analysis dataset consisted of 199 text-based examination items (130 primary and 69 supplementary), each evaluated across five large language models. Frontier general-purpose models substantially outperformed medically oriented open-source models. Claude Opus achieved the highest majority-vote accuracy (86.4%), followed by Gemini Flash-Lite (82.9%) and GPT-5.4 (82.4%). In contrast, MedQwen achieved an accuracy of 53.3% and Qwen-3.5-35B an accuracy of 18.6% (Table 1).\u003c/p\u003e\n\u003cp\u003ePerformance was consistently higher on supplementary items than on primary items across all frontier models. 
Claude Opus achieved 84.6% accuracy on primary items and 89.9% on supplementary items; Gemini Flash-Lite achieved 80.8% and 87.0%, respectively; and GPT-5.4 achieved 80.8% and 85.5%, respectively. MedQwen also showed higher performance on supplementary items (63.8%) than on primary items (47.7%), whereas Qwen-3.5-35B remained poor across both sets (20.0% on primary vs. 15.9% on supplementary items).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEquivalent resident performance comparison\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo contextualize AI performance relative to human examinees, model accuracies were mapped to the empirical distribution of cardiology resident examination scores. Based on this comparison, Claude Opus performance corresponded approximately to the 91.7th percentile of resident performance, whereas Gemini Flash-Lite and GPT-5.4 corresponded to the 87th percentile range. In contrast, MedQwen performance corresponded approximately to the 28th percentile, while Qwen-3.5-35B fell below the 1st percentile of resident performance (Table 2).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI performance across human-defined item difficulty levels\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo assess whether AI performance followed the same item difficulty structure observed in human examinees, questions were categorized into human-defined difficulty groups based on resident correct rates. This analysis demonstrated a clear and progressive accuracy gradient across all frontier models (Figure 1).\u003c/p\u003e\n\u003cp\u003eClaude Opus achieved accuracies of 70.18% on hard items, 87.93% on moderately difficult items, and 96.43% on easy items. Gemini Flash-Lite showed a similar pattern, with accuracies of 68.42%, 79.31%, and 95.24%, respectively. 
GPT-5.4 also followed this gradient, increasing from 61.40% on hard items to 87.93% on moderate items and 92.86% on easy items (Table 3).\u003c/p\u003e\n\u003cp\u003eThis pattern indicates that AI models are sensitive to the same difficulty gradients encountered by cardiology residents.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePsychometric predictors of AI performance\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMultivariable logistic regression was performed to identify independent psychometric predictors of correct AI responses among the frontier models (Table 4). Across all three models, item difficulty emerged as the only consistent and statistically significant determinant of performance.\u003c/p\u003e\n\u003cp\u003eFor Claude Opus, increasing difficulty was associated with lower odds of a correct response (OR 0.35, 95% CI 0.18\u0026ndash;0.63, p \u0026lt; 0.001). Similar associations were observed for Gemini Flash-Lite (OR 0.41, 95% CI 0.24\u0026ndash;0.68, p \u0026lt; 0.001) and GPT-5.4 (OR 0.35, 95% CI 0.21\u0026ndash;0.59, p \u0026lt; 0.001).\u003c/p\u003e\n\u003cp\u003eItem discrimination showed a positive but non-significant association with AI accuracy in all three models. Likewise, primary versus supplementary item status was not independently associated with performance after adjustment.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAlignment between human and AI performance across items\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAt the item level, human and AI performance showed significant positive correlations across all frontier models (Figure 2). The Spearman correlation coefficients were \u0026rho; = 0.296 for Claude Opus (p \u0026lt; 0.001), \u0026rho; = 0.249 for Gemini Flash-Lite (p \u0026lt; 0.001), and \u0026rho; = 0.295 for GPT-5.4 (p \u0026lt; 0.001) (Table 5).\u003c/p\u003e\n\u003cp\u003eThe receiver operating characteristic analysis further demonstrated that human-defined item difficulty moderately predicted AI correctness. 
The area under the curve was 0.749 (95% CI 0.657\u0026ndash;0.842) for Claude Opus, 0.691 (95% CI 0.602\u0026ndash;0.780) for Gemini Flash-Lite, and 0.723 (95% CI 0.631\u0026ndash;0.816) for GPT-5.4 (Supplementary Table S1, Figure S1). Pairwise comparisons of AUC values using DeLong tests did not demonstrate statistically significant differences between the frontier models (Supplementary Table S2).\u003c/p\u003e\n\u003cp\u003eThese findings indicate that questions that were easier for human residents also tended to be answered correctly more often by AI models, whereas questions that were difficult for residents also tended to challenge frontier LLMs.\u003c/p\u003e\n\u003cp\u003eConsistent with this observation, ordinal trend models across human-defined difficulty groups were significant for all three frontier models. Increasing human-defined ease was associated with higher AI success for Claude Opus (\u0026beta; 1.02, p = 0.0007), Gemini Flash-Lite (\u0026beta; 0.88, p = 0.0008), and GPT-5.4 (\u0026beta; 1.06, p = 0.00009).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHuman\u0026ndash;AI error overlap\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWhen models answered incorrectly, they often selected distractors similar to those chosen by human examinees. 
The proportion of AI errors that matched the most common human distractor was highest for Gemini Flash-Lite (52.94%) and Claude Opus (48.15%), followed by MedQwen (43.01%), GPT-5.4 (31.43%), and Qwen-3.5-35B (29.01%) (Supplementary Table S3, Figure S2).\u003c/p\u003e\n\u003cp\u003eThese findings suggest that at least some AI systems do not merely fail randomly but instead exhibit error patterns that partially overlap with human reasoning errors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDiscrimination-stratified performance\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWhen items were stratified by point-biserial discrimination terciles, frontier models showed numerically higher accuracy on highly discriminative items than on low-discrimination items. Claude Opus accuracy increased from 83.61% in the low-discrimination group to 93.42% in the high-discrimination group, while Gemini Flash-Lite increased from 86.89% to 88.16% and GPT-5.4 from 81.97% to 86.84% (Supplementary Table S4, Figure S3). In contrast, open-source models demonstrated lower and less consistent performance across discrimination strata.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTopology of AI errors across distractor options\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAn analysis of incorrect responses revealed structured patterns in AI error behavior across distractor options (Figure 3). Across models, incorrect answers were not randomly distributed but tended to cluster around specific distractors. Frontier models demonstrated relatively balanced error distributions across distractor options, whereas smaller models showed pronounced concentrations on specific incorrect choices. 
Notably, Qwen-3.5-35B exhibited strong clustering toward a limited subset of distractors, suggesting reduced discrimination between plausible incorrect alternatives.\u003c/p\u003e\n\u003cp\u003eWhen comparing AI-selected distractors with the most frequently chosen incorrect options among residents, substantial alignment was observed (Figure 4). Diagonal dominance within the topology matrices indicated that the AI models frequently selected the same distractors that misled residents. This pattern was most pronounced for Gemini Flash-Lite and Claude Opus, whereas GPT-5.4 demonstrated moderate alignment. The mean distractor concordance across models is reported in Supplementary Table S5.\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003e\u003cstrong\u003ePrincipal findings\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn this study, we evaluated the performance of several large language models on cardiology residency in-service examination items and examined how AI responses relate to the psychometric characteristics of these questions. Three main findings emerged.\u003c/p\u003e\n\u003cp\u003eFirst, frontier general-purpose models substantially outperformed medically oriented open-source models\u003csup\u003e3\u003c/sup\u003e. Claude Opus, Gemini Flash-Lite, and GPT-5.4 achieved accuracies exceeding 80%, whereas MedQwen and Qwen-3.5-35B demonstrated markedly lower performance. This finding suggests that scale, training diversity, and general reasoning capabilities of frontier models may currently play a more important role than domain-specific fine-tuning in determining performance on complex medical knowledge tasks.\u003c/p\u003e\n\u003cp\u003eSecond, AI performance closely followed human-defined item difficulty gradients. Across all frontier models, accuracy increased consistently from hard to easy questions. Multivariable analyses further confirmed that item difficulty was the only psychometric characteristic independently associated with AI success. 
These findings indicate that large language models are sensitive to the same knowledge difficulty structure encountered by human residents.\u003c/p\u003e\n\u003cp\u003eThird, we observed a measurable alignment between human and AI performance at the item level. Questions that were easier for cardiology residents were also more likely to be answered correctly by AI systems. Moreover, when AI models answered incorrectly, they frequently selected the same distractor options chosen by residents. This partial overlap suggests that AI systems may encounter conceptual challenges similar to those faced by human learners when dealing with complex clinical knowledge questions\u003csup\u003e14\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRelationship to prior literature\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMost prior evaluations have focused on benchmark accuracy or passing thresholds rather than examining how AI performance interacts with the psychometric structure of examination items\u003csup\u003e15\u003c/sup\u003e. While these studies demonstrate the impressive capabilities of frontier models, they often provide limited insight into how AI performance relates to the psychometric structure of examination items\u003csup\u003e16,17\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eOur findings extend this literature by integrating classical test theory metrics, including item difficulty and discrimination, into the evaluation of AI systems\u003csup\u003e18,19\u003c/sup\u003e. 
Thus, we demonstrate that AI performance is not randomly distributed across questions but instead follows the same difficulty gradients embedded in the examination structure\u003csup\u003e15,20\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eThis approach provides a more informative framework for interpreting AI performance in medical knowledge assessments\u003csup\u003e21\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBenchmarking AI performance against resident examination scores\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo provide an interpretable human benchmark, we mapped AI model accuracy onto the empirical distribution of resident examination scores\u003csup\u003e22\u003c/sup\u003e. Using this approach, the performance of frontier models corresponded approximately to the upper range of resident performance, with Claude Opus aligning with the 91.7th percentile and both Gemini Flash-Lite and GPT-5.4 aligning with approximately the 87th percentile of cardiology residents. In contrast, medically oriented open-source models corresponded to substantially lower positions within the resident performance distribution. This comparison should be interpreted as a benchmarking exercise rather than an indication that AI systems replicate the reasoning processes of clinical residents\u003csup\u003e23\u003c/sup\u003e. In our prior evaluation of this cohort, we demonstrated a progressive consolidation of clinical knowledge over time, with mean resident performance improving from approximately 41% in the early training period to 73% in the late training period\u003csup\u003e10\u003c/sup\u003e. The overall accuracies achieved by frontier LLMs (82.4%\u0026ndash;86.4%) therefore place their performance at or above the level expected of senior cardiology residents completing their training. 
Nonetheless, these findings illustrate that contemporary frontier LLMs can achieve examination-level knowledge performance approaching the performance range of high-performing residents when evaluated on structured cardiology assessment items\u003csup\u003e21\u003c/sup\u003e. This comparison provides an intuitive benchmark for interpreting AI examination performance within the context of human training outcomes\u003csup\u003e24\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eImplications for medical education and AI evaluation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe results of this study have several implications for both medical education and AI benchmarking.\u003c/p\u003e\n\u003cp\u003eFirst, they suggest that accuracy alone may be insufficient to characterize AI performance on professional examinations. Exam items vary in their psychometric characteristics, and these differences strongly influence both human and AI performance\u003csup\u003e25,26\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eConsistent with this interpretation, the ROC analysis demonstrated that human-defined difficulty moderately predicted AI correctness across frontier models, suggesting that AI performance is structured by the same psychometric gradients that influence human examination outcomes\u003csup\u003e27\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eSecond, the observed alignment between human and AI performance indicates that large language models may reflect similar knowledge gradients present in medical curricula and examination frameworks. 
This alignment could make psychometrically structured examination datasets a useful tool for evaluating the educational relevance of AI systems\u003csup\u003e28\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eThird, the overlap between human and AI error patterns raises interesting questions regarding how these models process clinical knowledge and whether they encounter conceptual pitfalls similar to those experienced by human residents\u003csup\u003e29\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInterpretation of human\u0026ndash;AI error topology\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe examination of distractor topology provided additional insight into the cognitive behavior of AI models when answering complex clinical questions. Rather than producing random errors, AI models frequently selected the same distractors that were most commonly chosen by residents\u003csup\u003e30\u003c/sup\u003e. We have previously observed that human errors in these specific examinations tend to cluster around dominant distractors due to systematic conceptual misunderstandings\u003csup\u003e10\u003c/sup\u003e. This alignment suggests that errors produced by large language models may reflect similar cognitive traps embedded within question design, such as partially correct clinical reasoning or misleading contextual cues\u003csup\u003e31\u003c/sup\u003e. In contrast, smaller models demonstrated more irregular error distributions, indicating weaker discrimination among distractor options. 
These findings imply that advanced AI systems may process question semantics in ways that resemble human reasoning patterns, even when the final answer is incorrect.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLimitations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSeveral limitations should be considered when interpreting these findings.\u003c/p\u003e\n\u003cp\u003eFirst, the analysis was based on a single-specialty examination dataset derived from cardiology residency in-service assessments. Although cardiology questions often require the integration of pathophysiology, diagnostics, and management principles, the results may not generalize to other medical disciplines.\u003c/p\u003e\n\u003cp\u003eSecond, AI responses were evaluated using majority voting across multiple runs. Although this approach reduces stochastic variability in model outputs, it may not fully capture the range of possible responses generated by large language models.\u003c/p\u003e\n\u003cp\u003eThird, although distractor analysis provides insight into AI error patterns, it does not directly reveal the internal reasoning processes used by the models.\u003c/p\u003e\n\u003cp\u003eIn addition, because large language models are trained on large-scale internet corpora, it is possible that exposure to examination-style questions or similar educational material may have influenced model performance. 
However, the specific examination items used in this study were institutional and not publicly available, which reduces but does not entirely eliminate this possibility.\u003c/p\u003e\n\u003cp\u003eFinally, because the examination items were text-based multiple-choice questions, the results may not directly translate to clinical decision-making tasks in real-world settings.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFuture directions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFuture research should extend the psychometric evaluation of AI systems across multiple specialties and examination formats\u003csup\u003e32\u003c/sup\u003e. In addition, integrating more advanced psychometric frameworks such as item response theory may provide further insight into how AI models interact with structured knowledge assessments\u003csup\u003e17\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eUnderstanding how AI systems respond to educational measurement frameworks may help guide the development of more robust benchmarks for evaluating medical AI systems\u003csup\u003e21\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eFuture studies should also investigate AI performance using longitudinal educational datasets to determine whether models capture evolving knowledge structures within medical training.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eLarge language models demonstrate strong performance on cardiology residency in-service examination questions and exhibit meaningful alignment with human-defined item difficulty and performance patterns. These findings suggest that AI performance on medical examinations is structured by the same psychometric characteristics that shape human assessment outcomes. 
Integrating AI benchmarking with psychometric analysis provides a promising framework for evaluating future AI systems in medical education and knowledge assessment.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData Availability Statement:\u003c/strong\u003e All relevant data are included in the manuscript or are available from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate:\u003c/strong\u003e The study protocol was approved by the Institutional Review Board of Izmir Katip Celebi University (Approval Number: 0828, Date: 15.01.2026) and was conducted in accordance with the Declaration of Helsinki.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePatient Consent Statement:\u003c/strong\u003e Because the study used anonymized, routinely collected educational assessment data and did not involve patient information, the requirement for individual informed consent was waived by the ethics committee.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions:\u003c/strong\u003e Concept \u0026ndash; A.\u0026Ccedil;.; design \u0026ndash; A.\u0026Ccedil;.; supervision \u0026ndash; T.K., E.\u0026Ouml;., M.K.; resources and materials \u0026ndash; A.\u0026Ccedil;., T.K., U.K., E.\u0026Ouml;., M.K.; data collection and processing \u0026ndash; A.\u0026Ccedil;., M.K.; analysis and interpretation \u0026ndash; A.\u0026Ccedil;.; literature search \u0026ndash; A.\u0026Ccedil;., U.K., M.K.; writing \u0026ndash; A.\u0026Ccedil;., T.K., U.K., E.\u0026Ouml;., M.K.; critical review \u0026ndash; T.K., E.\u0026Ouml;., M.K.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of Interest Disclosure:\u003c/strong\u003e The authors have no conflicts of interest to declare.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments:\u003c/strong\u003e During the preparation of this work, the authors used ChatGPT-5.4 (OpenAI, San Francisco, CA, USA) to check 
for grammar and spelling to improve readability. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding Statement:\u003c/strong\u003e The authors declare that they have no financial support.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eThirunavukarasu, A.J.\u003cem\u003e, et al.\u003c/em\u003e Large language models in medicine. \u003cem\u003eNat Med\u003c/em\u003e \u003cstrong\u003e29\u003c/strong\u003e, 1930-1940 (2023).\u003c/li\u003e\n\u003cli\u003eSinghal, K.\u003cem\u003e, et al.\u003c/em\u003e Large language models encode clinical knowledge. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e620\u003c/strong\u003e, 172-180 (2023).\u003c/li\u003e\n\u003cli\u003eNori, H., King, N., McKinney, S.M., Carignan, D. \u0026amp; Horvitz, E. Capabilities of GPT-4 on medical challenge problems. \u003cem\u003earXiv preprint arXiv:2303.13375\u003c/em\u003e (2023).\u003c/li\u003e\n\u003cli\u003eKung, T.H.\u003cem\u003e, et al.\u003c/em\u003e Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. \u003cem\u003ePLOS Digit Health\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, e0000198 (2023).\u003c/li\u003e\n\u003cli\u003eChang, Y.\u003cem\u003e, et al.\u003c/em\u003e A Survey on Evaluation of Large Language Models. \u003cem\u003eACM Transactions on Intelligent Systems and Technology\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 1-45 (2024).\u003c/li\u003e\n\u003cli\u003eDowning, S.M. Validity: on meaningful interpretation of assessment data. \u003cem\u003eMed Educ\u003c/em\u003e \u003cstrong\u003e37\u003c/strong\u003e, 830-837 (2003).\u003c/li\u003e\n\u003cli\u003eEpstein, R.M. Assessment in medical education. 
\u003cem\u003eN Engl J Med\u003c/em\u003e \u003cstrong\u003e356\u003c/strong\u003e, 387-396 (2007).\u003c/li\u003e\n\u003cli\u003eLalor, J.P., Wu, H. \u0026amp; Yu, H. Building an Evaluation Scale using Item Response Theory. \u003cem\u003eProc Conf Empir Methods Nat Lang Process\u003c/em\u003e \u003cstrong\u003e2016\u003c/strong\u003e, 648-657 (2016).\u003c/li\u003e\n\u003cli\u003eHalperin, J.L., Williams, E.S. \u0026amp; Fuster, V. COCATS 4 Introduction. \u003cem\u003eJ Am Coll Cardiol\u003c/em\u003e \u003cstrong\u003e65\u003c/strong\u003e, 1724-1733 (2015).\u003c/li\u003e\n\u003cli\u003eCelik, A., Ozdemir, E. \u0026amp; Karaca, M. Longitudinal trajectories of clinical knowledge performance in cardiology residency a mixed-effects analysis with psychometric adjustment of routine assessments. \u003cem\u003eBMC Med Res Methodol\u003c/em\u003e (2026), doi: 10.1186/s12874-026-02841-0.\u003c/li\u003e\n\u003cli\u003eElm E von, A.D., Egger M, Pocock SJ, G\u0026oslash;tzsche PC, Vandenbroucke JP, et al. . The STROBE reporting checklist. in \u003cem\u003eThe EQUATOR network reporting guideline platform\u003c/em\u003e (ed. Harwood J, A.C., Beyer J de, Schl\u0026uuml;ssel M, Collins G,) (2025).\u003c/li\u003e\n\u003cli\u003evon Elm, E.\u003cem\u003e, et al.\u003c/em\u003e The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. \u003cem\u003eAnn Intern Med\u003c/em\u003e \u003cstrong\u003e147\u003c/strong\u003e, 573-577 (2007).\u003c/li\u003e\n\u003cli\u003eGallifant, J.\u003cem\u003e, et al.\u003c/em\u003e The TRIPOD-LLM reporting guideline for studies using large language models. \u003cem\u003eNat Med\u003c/em\u003e \u003cstrong\u003e31\u003c/strong\u003e, 60-69 (2025).\u003c/li\u003e\n\u003cli\u003eLievin, V., Hother, C.E., Motzfeldt, A.G. \u0026amp; Winther, O. Can large language models reason about medical questions? 
\u003cem\u003ePatterns (N Y)\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 100943 (2024).\u003c/li\u003e\n\u003cli\u003eYaneva, V., Baldwin, P., Jurich, D.P., Swygert, K. \u0026amp; Clauser, B.E. Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment. \u003cem\u003eAcad Med\u003c/em\u003e \u003cstrong\u003e99\u003c/strong\u003e, 192-197 (2024).\u003c/li\u003e\n\u003cli\u003eSiam, M.K.\u003cem\u003e, et al.\u003c/em\u003e Benchmarking large language models on the United States medical licensing examination for clinical reasoning and medical licensing scenarios. \u003cem\u003eSci Rep\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 1387 (2025).\u003c/li\u003e\n\u003cli\u003eYan Zhuang, Q.L., Zachary A. Pardos, Patrick C. Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, Enhong Chen. Position: AI Evaluation Should Learn from How We Test Humans. \u003cem\u003earXiv preprint arXiv:2303.13375\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eSchuwirth, L.W. \u0026amp; Van der Vleuten, C.P. Programmatic assessment: From assessment of learning to assessment for learning. \u003cem\u003eMed Teach\u003c/em\u003e \u003cstrong\u003e33\u003c/strong\u003e, 478-485 (2011).\u003c/li\u003e\n\u003cli\u003eLaw, A.K.\u003cem\u003e, et al.\u003c/em\u003e AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. \u003cem\u003eBMC Med Educ\u003c/em\u003e \u003cstrong\u003e25\u003c/strong\u003e, 208 (2025).\u003c/li\u003e\n\u003cli\u003eHingorjo, M.R. \u0026amp; Jaleel, F. Analysis of one-best MCQs: the difficulty index, discrimination index and distractor efficiency. \u003cem\u003eJ Pak Med Assoc\u003c/em\u003e \u003cstrong\u003e62\u003c/strong\u003e, 142-147 (2012).\u003c/li\u003e\n\u003cli\u003eSun, L.\u003cem\u003e, et al.\u003c/em\u003e Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics. 
\u003cem\u003eJ Med Internet Res\u003c/em\u003e \u003cstrong\u003e27\u003c/strong\u003e, e70901 (2025).\u003c/li\u003e\n\u003cli\u003eGaribaldi, R.A., Subhiyah, R., Moore, M.E. \u0026amp; Waxman, H. The In-Training Examination in Internal Medicine: an analysis of resident performance over time. \u003cem\u003eAnn Intern Med\u003c/em\u003e \u003cstrong\u003e137\u003c/strong\u003e, 505-510 (2002).\u003c/li\u003e\n\u003cli\u003eKatz, U.\u003cem\u003e, et al.\u003c/em\u003e GPT versus Resident Physicians \u0026mdash; A Benchmark Based on Official Board Scores. \u003cem\u003eNEJM AI\u003c/em\u003e \u003cstrong\u003e1\u003c/strong\u003e (2024).\u003c/li\u003e\n\u003cli\u003eIftikhar, H., Anjum, S., Bhutta, Z.A., Najam, M. \u0026amp; Bashir, K. Performance of ChatGPT in emergency medicine residency exams in Qatar: A comparative analysis with resident physicians. \u003cem\u003eQatar Med J\u003c/em\u003e \u003cstrong\u003e2024\u003c/strong\u003e, 61 (2024).\u003c/li\u003e\n\u003cli\u003eSingh AK, N.N., Verma VK, Prasanth PG. Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021\u0026ndash;2025). \u003cem\u003eJournal of Contemporary Clinical Practice\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 783-789 (2025).\u003c/li\u003e\n\u003cli\u003eKwong, J.C.C.\u003cem\u003e, et al.\u003c/em\u003e APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support. \u003cem\u003eJAMA Netw Open\u003c/em\u003e \u003cstrong\u003e6\u003c/strong\u003e, e2335377 (2023).\u003c/li\u003e\n\u003cli\u003eLinde, P.\u003cem\u003e, et al.\u003c/em\u003e Psychometric properties and detectability of GPT-4o-generated multiple-choice questions compared with human-authored items across imaging specialties. 
\u003cem\u003eNPJ Digit Med\u003c/em\u003e \u003cstrong\u003e9\u003c/strong\u003e, 132 (2026).\u003c/li\u003e\n\u003cli\u003eShelmerdine, S.C.\u003cem\u003e, et al.\u003c/em\u003e Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. \u003cem\u003eBMJ\u003c/em\u003e \u003cstrong\u003e379\u003c/strong\u003e, e072826 (2022).\u003c/li\u003e\n\u003cli\u003eWies, C., Hauser, K. \u0026amp; Brinker, T.J. Reply to: False conflict and false confirmation errors are crucial components of AI accuracy in medical decision making. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 6897 (2024).\u003c/li\u003e\n\u003cli\u003eAthaluri, S.A.\u003cem\u003e, et al.\u003c/em\u003e Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. \u003cem\u003eCureus\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, e37432 (2023).\u003c/li\u003e\n\u003cli\u003eTavakol, M. \u0026amp; Dennick, R. Post-examination analysis of objective tests. \u003cem\u003eMed Teach\u003c/em\u003e \u003cstrong\u003e33\u003c/strong\u003e, 447-458 (2011).\u003c/li\u003e\n\u003cli\u003eMurias Quintana, E.\u003cem\u003e, et al.\u003c/em\u003e Improving the ability to discriminate medical multiple-choice questions through the analysis of the competitive examination to assign residency positions in Spain. 
\u003cem\u003eBMC Med Educ\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 367 (2024).\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Tables","content":"\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"5\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTable 1.\u003c/strong\u003e Overall performance of AI models on cardiology residency in-service examination items\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 31.2914%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 14.0728%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRun-level accuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMajority-vote accuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 17.2185%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePrimary-set accuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSupplement-set accuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 31.2914%;\"\u003e\n \u003cp\u003eClaude Opus 4.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 14.0728%;\"\u003e\n \u003cp\u003e86.4%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e86.4%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 17.2185%;\"\u003e\n \u003cp\u003e84.6%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" 
valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e89.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 31.2914%;\"\u003e\n \u003cp\u003eGemini 3.1 Flash Lite\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 14.0728%;\"\u003e\n \u003cp\u003e83.0%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e82.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 17.2185%;\"\u003e\n \u003cp\u003e80.8%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e87.0%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 31.2914%;\"\u003e\n \u003cp\u003eGPT-5.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 14.0728%;\"\u003e\n \u003cp\u003e81.7%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e82.4%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 17.2185%;\"\u003e\n \u003cp\u003e80.8%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e85.5%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 31.2914%;\"\u003e\n \u003cp\u003eMedQwen\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 14.0728%;\"\u003e\n \u003cp\u003e52.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e53.3%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 17.2185%;\"\u003e\n \u003cp\u003e47.7%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" 
style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e63.8%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 31.2914%;\"\u003e\n \u003cp\u003eQwen-3.5-35B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 14.0728%;\"\u003e\n \u003cp\u003e18.0%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e18.6%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 17.2185%;\"\u003e\n \u003cp\u003e20.0%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e15.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"5\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFootnote:\u0026nbsp;\u003c/strong\u003eValues represent the proportion of correctly answered items. Accuracy was calculated on the text-only examination item set (n = 199), consisting of 130 primary and 69 supplementary items as defined in the Methods section. 
Majority-vote accuracy represents the final response obtained from the majority prediction across repeated model runs.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"604\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"3\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTable 2.\u0026nbsp;\u003c/strong\u003eAI model performance relative to the distribution of resident examination scores\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eEquivalent resident percentile\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003eClaude Opus 4.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e86.4%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003e91.7th percentile\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003eGemini 3.1 Flash Lite\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e82.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n 
\u003cp\u003e87.2nd percentile\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003eGPT-5.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e82.4%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003e87.0th percentile\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003eMedQwen\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e53.3%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003e28.2nd percentile\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003eQwen-3.5-35B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e18.6%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 39.0728%;\"\u003e\n \u003cp\u003e\u0026lt;1st percentile\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"3\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFootnote:\u0026nbsp;\u003c/strong\u003eResident percentile estimates were derived from the empirical distribution of correct response rates across cardiology residents participating in the in-service examination.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"604\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"6\" 
valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTable 3.\u003c/strong\u003e AI accuracy across human-defined examination item difficulty levels\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eHuman-defined item difficulty\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eClaude Opus 4.6\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.7285%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGemini 3.1 Flash Lite\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.5828%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGPT-5.4\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMedQwen\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eQwen-3.5-35B\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003eHard\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e70.18%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.7285%;\"\u003e\n \u003cp\u003e68.42%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.5828%;\"\u003e\n \u003cp\u003e61.40%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e33.33%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n 
\u003cp\u003e21.05%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e87.93%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.7285%;\"\u003e\n \u003cp\u003e79.31%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.5828%;\"\u003e\n \u003cp\u003e87.93%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e43.10%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e15.52%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 21.8543%;\"\u003e\n \u003cp\u003eEasy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e96.43%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.7285%;\"\u003e\n \u003cp\u003e95.24%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.5828%;\"\u003e\n \u003cp\u003e92.86%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.5629%;\"\u003e\n \u003cp\u003e73.81%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.7086%;\"\u003e\n \u003cp\u003e19.05%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"6\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFootnote:\u0026nbsp;\u003c/strong\u003eDifficulty categories were defined using tertiles of the resident correct response rate across examination items. 
Easy items correspond to the highest tertile of resident performance, whereas hard items correspond to the lowest tertile.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"614\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"7\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTable 4.\u0026nbsp;\u003c/strong\u003eMultivariable logistic regression analysis predicting AI correctness\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.4039%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eVariable\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 16.9381%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eClaude Opus 4.6 OR (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.3094%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGemini 3.1 Flash Lite OR (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.3779%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.4723%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGPT-5.4 OR (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.4039%;\"\u003e\n 
\u003cp\u003eDifficulty\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 16.9381%;\"\u003e\n \u003cp\u003e0.35 (0.18\u0026ndash;0.63)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.3094%;\"\u003e\n \u003cp\u003e0.41 (0.24\u0026ndash;0.68)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.3779%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.4723%;\"\u003e\n \u003cp\u003e0.35 (0.21\u0026ndash;0.59)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.4039%;\"\u003e\n \u003cp\u003eDiscrimination\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 16.9381%;\"\u003e\n \u003cp\u003e3.25 (0.53\u0026ndash;20.58)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e0.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.3094%;\"\u003e\n \u003cp\u003e1.79 (0.34\u0026ndash;9.26)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.3779%;\"\u003e\n \u003cp\u003e0.49\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.4723%;\"\u003e\n \u003cp\u003e1.30 (0.25\u0026ndash;6.73)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e0.75\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 18.4039%;\"\u003e\n \u003cp\u003ePrimary 
item\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 16.9381%;\"\u003e\n \u003cp\u003e0.76 (0.23\u0026ndash;2.21)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e0.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.3094%;\"\u003e\n \u003cp\u003e0.88 (0.31\u0026ndash;2.27)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 12.3779%;\"\u003e\n \u003cp\u003e0.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 15.4723%;\"\u003e\n \u003cp\u003e1.13 (0.42\u0026ndash;2.91)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 10.7492%;\"\u003e\n \u003cp\u003e0.80\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"7\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFootnote:\u0026nbsp;\u003c/strong\u003eOdds ratios represent the multiplicative change in the odds of an AI model answering an item correctly associated with each predictor. Difficulty was coded as an ordinal variable (easy to hard). 
Discrimination refers to the point-biserial discrimination index.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"614\" class=\"fr-table-selection-hover\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"3\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTable 5.\u0026nbsp;\u003c/strong\u003eCorrelation between human performance and AI correctness across examination items\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 42.9967%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 30.7818%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSpearman \u0026rho;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 26.2215%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-value\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 42.9967%;\"\u003e\n \u003cp\u003eClaude Opus 4.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 30.7818%;\"\u003e\n \u003cp\u003e0.296\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 26.2215%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 42.9967%;\"\u003e\n \u003cp\u003eGemini 3.1 Flash Lite\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 30.7818%;\"\u003e\n \u003cp\u003e0.249\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 26.2215%;\"\u003e\n 
\u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 42.9967%;\"\u003e\n \u003cp\u003eGPT-5.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 30.7818%;\"\u003e\n \u003cp\u003e0.295\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd nowrap=\"\" valign=\"top\" style=\"width: 26.2215%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd nowrap=\"\" colspan=\"3\" valign=\"top\" style=\"width: 100%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFootnote:\u0026nbsp;\u003c/strong\u003eSpearman correlation coefficients quantify the association between resident correct response rate and AI correctness across examination items.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence, Natural 
Language Processing, Cardiology, Educational Measurement, Psychometrics, Internship and Residency","lastPublishedDoi":"10.21203/rs.3.rs-9247601/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9247601/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eLarge language models (LLMs) have demonstrated rapidly expanding capabilities across medical knowledge tasks, including professional examinations. However, most existing evaluations focus primarily on overall accuracy and provide limited insight into how AI performance relates to the psychometric structure of examination items.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe evaluated the performance of five large language models on a dataset of 199 cardiology residency in-service examination questions. The models included three frontier general-purpose systems (Claude 4.6 Opus, Gemini 3.1 Flash-Lite, and GPT-5.4) and two medically oriented open-source models (MedQwen-2.5 and Qwen-3.5). Item-level analyses were conducted to examine the associations between AI accuracy and psychometric characteristics of exam questions, including human-defined item difficulty and item discrimination. Multivariable logistic regression was used to identify independent predictors of AI performance. Alignment between human and AI performance was assessed using Spearman correlation and distractor overlap analysis.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eFrontier models substantially outperformed medically oriented open-source models, achieving accuracies of 86.4% for Claude Opus, 82.9% for Gemini Flash-Lite, and 82.4% for GPT-5.4, compared with 53.3% for MedQwen and 18.6% for Qwen-3.5-35B. 
AI performance followed a clear gradient across human-defined difficulty levels, with frontier models answering 65\u0026ndash;74% of hard questions and 92\u0026ndash;96% of easy questions correctly. In multivariable analyses, item difficulty was the only psychometric factor consistently associated with AI success across frontier models (OR range 0.37\u0026ndash;0.47, all p\u0026thinsp;\u0026lt;\u0026thinsp;0.01). Human and AI performance were significantly correlated across items (Spearman ρ\u0026thinsp;\u0026asymp;\u0026thinsp;0.25\u0026ndash;0.30, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). When AI models answered incorrectly, they frequently selected the same distractors as human examinees, with error overlap ranging from 31% to 53%.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eLarge language models demonstrate strong performance on cardiology residency examination questions and exhibit meaningful alignment with human-defined item difficulty and performance patterns. These findings suggest that AI performance on medical examinations is structured by the same psychometric characteristics that shape human assessment outcomes. 
Integrating AI benchmarking with psychometric analysis may provide a more informative framework for evaluating future AI systems in medical education and knowledge assessment.\u003c/p\u003e","manuscriptTitle":"Psychometric Alignment Between Human and Artificial Intelligence Performance in Cardiology Residency In-Service Examinations","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-06 15:21:22","doi":"10.21203/rs.3.rs-9247601/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-04-19T23:03:23+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-19T08:18:39+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-14T14:15:37+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"123518062132220104495047849053544524677","date":"2026-04-11T05:35:45+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"125321642000745967150069198859335607735","date":"2026-04-11T02:09:39+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-10T17:18:16+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"129155859387820881657621158178658738303","date":"2026-04-01T21:06:00+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-31T21:15:06+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-03-31T14:57:35+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-03-31T03:51:57+00:00","index":"","fulltext":""},{"type":"submitted","content":"npj Digital Medicine","date":"2026-03-27T18:11:31+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email 
protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"34ca1b0f-f20d-481c-bb2b-4ffdb0db9ce9","owner":[],"postedDate":"April 6th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":65643403,"name":"Health sciences/Cardiology"},{"id":65643404,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":65643405,"name":"Health sciences/Health care"},{"id":65643406,"name":"Physical sciences/Mathematics and computing"},{"id":65643407,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2026-04-26T18:53:15+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-06 15:21:22","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9247601","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9247601","identity":"rs-9247601","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
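
The footnotes to Tables 1 to 3 describe three item-level computations: majority-vote accuracy across repeated runs, a resident-equivalent percentile taken from the empirical score distribution, and difficulty categories defined by tertiles of the resident correct response rate. The following is a minimal Python sketch of those computations, not the authors' code. The function names, the data layout, the tie-breaking rule, the "at or below" percentile convention, and the default quantile method are all assumptions made for illustration.

```python
from collections import Counter
from statistics import quantiles


def majority_vote(answers):
    """Most frequent answer across repeated runs.

    Assumption: ties go to the answer seen first; the paper does not state a rule.
    """
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(runs_by_item, answer_key):
    """Share of items whose majority-vote answer matches the key.

    runs_by_item: {item_id: [answer from each run]}  (hypothetical layout)
    answer_key:   {item_id: correct answer}
    """
    hits = sum(
        majority_vote(runs) == answer_key[item]
        for item, runs in runs_by_item.items()
    )
    return hits / len(runs_by_item)


def empirical_percentile(resident_scores, model_score):
    """Percentage of resident scores at or below the model's score.

    Assumption: "at or below" convention; other percentile definitions exist.
    """
    at_or_below = sum(score <= model_score for score in resident_scores)
    return 100.0 * at_or_below / len(resident_scores)


def difficulty_tertiles(resident_rates):
    """Label items by tertile of the resident correct response rate.

    Lowest tertile of resident performance is "hard", highest is "easy".
    Assumption: cut points come from statistics.quantiles with its default method.
    """
    low_cut, high_cut = quantiles(list(resident_rates.values()), n=3)
    labels = {}
    for item, rate in resident_rates.items():
        if rate <= low_cut:
            labels[item] = "hard"
        elif rate > high_cut:
            labels[item] = "easy"
        else:
            labels[item] = "moderate"
    return labels
```

With real data, items whose rates fall exactly on a cut point need an explicit convention, since the choice changes the tertile sizes.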
