Research Article
Comparison of AI-Generated and Clinician-Designed Multiple Choice Questions in Emergency Medicine Exam: A Psychometric Analysis
Murtaza Kaya, Ertan Sonmez, Ali Halici, Harun Yildirim, Abdil Coskun
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6319788/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 01 Jul, 2025; the published version appears in BMC Medical Education. Version 1 posted 13
Abstract
Background/aim
This study compared the effectiveness and psychometric quality of artificial intelligence (AI)-generated multiple-choice questions (MCQs), specifically from ChatGPT-4o, with clinician-designed MCQs in an emergency medicine residency program.
Methods
Eighteen emergency medicine residents participated, completing an examination of 100 questions (50 AI-generated and 50 clinician-designed) based on core emergency medicine topics. Psychometric analysis assessed item difficulty, discrimination, and reliability through the point-biserial correlation coefficient (PBCC).
Results
Results showed no significant difference in discrimination indices between AI-generated and clinician-designed MCQs, indicating both question sets were similarly effective at differentiating between high and low performers. However, AI-generated MCQs were significantly more difficult (mean item difficulty index, 0.65 versus 0.76; p = 0.02). Residents performed significantly better on AI-generated questions compared to clinician-designed ones (mean score, 76.8 versus 67.3; p = 0.003). Both question sets demonstrated comparable reliability in assessing resident knowledge, as indicated by similar PBCC values.
Conclusion
This study highlights the potential for AI-generated MCQs to supplement clinician-designed assessments effectively, demonstrating comparable psychometric properties and reliability. However, the higher difficulty level of AI-generated questions suggests the necessity for expert review and oversight to ensure appropriateness and context accuracy. Further research with larger sample sizes and diverse medical settings is recommended to validate these findings and explore the broader implications of incorporating AI into medical education assessment strategies.
Keywords: Artificial Intelligence, Multiple Choice Questions, Educational Measurement, Emergency Medicine, Psychometrics
1. INTRODUCTION
Medical education constantly evolves to integrate innovative tools that enhance assessment and learning. In recent years, artificial intelligence (AI) has gained attention for its potential role in medical training and evaluation [ 1 ].
One of the most promising AI models, ChatGPT-4o, has demonstrated capabilities in generating high-quality multiple-choice questions (MCQs) across various disciplines, including emergency medicine [ 2 ]. However, its effectiveness in creating exam questions that meet standard psychometric criteria, such as item difficulty and discrimination, remains an area of active investigation. A cross-sectional study by Maaß et al. (2024) highlighted that while most medical students are familiar with ChatGPT, they primarily use it as a simple search engine rather than as a structured learning tool. Moreover, the study found that students often lack formal training in AI applications, particularly in areas such as prompt engineering and ethical considerations, reinforcing the need for AI literacy in medical education [ 3 ]. A well-designed multiple-choice question should balance difficulty and discrimination to accurately assess knowledge levels [ 4 ]. The item difficulty index (P_index) indicates how challenging a question is by reflecting the proportion of correct responses, while the discrimination index (D_index) measures how well a question differentiates between high- and low-performing test takers [ 5 ]. Additionally, the point-biserial correlation coefficient (ρpb) evaluates the relationship between individual item performance and total test scores, providing insight into a question’s reliability in distinguishing knowledge levels [ 6 ]. Large-scale psychometric evaluations, such as the study by Kim et al (2023), have demonstrated that difficulty and discrimination indices vary significantly across different health professions’ licensing examinations. Their findings highlight the necessity of structured item analyses to ensure question validity and fairness, particularly when integrating AI-generated assessments into medical education [ 7 ]. 
As AI becomes increasingly incorporated into medical assessment, its ability to generate psychometrically valid MCQs remains an area of growing interest. A recent systematic review by Kıyak and Emekli (2024) analyzed 23 studies, highlighting ChatGPT’s potential in MCQ creation [ 8 ]. Their findings emphasized that well-structured prompts enhance the quality and validity of AI-generated questions, yet factual inaccuracies and contextual limitations remain concerns, necessitating expert review before widespread adoption. Notably, the review focused on the psychometric properties and challenges of AI-generated MCQs, providing insight into both their strengths and limitations in medical assessments. Despite growing interest in AI-generated MCQs, research on their psychometric validity in emergency medicine residency training remains limited. This study addresses this gap by systematically comparing the psychometric properties of MCQs created by clinical faculty and ChatGPT-4o. We analyzed item difficulty, discrimination, and point-biserial correlation to assess whether AI-generated questions can effectively evaluate medical knowledge. By examining the feasibility of AI-assisted question development, this study contributes to the evolving role of AI in medical education.
2. MATERIALS AND METHODS
2.1 Study Design and Setting
This study was designed as a cross-sectional, comparative psychometric analysis conducted in an emergency medicine residency program at a tertiary care hospital. The study adhered to the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines to ensure transparency and methodological rigor. The objective was to compare the psychometric properties of MCQs prepared by clinical faculty members and ChatGPT-4o, assessing their effectiveness in evaluating emergency medicine residents' knowledge.
2.2 Participants, Exam Structure and Blinding
Eighteen emergency medicine residents, categorized into junior, middle-senior, and senior levels of training, participated in this study. Each resident completed a 100-question multiple-choice examination comprising two sets:
50 questions developed by clinical faculty members with expertise in emergency medicine. These questions were created by modifying questions selected from the Tintinalli examination and board review book.
50 questions generated by ChatGPT-4o, utilizing standardized prompts to ensure relevance and clarity. Each section from Tintinalli’s Emergency Medicine book was separately uploaded to ChatGPT-4o to generate the desired number of questions.
Both sets of questions were developed based on core emergency medicine topics outlined in Tintinalli’s Emergency Medicine: A Comprehensive Study Guide, 9th Edition. The questions were categorized according to the relevant chapters of the textbook, ensuring that the AI-generated and faculty-designed MCQs covered equivalent subject matter. These topics included shock, airway management, cardiovascular emergencies, and mechanical ventilation, among others (Table 1). Each question had a single correct answer and was scored as correct (1 point) or incorrect (0 points). The time limit for each question was 1 minute. The psychometric properties of both question sets were analyzed to assess their effectiveness in evaluating emergency medicine residents’ knowledge.
To mitigate potential biases, participants were blinded to the source of the questions and were not informed whether a question was AI-generated or clinician-designed. This blinding ensured that responses were based solely on the question’s content and structure, rather than on preconceived notions about AI-generated questions.
2.3 Data Collection and Psychometric Analysis
Each question was analyzed using three key psychometric indices, which together provide a comprehensive assessment of question quality, reliability, and effectiveness in distinguishing knowledge levels.
1. Item Difficulty Index (P_index)
This measures how difficult or easy a question is based on the proportion of correct responses.
Calculation: The number of correct responses is divided by the total number of participants.
Interpretation: A higher P_index indicates that the question is easier, as a larger number of participants answered it correctly. Conversely, a lower P_index suggests that the question is more difficult, with fewer participants providing the correct answer.
2. Item Discrimination Index (D_index)
This assesses how well a question differentiates between high- and low-performing participants.
Calculation: The difference in correct response rates between the top 27% of scorers and the bottom 27% of scorers is measured.
Interpretation: When the D_index is high, it means the question effectively distinguishes between participants with strong and weak performance. If the D_index is low, the question does not differentiate well and may not be useful for assessing knowledge levels accurately.
3. Point-Biserial Correlation Coefficient (PBCC) [ 9 ]
This measures the correlation between a participant’s performance on a specific question and their overall test score.
Calculation: The mean test scores of participants who answered correctly are compared to those who answered incorrectly, adjusted for standard deviation.
Interpretation: A high PBCC value suggests that participants who performed well on the test were also more likely to answer this specific question correctly, indicating that it is a reliable measure of knowledge.
On the other hand, a low PBCC value implies that the question does not strongly correlate with overall performance, which may indicate issues with its clarity, difficulty, or relevance.
2.4 Statistical Analysis
Descriptive statistics (mean, standard deviation, frequency, percentage) were employed to summarize resident demographics and exam performance. An independent samples Student’s t-test was conducted to compare the difficulty index (P_index), discrimination index (D_index), and point-biserial correlation coefficient (PBCC) between clinical faculty and ChatGPT-4o questions. A chi-square test was employed to assess the distribution of questions across discrimination and difficulty categories, ensuring that both question sources were evaluated in accordance with standardized psychometric criteria. A Pearson correlation analysis was performed to examine the relationships between individual item performance and overall test scores. A significance level of p < 0.05 was employed to determine statistical significance, and all analyses were conducted utilizing SPSS (version 27) or equivalent statistical software.
2.5 Ethical Considerations
This study was approved by the local ethics committee (Kutahya Health Sciences University Medical Faculty Ethics Committee, approval date: 11.03.2025, approval number: 2025/04–09). All residents provided informed consent, and participation was voluntary.
3. RESULTS
Table 1 summarizes the demographic data for the 18 participating emergency medicine residents (mean age, gender distribution, and seniority status) together with the subject headings of the 100 multiple-choice questions (MCQs), 50 prepared by clinical faculty and 50 generated by ChatGPT-4o, which were chosen to give a balanced assessment of essential emergency medicine concepts.
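As a rough illustration of the three indices defined in Section 2.3, the sketch below computes P_index, D_index, and PBCC for a small, made-up response matrix. The function names, the toy data, the rounding used for the 27% group size, and the use of the uncorrected total score (which includes the item itself) are assumptions made for illustration; they are not taken from the study's own analysis, which was run in statistical software.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical toy data: rows are examinees, columns are items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
totals = [sum(row) for row in responses]  # total score per examinee


def item_column(j):
    """Return the 0/1 responses of all examinees to item j."""
    return [row[j] for row in responses]


def difficulty_index(item):
    """P_index: proportion of examinees who answered the item correctly."""
    return sum(item) / len(item)


def discrimination_index(item, totals, fraction=0.27):
    """D_index: correct rate in the top group minus correct rate in the bottom group.

    Group size is round(n * fraction), an assumed rounding rule
    (it gives 5 examinees per group when n = 18).
    """
    n = len(totals)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: totals[i])  # ascending by total score
    low, high = order[:k], order[-k:]
    return sum(item[i] for i in high) / k - sum(item[i] for i in low) / k


def point_biserial(item, totals):
    """PBCC: (M1 - M0) / s * sqrt(p * q), with s the population SD of total scores."""
    p = difficulty_index(item)
    s = pstdev(totals)
    if p in (0.0, 1.0) or s == 0:
        return float("nan")  # undefined when everyone (or no one) is correct
    m1 = mean(t for x, t in zip(item, totals) if x == 1)
    m0 = mean(t for x, t in zip(item, totals) if x == 0)
    return (m1 - m0) / s * sqrt(p * (1 - p))


for j in range(len(responses[0])):
    item = item_column(j)
    print(
        f"item {j + 1}: P={difficulty_index(item):.2f} "
        f"D={discrimination_index(item, totals):.2f} "
        f"PBCC={point_biserial(item, totals):.2f}"
    )
```

Because each item is scored 0 or 1, the mean P_index of an item set equals the mean proportion-correct score on that set whenever every examinee answers every item, so the two summaries move together.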
A comparison of the MCQs created by clinical faculty and by ChatGPT-4o revealed no significant difference in the discrimination index (D_index; p = 0.634), indicating a similar ability to differentiate between strong and weak performers. The difficulty index (P_index) showed that ChatGPT-4o-generated questions were significantly more difficult than those written by clinical faculty (p = 0.02). No difference was observed in the point-biserial correlation coefficient (PBCC; p = 0.60), suggesting comparable reliability. However, participants achieved significantly higher overall exam scores on ChatGPT-4o-generated questions (p = 0.003), which may reflect differences in question design and cognitive demands (Table 2).
The distribution of item difficulty and discrimination indices for AI-generated MCQs is illustrated in Fig. 1. In contrast, faculty-designed MCQs exhibited a more balanced difficulty distribution, suggesting that human-authored questions were generally easier while maintaining comparable discrimination indices (p = 0.634). The distribution of item difficulty and discrimination indices for clinician-designed MCQs is illustrated in Fig. 2.
In the categorical breakdown of questions by discrimination and difficulty level, the discrimination index (D_index) classified items as high, moderate, poor, or non-discriminatory. Notably, no questions fell into the "good" or "poor" discrimination categories. Although the proportion of highly discriminative questions was slightly higher for ChatGPT-4o-generated items (34%) than for those created by clinical faculty (32%), the difference was not statistically significant (p = 0.817). By difficulty category, ChatGPT-4o produced more "very difficult" questions than clinical faculty (4 vs. 1).
Additionally, the point-biserial correlation coefficient (PBCC), assessing the relationship between item performance and total test score, revealed no significant difference between the two question sets (p = 0.424). A chi-square test was conducted to analyze the distribution of items with statistically significant rho values (Table 3).
4. DISCUSSION
The integration of AI into medical education has gained significant attention, particularly in assessment methodologies such as question generation and automated evaluation [ 10 , 11 ]. Gordon et al. (2024) conducted a scoping review on AI applications in medical education and highlighted its growing role in adaptive learning, personalized instruction, and automated assessment [ 11 ]. While AI-powered tools show promise in enhancing assessment strategies, concerns remain regarding their reliability, validity, and alignment with educational objectives. Recent studies have explored the feasibility of using generative AI, such as ChatGPT-4, to create multiple-choice questions (MCQs) and have examined their psychometric properties in comparison to human-authored questions. Preiksaitis et al. (2023) emphasized that AI-generated questions can demonstrate validity and reliability, yet challenges persist in ensuring appropriate cognitive complexity and contextual accuracy [ 10 ]. In this study, we aimed to assess the psychometric properties of MCQs generated by ChatGPT-4o versus those created by clinical faculty, specifically evaluating their effectiveness in assessing emergency medicine residents’ knowledge. Our findings revealed that while both sources produced comparable questions in terms of discrimination index (D_index), AI-generated questions tended to be more challenging (P_index) but were associated with higher mean exam scores among participants.
Recent studies have explored the effectiveness of AI-generated multiple-choice questions (MCQs) in medical education, highlighting both the advantages and limitations of large language models in assessment design. Law et al. (2025) conducted a cohort study comparing ChatGPT-4o-generated MCQs with human-authored ones in a high-stakes emergency medicine licensing exam [ 12 ]. Their findings revealed that AI-generated MCQs were significantly easier (P_index = 0.78 vs. 0.69, p < 0.01) but showed comparable discrimination indices, suggesting their potential utility in assessing medical trainees. Similarly, Cheung et al. (2023) reported that ChatGPT-produced MCQs were comparable in quality to human-authored questions, except for slightly lower relevance scores [ 13 ]. However, our study presents a contrasting perspective, as we found that AI-generated MCQs were significantly more difficult than those created by human experts (P_index = 0.65 vs. 0.76, p = 0.02), yet maintained a similar discrimination index. This discrepancy may be attributed to differences in AI prompting techniques, dataset training, or the complexity of topics covered in emergency medicine compared to general medical exams. Beyond difficulty levels, AI-generated questions also exhibited distinct cognitive characteristics. Law et al. (2025) and Cheung et al. (2023) both found that AI-generated MCQs primarily assessed lower-order cognitive skills (i.e., knowledge recall and understanding) rather than higher-order reasoning such as application and analysis [ 12 , 13 ]. This aligns with concerns that large language models tend to prioritize factual recall over clinical reasoning. In contrast, our study suggests that AI-generated MCQs in emergency medicine settings were not only more difficult but also demonstrated an ability to test higher-order cognitive skills, particularly when guided with structured prompting.
This finding underscores the need for optimized AI-human collaboration, where AI’s efficiency in generating questions is coupled with expert review to ensure alignment with educational objectives. While AI models have shown remarkable time efficiency in MCQ generation, human oversight remains essential for refining question quality, contextual accuracy, and ensuring appropriate cognitive complexity. Recent research by Griot et al. (2024) examined the limitations of multiple-choice questions (MCQs) in assessing the reasoning capabilities of large language models (LLMs) such as ChatGPT [ 14 ]. Their study demonstrated that AI models often rely on pattern recognition rather than deep comprehension, raising concerns about the validity of MCQ-based evaluations for AI-generated content. This aligns with findings from Kung et al. (2023) who tested ChatGPT’s performance on the USMLE and noted that while the model achieved passing scores, its success appeared to stem from statistical inference rather than true clinical reasoning [ 2 ]. These results underscore the importance of diversifying assessment methods beyond MCQs, incorporating case-based and open-ended inquiries that better evaluate both AI and human examinees’ critical thinking and problem-solving skills. Our study supports this perspective, as AI-generated MCQs, while psychometrically comparable to clinician-designed ones, demonstrated limitations in assessing higher-order cognitive skills, reinforcing the need for expert revision. Similarly, a recent study by Chen et al. (2024) investigated the effectiveness of AI-generated content in high-stakes medical assessments [ 15 ]. Their findings revealed no significant difference in overall quality scores between AI- and human-authored exam content (p = 0.12), a result that closely aligns with our study’s observation that AI-generated questions exhibit similar psychometric properties to human-written ones. However, Chen et al. 
(2024) reported that human-generated MCQs performed better in specialties requiring nuanced contextual understanding, such as Obstetrics & Gynecology (p = 0.03) [ 15 ]. This aligns with our results, which indicate that while AI can produce technically sound MCQs, it may struggle to integrate real-world clinical complexity and contextual relevance. These findings collectively highlight that AI-generated MCQs can be a valuable tool in medical education but require expert oversight to ensure clinical depth and alignment with educational goals. Similarly, Naseer et al. (2024) found no significant difference in overall MCQ quality scores (p = 0.12), reinforcing the comparability of AI-generated and human-authored assessments. However, their study highlighted that human-generated MCQs outperformed AI-generated ones in the Obstetrics & Gynecology domain (p = 0.03), emphasizing the need for expert oversight in context-dependent specialties [ 16 ]. Our findings align with the broader literature, including Ali & Talat (2024), who systematically reviewed AI’s role in MCQ development [ 17 ]. While their review emphasized AI’s efficiency in automating question generation, they also noted limitations in content validity and reasoning ability. Our study supports these concerns, as AI-generated MCQs exhibited greater difficulty yet comparable discrimination indices, reinforcing the need for expert oversight to ensure clinical relevance and cognitive rigor. Lindqwister et al. (2023) evaluated the performance of ChatGPT in generating MCQs for medical licensing examinations, emphasizing the model’s ability to produce high-quality test items [ 18 ]. Their findings suggested that AI-generated questions demonstrated a higher probability of correctness (P_index) compared to human-created ones, differing from our study, where AI-generated questions were found to be more challenging.
This discrepancy may stem from differences in AI prompting strategies and the level of contextual guidance provided during question generation. Specifically, in our study, ChatGPT was instructed to utilize predefined textbook chapters as references, potentially leading to more content-dense and complex questions. Conversely, Lindqwister et al. (2023) noted that AI-generated questions could sometimes reflect statistical inference rather than deep understanding, aligning with broader concerns about pattern recognition bias in AI-based assessments [ 18 ]. These findings underscore the importance of structured AI-human collaboration in medical education, ensuring that AI-generated assessments align with intended cognitive learning outcomes. Meo et al. (2023) evaluated ChatGPT’s performance in both basic and clinical medical sciences using MCQ-based assessments [ 19 ]. Their study found that while ChatGPT obtained a 72% overall accuracy, its performance was notably higher in basic medical sciences (74%) compared to clinical disciplines (70%). While AI-generated MCQs were statistically more difficult in our study, their ability to measure higher-order cognitive skills remains debatable. Prior studies (e.g., Meo et al. 2023) suggest that AI models tend to excel in factual recall rather than complex reasoning, reinforcing the need for expert revision [ 19 ]. This aligns with concerns raised by other studies (Griot et al. 2024; Kung et al. 2023) that AI models struggle with complex clinical reasoning [ 2 , 14 ]. Our study supports these findings, as AI-generated MCQs exhibited comparable discrimination indices to clinician-designed ones but were significantly more challenging (P_index = 0.65 vs. 0.76, p = 0.02). These results highlight the importance of human oversight and structured AI prompting strategies to optimize AI-generated assessments and ensure they align with the cognitive demands of medical training.
Our findings align with the broader literature on AI-assisted MCQ generation. Kıyak and Emekli (2024) emphasized the critical role of prompt engineering in enhancing the accuracy and relevance of AI-generated questions while highlighting challenges like difficulty variations, discrimination inconsistencies, and factual inaccuracies [ 8 ]. In line with their review, our study found that ChatGPT-4o-generated MCQs were statistically comparable to clinician-designed ones but exhibited greater difficulty. However, structured prompts and textbook-based references in our methodology helped mitigate common AI-related issues. These findings reinforce the need for expert oversight to optimize AI-generated assessments in medical education. Kıyak et al. (2024) investigated the feasibility of using ChatGPT to generate case-based multiple-choice questions (MCQs) in a rational pharmacotherapy exam [ 20 ]. Their findings indicated that AI-generated questions demonstrated acceptable point-biserial correlations (0.41 and 0.39), suggesting their ability to differentiate between high- and low-performing students. These results align with our study, where AI-generated MCQs exhibited comparable discrimination indices to those created by clinicians. However, Kıyak et al. (2024) identified an AI-generated question with three non-functional distractors, whereas our study observed a more even distribution of response choices [ 20 ]. This suggests that structured prompt engineering and expert review can mitigate AI-related challenges, ensuring validity and reliability in AI-assisted assessment design. Our findings reinforce the notion that AI-generated MCQs can serve as a valuable supplement in medical education, provided they undergo systematic validation. Conversely, Coşkun et al. (2025) reported inconsistent psychometric performance for AI-generated MCQs, with only six out of fifteen items achieving an acceptable point-biserial correlation (> 0.30) [ 21 ]. 
This contrasts with our study, where AI-generated MCQs exhibited more consistent discrimination indices, comparable to those designed by clinical faculty. The observed discrepancy may stem from differences in question design methodology, prompt specificity, or the subject matter assessed. Coşkun et al. (2025) focused on evidence-based medicine, an area requiring nuanced clinical reasoning, whereas our study encompassed a broader spectrum of emergency medicine topics, potentially facilitating more structured question generation [ 21 ]. These findings highlight the importance of domain specificity and optimized AI prompting strategies in maximizing the effectiveness of AI-generated assessments for medical education.
4.1 Limitations and Future Directions
This study has some limitations. Firstly, the sample size was relatively small (n = 18 residents), which may limit the generalizability of the findings. Future studies with larger cohorts and multicenter designs are necessary to further validate AI-generated assessments. Secondly, while ChatGPT-4o generated questions without manual modifications, future research should evaluate AI-assisted question refinement, integrating expert review to optimize clarity and relevance. Finally, long-term studies should assess the impact of AI-generated questions on learning outcomes, particularly in formative assessments and competency-based medical education frameworks.
5. CONCLUSION
AI-generated MCQs can complement faculty-created questions in emergency medicine assessments, offering scalability and efficiency. ChatGPT-4o-generated questions showed comparable discrimination indices and psychometric reliability to human-crafted questions, despite being more challenging. However, expert oversight is crucial to address concerns like distractors and contextual limitations.
As AI evolves, its role in medical education should expand beyond question generation to include refinement, adaptive learning, and scenario-based assessments, fostering competency-driven training. Designing multiple-choice questions enhances educators’ clinical judgment and subject matter expertise. Relying solely on AI for this task therefore risks hindering the continuous self-improvement that educators gain from the mental exercise of preparing questions. Future research should explore how to integrate AI effectively, leveraging automation to enhance faculty engagement and expertise.
Declarations
Acknowledgements: Not required.
Author contributions: M.K. and E.S. contributed to the conceptualization of the study. M.K., E.S., and A.H. designed the methodology. Investigation and formal analysis were carried out by M.K. and A.H. A.C. was responsible for obtaining ethical approval and resources. The original draft was prepared by M.K. and E.S., and all authors (M.K., E.S., A.H., and A.C.) participated in reviewing and editing the manuscript. Supervision was provided by M.K. and E.S. All authors have read and approved the final version of the manuscript.
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding: None declared.
Informed consent: Informed consent was obtained from the participants or their legally authorized representatives.
Ethical approval: This study was approved by the local ethics committee (Kutahya Health Sciences University Medical Faculty Ethics Committee; approval date and number: 11.03.2025, 2025/04-09).
Human rights statement: The study protocol conforms to the ethical guidelines of the 1975 Declaration of Helsinki.
References
1. Mir MM, Mir GM, Raina NT, Mir SM, Mir SM, Miskeen E, et al. Application of Artificial Intelligence in Medical Education: Current Scenario and Future Perspectives. J Adv Med Educ Prof. 2023;11(3):133–40. https://doi.org/10.30476/JAMP.2023.98655.1803
2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. https://doi.org/10.1371/journal.pdig.0000198
3. Maaß L, Grab-Kroll C, Koerner J, Öchsner W, Schön M, Messerer D, et al. Artificial Intelligence and ChatGPT in Medical Education: A Cross-Sectional Questionnaire on students' Competence. J CME. 2024;14(1):2437293. https://doi.org/10.1080/28338073.2024.2437293
4. Sim SM, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false-type multiple choice questions of a para-clinical multidisciplinary paper. Ann Acad Med Singap. 2006;35(2):67–71.
5. Haladyna TM, Downing SM, Rodriguez MC. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Appl Meas Educ. 2002;15(3):309–33. https://doi.org/10.1207/S15324818AME1503_5
6. Ebel RL, Frisbie DA. Essentials of Educational Measurement. 5th ed. Englewood Cliffs, NJ: Prentice-Hall; 1991.
7. Kim YH, Kim BH, Kim J, Jung B, Bae S. Item difficulty index, discrimination index, and reliability of the 26 health professions licensing examinations in 2022, Korea: a psychometric study. J Educ Eval Health Prof. 2023;20:31. https://doi.org/10.3352/jeehp.2023.20.31
8. Kıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858–65. https://doi.org/10.1093/postmj/qgae065
9. Attali Y, Fraenkel T. The Point-Biserial as a Discrimination Index for Distractors in Multiple-Choice Items: Deficiencies in Usage and an Alternative. J Educ Meas. 2000;37(1):77–86. http://www.jstor.org/stable/1435063
10. Preiksaitis C, Rose C. Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review. JMIR Med Educ. 2023;9:e48785. https://doi.org/10.2196/48785
11. Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, et al. A scoping review of artificial intelligence in medical education: BEME Guide 84. Med Teach. 2024;46(4):446–70. https://doi.org/10.1080/0142159X.2024.2314198
12. Law AK, So J, Lui CT, Choi YF, Cheung KH, Kei-Ching Hung K, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ. 2025;25(1):208. https://doi.org/10.1186/s12909-025-06796-6
13. Cheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions-A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS One. 2023;18(8):e0290691. https://doi.org/10.1371/journal.pone.0290691
14. Griot M, Vanderdonckt J, Yuksel D, Hemptinne C. Multiple choice questions and large language models: a case study with fictional medical data. Preprint at https://arxiv.org/abs/2406.02394 (2024).
15. Chen J, Tao BK, Park S, Bovill E. Can ChatGPT Fool the Match? Artificial Intelligence Personal Statements for Plastic Surgery Residency Applications: A Comparative Study. Plast Surg (Oakville Ont). 2024;22925503241264832. Advance online publication. https://doi.org/10.1177/22925503241264832
16. Naseer MA, Nasir Y, Tabassum A, Ali S. ChatGPT-4 versus human generated final year MBBS multiple-choice questions – A study from a medical college of Pakistan. J Shalamar Med Dent Coll. 2024;5(2):58–64. https://doi.org/10.53685/jshmdc.v5i2.253
17. Ali F, Talat H. AI Integration in MCQ Development: Assessing Quality in Medical Education: A Systematic Review. Life Sci. 2024;5(3):413–26. https://doi.org/10.37185/LnS.1.1.643
18. Lindqwister AL, Hassanpour S, Levy J, Sin JM. AI-RADS: Successes and challenges of a novel artificial intelligence curriculum for radiologists across different delivery formats. Front Med Technol. 2023;4:1007708. https://doi.org/10.3389/fmedt.2022.1007708
19. Meo SA, Al-Masri AA, Alotaibi M, Meo MZS, Meo MOS. ChatGPT Knowledge Evaluation in Basic and Clinical Medical Sciences: Multiple Choice Question Examination-Based Performance. Healthcare (Basel). 2023;11(14):2046. https://doi.org/10.3390/healthcare11142046
20. Kıyak YS, Coşkun Ö, Budakoğlu İİ, Uluoğlu C. ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. Eur J Clin Pharmacol. 2024;80(5):729–35. https://doi.org/10.1007/s00228-024-03649-x
21. Coşkun Ö, Kıyak YS, Budakoğlu İİ. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Med Teach. 2025;47(2):268–74. https://doi.org/10.1080/0142159X.2024.2327477
Tables

Table 1: Participants and Main Topics of Questions

| Emergency Medicine Residents (n = 18) | Value |
|---|---|
| Mean age (years) | 26.8 |
| Gender: Female, n (%) | 8 (44) |
| Gender: Male, n (%) | 10 (56) |
| Seniority: Senior (> 3 years), n | 8 |
| Seniority: Middle Senior (1-3 years), n | 5 |
| Seniority: Junior (< 1 year), n | 5 |

| Subject Headings of the Questions (Q) | Clinical Q / ChatGPT-4o Q (n = 50/50) |
|---|---|
| Sudden Cardiac Death | 2/2 |
| Traumatic Shock | 3/3 |
| Non-Traumatic Shock and Anaphylaxis | 2/2 |
| Acid-Base Disorders, Blood Gas Analysis and Fluid Electrolyte Disorders | 3/3 |
| Asthma and Chronic Obstructive Pulmonary Disease | 2/2 |
| Rhythm Disorders | 2/2 |
| Antiarrhythmic, Antihypertensive and Positive Inotropic Drugs Used in Cardiac Rhythm Disorders and Defibrillation - Electrical Cardioversion | 2/2 |
| Hyperbaric Oxygen Therapy | 2/2 |
| Basic and Advanced Cardiopulmonary Resuscitation and What to Do in Special Cases (Pregnancy, Drowning, Freezing, Post Cardiac Arrest Care) | 3/3 |
| Invasive and Non-Invasive Methods in Difficult Airway Management | 3/3 |
| Basic Parameters and Modes in Mechanical Ventilation | 2/2 |
| Hemodynamic Monitoring, Cardiac Pacing and Defibrillation | 2/2 |
| Vascular Management, Catheterization Techniques | 2/2 |
| Chest Pain, Axis and Approach to Low-Probability Axis | 3/3 |
| Approach to the Syncope Patients | 3/3 |
| Acute Heart Failure and Valve Diseases | 2/2 |
| Deep Vein Thrombosis and Pulmonary Embolism | 2/2 |
| Hypertensive Emergencies and Pulmonary Hypertension | 3/3 |
| Acute Arterial Obstructions, Aortic Aneurysm and Dissection | 2/2 |
| Respiratory Distress | 3/3 |
| Tube Thoracostomy, Pericardiocentesis, Paracentesis Techniques in Hemothorax and Pneumothorax | 2/2 |

Table 2: Average of the Indexes and Participants' Exam Scores

| Measure (mean ± SD) | Clinical Q (n = 50) | ChatGPT-4o Q (n = 50) | p |
|---|---|---|---|
| D_index (1) | 0.172 ± 0.23 | 0.196 ± 0.26 | 0.634 |
| P_index (2) | 0.76 ± 0.23 | 0.65 ± 0.24 | 0.02 |
| PBCC (3) | 0.268 ± 0.30 | 0.236 ± 0.26 | 0.60 |
| Exam score | 67.3 ± 9.65 | 76.8 ± 8.18 | 0.003 |

Student's t-test was performed; p < 0.05 was considered statistically significant. 1: Discrimination index of the item. 2: Item difficulty index. 3: Point-Biserial Correlation Coefficient.

Table 3: Analysis of Questions According to Discrimination and Difficulty Indexes

| Index | Category | Clinical Q (n = 50) | ChatGPT-4o Q (n = 50) | p |
|---|---|---|---|---|
| D_index (1), n (%) | High | 16 (32%) | 17 (34%) | 0.817 |
| | Good | 0 | 0 | |
| | Moderate | 12 (24%) | 14 (28%) | |
| | Poor | 0 | 0 | |
| | No discrimination | 22 (44%) | 19 (38%) | |
| P_index (2), n | Very easy | 28 | 18 | 0.106 |
| | Easy | 13 | 14 | |
| | Medium | 2 | 8 | |
| | Difficult | 6 | 6 | |
| | Very difficult | 1 | 4 | |
| PBCC (3), n | Yes | 7 | 10 | 0.424 |
| | No | 43 | 40 | |

1: Discrimination index of the item. 2: Item difficulty index: represents the proportion of correct responses for each item, indicating its difficulty level. 3: Point-Biserial Correlation Coefficient between item score and total test score. A chi-square test was conducted based on the number of items with statistically significant scores.

Additional Declarations

No competing interests reported.

Supplementary Files

ClinicalQuestions.docx
ChatGpt4oQuestions.docx

Version 1 posted

Editorial decision: Revision requested, 19 May, 2025
Reviews received at journal, 17 May, 2025
Reviewers agreed at journal, 16 May, 2025
Reviewers agreed at journal, 15 May, 2025
Reviewers agreed at journal, 14 May, 2025
Reviews received at journal, 14 May, 2025
Reviewers agreed at journal, 14 May, 2025
Reviewers agreed at journal, 14 May, 2025
Reviewers invited by journal, 14 May, 2025
Editor invited by journal, 12 May, 2025
Editor assigned by journal, 04 Apr, 2025
Submission checks completed at journal, 04 Apr, 2025
First submitted to journal, 27 Mar, 2025
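The categorical p-values reported in Table 3 can be checked against the published counts. The sketch below is illustrative only: it assumes an uncorrected Pearson chi-square test and assumes that categories with zero items in both groups ("Good" and "Poor") are dropped before testing, neither of which the source states explicitly. The function names `chi2_sf` and `pearson_chi2` are hypothetical helpers written for this example, not part of the authors' SPSS workflow.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series
    total = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def pearson_chi2(rows):
    """Uncorrected Pearson chi-square for (clinician, ChatGPT) count pairs."""
    # Assumption: categories empty in both groups are excluded from the test
    rows = [r for r in rows if sum(r) > 0]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Counts copied from Table 3: (Clinical Q, ChatGPT-4o Q) per category
table3 = {
    "D_index": [(16, 17), (0, 0), (12, 14), (0, 0), (22, 19)],
    "P_index": [(28, 18), (13, 14), (2, 8), (6, 6), (1, 4)],
    "PBCC": [(7, 10), (43, 40)],
}

for name, rows in table3.items():
    stat, df, p = pearson_chi2(rows)
    print(f"{name}: chi2 = {stat:.3f}, df = {df}, p = {p:.3f}")
```

Under these assumptions the computed p-values fall within about 0.001 of the reported 0.817, 0.106, and 0.424. Several expected counts in the P_index comparison are below 5, so an exact test might be worth considering for that row.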
Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6319788","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":457028376,"identity":"3e440321-90a1-4050-ab94-abc5c4018d50","order_by":0,"name":"Murtaza Kaya","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/klEQVRIiWNgGAWjYDADCRDx4QBDApyDBzA2wFQxzkBoMSBOCzMPMVp0Z+Q+f/Djjx2DZPsZw882Z+ryDA4wH7zNw/AnH5cWsxvpho09PMkM0jw5xtI5Nw4XGxxgS7bmYTCwbMCpJY2xgUeCmUGOIcdAOufDgcQNB3jMpIFacLoMpKXxj0E9gxz/G+PfFh/qgFr4vxHU0syTcJhBWiLHTJrhBjPIFjb8Ws48Y5wtc+A4j+SMZ2WWPWcOF0seZjO2nGNgjFvL8TSGj2/+VMtJnE/efOPHsbo8vuPND2+8qZDDGzEgwMPAwAFVxAwiCGoAA/YHRCkbBaNgFIyCkQcASt1TjXg4YSwAAAAASUVORK5CYII=","orcid":"","institution":"Kütahya Sağlık Bilimleri Üniversitesi","correspondingAuthor":true,"prefix":"","firstName":"Murtaza","middleName":"","lastName":"Kaya","suffix":""},{"id":457028378,"identity":"e1cbda12-04da-46c4-86ab-25c5a421f04d","order_by":1,"name":"Ertan Sonmez","email":"","orcid":"","institution":"Kütahya Sağlık Bilimleri Üniversitesi","correspondingAuthor":false,"prefix":"","firstName":"Ertan","middleName":"","lastName":"Sonmez","suffix":""},{"id":457028380,"identity":"f0ddbe2a-fdc1-42ab-92a6-f64c6d24887b","order_by":2,"name":"Ali Halici","email":"","orcid":"","institution":"Kütahya Sağlık Bilimleri 
Üniversitesi","correspondingAuthor":false,"prefix":"","firstName":"Ali","middleName":"","lastName":"Halici","suffix":""},{"id":457028382,"identity":"c3cd57b4-a649-4b18-923b-b1dcd82d6934","order_by":3,"name":"Harun Yildirim","email":"","orcid":"","institution":"Kütahya Sağlık Bilimleri Üniversitesi","correspondingAuthor":false,"prefix":"","firstName":"Harun","middleName":"","lastName":"Yildirim","suffix":""},{"id":457028384,"identity":"d70a41fd-d0fc-4655-9781-c7d78a8148cf","order_by":4,"name":"Abdil Coskun","email":"","orcid":"","institution":"Kütahya Sağlık Bilimleri Üniversitesi","correspondingAuthor":false,"prefix":"","firstName":"Abdil","middleName":"","lastName":"Coskun","suffix":""}],"badges":[],"createdAt":"2025-03-27 10:53:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6319788/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6319788/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s12909-025-07528-6","type":"published","date":"2025-07-01T15:57:33+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":82889793,"identity":"9e0b1d2a-0991-4ef2-bd2a-9aa00db12a51","added_by":"auto","created_at":"2025-05-16 12:08:08","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":173177,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of Item Difficulty (P_index) and Discrimination (D_index) Indices of AI-Generated MCQs\u003c/p\u003e","description":"","filename":"Figure1jpeg.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6319788/v1/ec771ff53895dcc9782a2c7e.jpg"},{"id":82887965,"identity":"5f3c6c39-f537-4988-bb32-67c9c190632a","added_by":"auto","created_at":"2025-05-16 12:00:08","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":160954,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of Item Difficulty (P_index) and Discrimination (D_index) Indices of Clinician-Designed 
MCQs\u003c/p\u003e","description":"","filename":"Figure2jpeg.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6319788/v1/f2af4f5450bd827a2f1c5df4.jpg"},{"id":86179685,"identity":"5bc62a7c-e69c-4b2a-9b69-2aa189e60e51","added_by":"auto","created_at":"2025-07-07 16:18:21","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1108166,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6319788/v1/a0da650e-2ecf-44a2-bf2c-edac2371cf10.pdf"},{"id":82889794,"identity":"43e6c8d0-d261-43d8-91c2-a0cb7a346f98","added_by":"auto","created_at":"2025-05-16 12:08:08","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":2154438,"visible":true,"origin":"","legend":"","description":"","filename":"ClinicalQuestions.docx","url":"https://assets-eu.researchsquare.com/files/rs-6319788/v1/533a0789b48a0f222e5e5874.docx"},{"id":82887967,"identity":"c417b974-4b61-444f-97ba-aee0e4e494d7","added_by":"auto","created_at":"2025-05-16 12:00:08","extension":"docx","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":39328,"visible":true,"origin":"","legend":"","description":"","filename":"ChatGpt4oQuestions.docx","url":"https://assets-eu.researchsquare.com/files/rs-6319788/v1/8e083d70cb599de5b82d0546.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Comparison of AI-Generated and Clinician-Designed Multiple Choice Questions in Emergency Medicine Exam: A Psychometric Analysis","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003eMedical education constantly evolves to integrate innovative tools that enhance assessment and learning. In recent years, artificial intelligence (AI) has gained attention for its potential role in medical training and evaluation [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. 
One of the most promising AI models, ChatGPT-4o, has demonstrated capabilities in generating high-quality multiple-choice questions (MCQs) across various disciplines, including emergency medicine [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. However, its effectiveness in creating exam questions that meet standard psychometric criteria, such as item difficulty and discrimination, remains an area of active investigation. A cross-sectional study by Maa\u0026szlig; et al. (2024) highlighted that while most medical students are familiar with ChatGPT, they primarily use it as a simple search engine rather than as a structured learning tool. Moreover, the study found that students often lack formal training in AI applications, particularly in areas such as prompt engineering and ethical considerations, reinforcing the need for AI literacy in medical education [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eA well-designed multiple-choice question should balance difficulty and discrimination to accurately assess knowledge levels [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. The item difficulty index (P_index) indicates how challenging a question is by reflecting the proportion of correct responses, while the discrimination index (D_index) measures how well a question differentiates between high- and low-performing test takers [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. 
Additionally, the point-biserial correlation coefficient (ρpb) evaluates the relationship between individual item performance and total test scores, providing insight into a question\u0026rsquo;s reliability in distinguishing knowledge levels [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eLarge-scale psychometric evaluations, such as the study by Kim et al (2023), have demonstrated that difficulty and discrimination indices vary significantly across different health professions\u0026rsquo; licensing examinations. Their findings highlight the necessity of structured item analyses to ensure question validity and fairness, particularly when integrating AI-generated assessments into medical education [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAs AI becomes increasingly incorporated into medical assessment, its ability to generate psychometrically valid MCQs remains an area of growing interest. A recent systematic review by Kıyak and Emekli (2024) analyzed 23 studies, highlighting ChatGPT\u0026rsquo;s potential in MCQ creation [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Their findings emphasized that well-structured prompts enhance the quality and validity of AI-generated questions, yet factual inaccuracies and contextual limitations remain concerns, necessitating expert review before widespread adoption. Notably, the review focused on the psychometric properties and challenges of AI-generated MCQs, providing insight into both their strengths and limitations in medical assessments.\u003c/p\u003e \u003cp\u003eDespite growing interest in AI-generated MCQs, research on their psychometric validity in emergency medicine residency training remains limited. This study addresses this gap by systematically comparing the psychometric properties of MCQs created by clinical faculty and ChatGPT-4o. 
We analyzed item difficulty, discrimination, and point-biserial correlation to assess whether AI-generated questions can effectively evaluate medical knowledge. By examining the feasibility of AI-assisted question development, this study contributes to the evolving role of AI in medical education.\u003c/p\u003e"},{"header":"2. MATERIALS AND METHODS","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n\u003ch2\u003e2.1 Study Design and Setting\u003c/h2\u003e\n\u003cp\u003eThis study was designed as a cross-sectional, comparative psychometric analysis conducted in an emergency medicine residency program at a tertiary care hospital. The study adhered to the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines to ensure transparency and methodological rigor. The objective was to compare the psychometric properties of MCQs prepared by clinical faculty members and ChatGPT-4o, assessing their effectiveness in evaluating emergency medicine residents' knowledge.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\n\u003ch2\u003e2.2 Participants, Exam Structure and Blinding\u003c/h2\u003e\n\u003cp\u003eEighteen emergency medicine residents, categorized into junior, middle-senior, and senior levels of training, participated in this study. Each resident administered a 100-question multiple-choice examination, comprising:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003e50 questions developed by clinical faculty members with expertise in emergency medicine. Trainee questions were created by modifying questions selected from the Tintinalli examination and board review book.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e50 questions generated by ChatGPT-4, utilizing standardized prompts to ensure relevance and clarity. 
Each section from Tintinalli\u0026rsquo;s Emergency Medicine book was separately uploaded to ChatGPT-4 to generate the desired number of questions.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eBoth sets of questions were developed based on core emergency medicine topics outlined in Tintinalli\u0026rsquo;s Emergency Medicine: A Comprehensive Study Guide, 9th Edition. The questions were categorized according to the relevant chapters of the textbook, ensuring that the AI-generated and faculty-designed (MCQs) covered equivalent subject matter. These topics included shock, airway management, cardiovascular emergencies, and mechanical ventilation, among others (Table\u0026nbsp;1). Each question had a single correct answer and was scored as correct (1 point) or incorrect (0 points). The time limit for each question was 1 minute. The psychometric properties of both question sets were analyzed to assess their effectiveness in evaluating emergency medicine residents\u0026rsquo; knowledge.\u003c/p\u003e\n\u003cp\u003eTo mitigate potential biases, participants were blinded to the source of the questions and were not informed whether a question was AI-generated or clinician-designed. This blinding ensured that responses were solely based on the question\u0026rsquo;s content and structure, rather than preconceived notions about AI-generated questions.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\n\u003ch2\u003e2.3 Data Collection and Psychometric Analysis\u003c/h2\u003e\n\u003cp\u003eEach question was analyzed using three key psychometric indices, which together provide a comprehensive assessment of question quality, reliability, and effectiveness in distinguishing knowledge levels.\u003c/p\u003e\n\u003cstrong\u003e1. 
Item Difficulty Index (P_index)\u003c/strong\u003e\u003cbr /\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eThis measures how difficult or easy a question is based on the proportion of correct responses.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eCalculation\u003c/strong\u003e: The number of correct responses is divided by the total number of participants.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eInterpretation\u003c/strong\u003e: A higher P_index indicates that the question is easier, as a larger number of participants answered it correctly. Conversely, a lower P_index suggests that the question is more difficult, with fewer participants providing the correct answer.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cstrong\u003e2. Item Discrimination Index (D_index)\u003c/strong\u003e\u003cbr /\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eThis assesses how well a question differentiates between high- and low-performing participants.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eCalculation\u003c/strong\u003e: The difference in correct response rates between the top 27% of scorers and the bottom 27% of scorers is measured.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eInterpretation\u003c/strong\u003e: When the D_index is high, it means the question effectively distinguishes between participants with strong and weak performance. If the D_index is low, the question does not differentiate well and may not be useful for assessing knowledge levels accurately.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cstrong\u003e3. 
Point-Biserial Correlation Coefficient (PBCC)\u003c/strong\u003e [\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003cbr /\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eThis measures the correlation between a participant\u0026rsquo;s performance on a specific question and their overall test score.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eCalculation\u003c/strong\u003e: The mean test scores of participants who answered correctly are compared to those who answered incorrectly, adjusted for standard deviation.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eInterpretation\u003c/strong\u003e: A high PBCC value suggests that participants who performed well on the test were also more likely to answer this specific question correctly, indicating that it is a reliable measure of knowledge. On the other hand, a low PBCC value implies that the question does not strongly correlate with overall performance, which may indicate issues with its clarity, difficulty, or relevance.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\n\u003ch2\u003e2.4 Statistical Analysis\u003c/h2\u003e\n\u003cp\u003eDescriptive statistics (mean, standard deviation, frequency, percentage) were employed to summarize resident demographics and exam performance. An independent samples Student\u0026rsquo;s t-test was conducted to compare the difficulty index (P_index), discrimination index (D_index), and point-biserial correlation coefficient (PBCC) between clinical faculty and ChatGPT-4o questions. A chi-square test was employed to assess the distribution of questions across discrimination and difficulty categories, ensuring that both question sources were evaluated in accordance with standardized psychometric criteria. 
A Pearson correlation analysis was performed to elucidate the relationships between individual item performance and overall test scores, thereby enhancing the psychometric reliability of AI-generated questions. A significance level of p\u0026thinsp;\u0026lt;\u0026thinsp;0.05 was employed to determine statistical significance, and all analyses were conducted utilizing SPSS (version 27) or equivalent statistical software.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\n\u003ch2\u003e2.5 Ethical Considerations\u003c/h2\u003e\n\u003cp\u003eThis study was approved by the local ethics committee (Kutahya Health Sciences University Medical Faculty Ethics Committee, approval date: 11.03.2025, approval number: 2025/04\u0026ndash;09). All residents provided informed consent, and participation was voluntary.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"3. RESULTS","content":"\u003cp\u003eDemographic data for the 18 emergency medicine residents participating in the study, including their mean age, gender distribution, and seniority status, as well as an overview of the subject headings for the 100 multiple-choice questions (MCQs)\u0026mdash;50 prepared by clinical faculty and 50 generated by ChatGPT-4o\u0026mdash;are summarized, ensuring a balanced assessment of essential emergency medicine concepts (Table\u0026nbsp;1).\u003c/p\u003e\n\u003cp\u003eA comparison of multiple-choice questions (MCQs) created by clinical faculty and ChatGPT-4o revealed no significant difference in the discrimination index (D_index; p\u0026thinsp;=\u0026thinsp;0.634), indicating a similar ability to differentiate between strong and weak performers. The difficulty index (P_index) showed that ChatGPT-4o-generated questions were significantly more challenging (p\u0026thinsp;=\u0026thinsp;0.02). Additionally, no difference was observed in the point-biserial correlation coefficient (PBCC; p\u0026thinsp;=\u0026thinsp;0.60), suggesting comparable reliability. 
However, participants achieved significantly higher overall exam scores on ChatGPT-4o-generated questions (p\u0026thinsp;=\u0026thinsp;0.003), reflecting differences in question design and cognitive demands (Table\u0026nbsp;2).\u003c/p\u003e\n\u003cp\u003eThe difficulty index (P_index) indicated that ChatGPT-4o-generated questions were significantly more difficult than those written by clinical faculty (p\u0026thinsp;=\u0026thinsp;0.02). The distribution of item difficulty and discrimination indices for AI-generated MCQs is illustrated in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e. In contrast, faculty-designed MCQs exhibited a more balanced difficulty distribution, suggesting that human-authored questions were generally easier but still maintained comparable discrimination indices (p\u0026thinsp;=\u0026thinsp;0.634). The distribution of item difficulty and discrimination indices for clinician-designed MCQs is illustrated in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eThe categorical breakdown of questions based on their discrimination and difficulty levels showed that the discrimination index (D_index) categorized items as high, moderate, poor, or non-discriminatory. Notably, no questions fell into the \"good\" or \"poor\" discrimination categories. Although the proportion of highly discriminative questions was slightly higher for ChatGPT-4o-generated items (34%) compared to those created by clinical faculty (32%), the difference was not statistically significant (p\u0026thinsp;=\u0026thinsp;0.817). The difficulty index (P_index) indicated that ChatGPT-4o produced more \"very difficult\" questions compared to clinical faculty (4 vs. 1). Additionally, the point-biserial correlation coefficient (PBCC), assessing the relationship between item performance and total test score, revealed no significant difference between the two question sets (p\u0026thinsp;=\u0026thinsp;0.424). 
A chi-square test was conducted to analyze the distribution of items with statistically significant rho values (Table\u0026nbsp;3).\u003c/p\u003e"},{"header":"4. DISCUSSION","content":"\u003cp\u003eThe integration of AI into medical education has gained significant attention, particularly in assessment methodologies such as question generation and automated evaluation [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Gordon et al. (2024) conducted a scoping review on AI applications in medical education and highlighted its growing role in adaptive learning, personalized instruction, and automated assessment [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. While AI-powered tools show promise in enhancing assessment strategies, concerns remain regarding their reliability, validity, and alignment with educational objectives. Recent studies have explored the feasibility of using generative AI, such as ChatGPT-4, to create multiple-choice questions (MCQs) and have examined their psychometric properties in comparison to human-authored questions. Preiksaitis et al. (2023) emphasized that AI-generated questions can demonstrate validity and reliability, yet challenges persist in ensuring appropriate cognitive complexity and contextual accuracy [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. In this study, we aimed to assess the psychometric properties of MCQs generated by ChatGPT-4 versus those created by clinical faculty, specifically evaluating their effectiveness in assessing emergency medicine residents\u0026rsquo; knowledge. 
Our findings revealed that while both sources produced comparable questions in terms of discrimination index (D_index), AI-generated questions tended to be more challenging (P_index) but were associated with higher mean exam scores among participants.\u003c/p\u003e \u003cp\u003eRecent studies have explored the effectiveness of AI-generated multiple-choice questions (MCQs) in medical education, highlighting both the advantages and limitations of large language models in assessment design. Law et al. (2025) conducted a cohort study comparing ChatGPT-4o-generated MCQs with human-authored ones in a high-stakes emergency medicine licensing exam [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Their findings revealed that AI-generated MCQs were significantly easier (P_index\u0026thinsp;=\u0026thinsp;0.78 vs. 0.69, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01) but showed comparable discrimination indices, suggesting their potential utility in assessing medical trainees. Similarly, Cheung et al. (2023) reported that ChatGPT-produced MCQs were comparable in quality to human-authored questions, except for slightly lower relevance scores [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. However, our study presents a contrasting perspective, as we found that AI-generated MCQs were significantly more difficult than those created by human experts (P_index\u0026thinsp;=\u0026thinsp;0.65 vs. 0.76, p\u0026thinsp;=\u0026thinsp;0.02), yet maintained a similar discrimination index. This discrepancy may be attributed to differences in AI prompting techniques, dataset training, or the complexity of topics covered in emergency medicine compared to general medical exams.\u003c/p\u003e \u003cp\u003eBeyond difficulty levels, AI-generated questions also exhibited distinct cognitive characteristics. Law et al. (2023) and Cheung et al. 
(2023) both found that AI-generated MCQs primarily assessed lower-order cognitive skills (i.e., knowledge recall and understanding) rather than higher-order reasoning such as application and analysis [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. This aligns with concerns that large language models tend to prioritize factual recall over clinical reasoning. In contrast, our study suggests that AI-generated MCQs in emergency medicine settings were not only more difficult but also demonstrated an ability to test higher-order cognitive skills, particularly when guided with structured prompting. This finding underscores the need for optimized AI-human collaboration, where AI\u0026rsquo;s efficiency in generating questions is coupled with expert review to ensure alignment with educational objectives. While AI models have shown remarkable time efficiency in MCQ generation, human oversight remains essential for refining question quality, contextual accuracy, and ensuring appropriate cognitive complexity.\u003c/p\u003e \u003cp\u003eRecent research by Griot et al. (2024) examined the limitations of multiple-choice questions (MCQs) in assessing the reasoning capabilities of large language models (LLMs) such as ChatGPT [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Their study demonstrated that AI models often rely on pattern recognition rather than deep comprehension, raising concerns about the validity of MCQ-based evaluations for AI-generated content. This aligns with findings from Kung et al. (2023) who tested ChatGPT\u0026rsquo;s performance on the USMLE and noted that while the model achieved passing scores, its success appeared to stem from statistical inference rather than true clinical reasoning [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
These results underscore the importance of diversifying assessment methods beyond MCQs, incorporating case-based and open-ended inquiries that better evaluate both AI and human examinees\u0026rsquo; critical thinking and problem-solving skills. Our study supports this perspective, as AI-generated MCQs, while psychometrically comparable to clinician-designed ones, demonstrated limitations in assessing higher-order cognitive skills, reinforcing the need for expert revision.\u003c/p\u003e \u003cp\u003eSimilarly, a recent study by Chen et al. (2024) investigated the effectiveness of AI-generated content in high-stakes medical assessments [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Their findings revealed no significant difference in overall quality scores between AI- and human-authored exam content (p\u0026thinsp;=\u0026thinsp;0.12), a result that closely aligns with our study\u0026rsquo;s observation that AI-generated questions exhibit similar psychometric properties to human-written ones. However, Chen et al. (2024) reported that human-generated MCQs performed better in specialties requiring nuanced contextual understanding, such as Obstetrics \u0026amp; Gynecology (p\u0026thinsp;=\u0026thinsp;0.03) [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. This aligns with our results, which indicate that while AI can produce technically sound MCQs, it may struggle to integrate real-world clinical complexity and contextual relevance. These findings collectively highlight that AI-generated MCQs can be a valuable tool in medical education but require expert oversight to ensure clinical depth and alignment with educational goals. Similarly, Naseer et al. (2024) found no significant difference in overall MCQ quality scores (p\u0026thinsp;=\u0026thinsp;0.12), reinforcing the comparability of AI-generated and human-authored assessments. 
However, their study highlighted that human-generated MCQs outperformed AI-generated ones in the Obstetrics \u0026amp; Gynecology domain (p\u0026thinsp;=\u0026thinsp;0.03), emphasizing the need for expert oversight in context-dependent specialties [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOur findings align with the broader literature, including Ali \u0026amp; Talat (2024), who systematically reviewed AI\u0026rsquo;s role in MCQ development [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. While their review emphasized AI\u0026rsquo;s efficiency in automating question generation, they also noted limitations in content validity and reasoning ability. Our study supports these concerns, as AI-generated MCQs exhibited greater difficulty yet comparable discrimination indices, reinforcing the need for expert oversight to ensure clinical relevance and cognitive rigor.\u003c/p\u003e \u003cp\u003eLindqwister et al. (2023) evaluated the performance of ChatGPT in generating MCQs for medical licensing examinations, emphasizing the model\u0026rsquo;s ability to produce high-quality test items [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Their findings suggested that AI-generated questions demonstrated a higher probability of correctness (P_index) compared to human-created ones, differing from our study, where AI-generated questions were found to be more challenging. This discrepancy may stem from differences in AI prompting strategies and the level of contextual guidance provided during question generation. Specifically, in our study, ChatGPT was instructed to utilize predefined textbook chapters as references, potentially leading to more content-dense and complex questions. Conversely, Lindqwister et al.
(2023) noted that AI-generated questions could sometimes reflect statistical inference rather than deep understanding, aligning with broader concerns about pattern recognition bias in AI-based assessments [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. These findings underscore the importance of structured AI-human collaboration in medical education, ensuring that AI-generated assessments align with intended cognitive learning outcomes.\u003c/p\u003e \u003cp\u003eMeo et al. (2023) evaluated ChatGPT\u0026rsquo;s performance in both basic and clinical medical sciences using MCQ-based assessments [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. Their study found that while ChatGPT obtained a 72% overall accuracy, its performance was notably higher in basic medical sciences (74%) compared to clinical disciplines (70%). While AI-generated MCQs were statistically more difficult, their ability to measure higher-order cognitive skills remains debatable. Prior studies (e.g., Meo et al. 2023) suggest that AI models tend to excel in factual recall rather than complex reasoning, reinforcing the need for expert revision [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. This aligns with concerns raised by other studies (Griot et al. 2024; Kung et al. 2023) that AI models often excel at factual recall but struggle with complex clinical reasoning [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Our study supports these findings, as AI-generated MCQs exhibited comparable discrimination indices to clinician-designed ones but were significantly more challenging (P_index\u0026thinsp;=\u0026thinsp;0.65 vs. 0.76, p\u0026thinsp;=\u0026thinsp;0.02). 
These results highlight the importance of human oversight and structured AI prompting strategies to optimize AI-generated assessments and ensure they align with the cognitive demands of medical training.\u003c/p\u003e \u003cp\u003eOur findings align with the broader literature on AI-assisted MCQ generation. Kıyak and Emekli (2024) emphasized the critical role of prompt engineering in enhancing the accuracy and relevance of AI-generated questions while highlighting challenges like difficulty variations, discrimination inconsistencies, and factual inaccuracies [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. In line with their review, our study found that ChatGPT-4o-generated MCQs were statistically comparable to clinician-designed ones but exhibited greater difficulty. However, structured prompts and textbook-based references in our methodology helped mitigate common AI-related issues. These findings reinforce the need for expert oversight to optimize AI-generated assessments in medical education.\u003c/p\u003e \u003cp\u003eKıyak et al. (2024) investigated the feasibility of using ChatGPT to generate case-based multiple-choice questions (MCQs) in a rational pharmacotherapy exam [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Their findings indicated that AI-generated questions demonstrated acceptable point-biserial correlations (0.41 and 0.39), suggesting their ability to differentiate between high- and low-performing students. These results align with our study, where AI-generated MCQs exhibited comparable discrimination indices to those created by clinicians. However, Kıyak et al. (2024) identified an AI-generated question with three non-functional distractors, whereas our study observed a more even distribution of response choices [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
This suggests that structured prompt engineering and expert review can mitigate AI-related challenges, ensuring validity and reliability in AI-assisted assessment design. Our findings reinforce the notion that AI-generated MCQs can serve as a valuable supplement in medical education, provided they undergo systematic validation.\u003c/p\u003e \u003cp\u003eConversely, Coşkun et al. (2025) reported inconsistent psychometric performance for AI-generated MCQs, with only six out of fifteen items achieving an acceptable point-biserial correlation (\u0026gt;\u0026thinsp;0.30) [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. This contrasts with our study, where AI-generated MCQs exhibited more consistent discrimination indices, comparable to those designed by clinical faculty. The observed discrepancy may stem from differences in question design methodology, prompt specificity, or the subject matter assessed. Coşkun et al. (2025) focused on evidence-based medicine, an area requiring nuanced clinical reasoning, whereas our study encompassed a broader spectrum of emergency medicine topics, potentially facilitating more structured question generation [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. These findings highlight the importance of domain specificity and optimized AI prompting strategies in maximizing the effectiveness of AI-generated assessments for medical education.\u003c/p\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Limitations and Future Directions\u003c/h2\u003e \u003cp\u003eThis study has some limitations. Firstly, the sample size was relatively small (n\u0026thinsp;=\u0026thinsp;18 residents), which may limit the generalizability of the findings. Future studies with larger cohorts and multicenter designs are necessary to further validate AI-generated assessments. 
Secondly, the ChatGPT-4o-generated questions were used without manual modification; future research should evaluate AI-assisted question refinement that integrates expert review to optimize clarity and relevance. Finally, long-term studies should assess the impact of AI-generated questions on learning outcomes, particularly in formative assessments and competency-based medical education frameworks.\u003c/p\u003e \u003c/div\u003e"},{"header":"5. CONCLUSION","content":"\u003cp\u003eAI-generated MCQs can complement faculty-created questions in emergency medicine assessments, offering scalability and efficiency. ChatGPT-4o-generated questions showed comparable discrimination indices and psychometric reliability to human-crafted questions, despite being more challenging. However, expert oversight is crucial to address concerns such as distractor quality and contextual limitations. As AI evolves, its role in medical education should expand beyond question generation to include refinement, adaptive learning, and scenario-based assessments, fostering competency-driven training.\u003c/p\u003e \u003cp\u003eDesigning multiple-choice questions enhances educators\u0026rsquo; clinical judgment and subject matter expertise. However, relying solely on AI for this task risks depriving educators of the mental exercise that question preparation provides, which could hinder their continuous self-improvement. Future research should explore how to integrate AI so that automation enhances, rather than replaces, faculty engagement and expertise.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements:\u0026nbsp;\u003c/strong\u003eNot required\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e M.K. and E.S. contributed to the conceptualization of the study. M.K., E.S., and A.H. designed the methodology. Investigation and formal analysis were carried out by M.K. and A.H. A.C. was responsible for obtaining ethical approval and resources.
The original draft was prepared by M.K. and E.S., and all authors (M.K., E.S., A.H., and A.C.) participated in reviewing and editing the manuscript. Supervision was provided by M.K. and E.S. All authors have read and approved the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of conflicting interests:\u0026nbsp;\u003c/strong\u003eThe author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u0026nbsp;\u003c/strong\u003eNone declared\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cbr\u003e\u0026nbsp;Informed consent:\u003c/strong\u003e Informed consent was obtained from the participants or their legally authorized representatives.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cbr\u003e\u0026nbsp;Ethical approval:\u0026nbsp;\u003c/strong\u003eThis study was approved by the Kutahya Health Sciences University Medical Faculty Ethics Committee (approval date and number: 11.03.2025, 2025/04-09).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHuman rights statement:\u003c/strong\u003e The study protocol conforms to the ethical guidelines of the 1975 Declaration of Helsinki.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMir MM, Mir GM, Raina NT, Mir SM, Mir SM, Miskeen E, et al. Application of Artificial Intelligence in Medical Education: Current Scenario and Future Perspectives. J Adv Med Educ Prof. 2023;11(3):133\u0026ndash;40.
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.30476/JAMP.2023.98655.1803\u003c/span\u003e\u003cspan address=\"10.30476/JAMP.2023.98655.1803\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepa\u0026ntilde;o C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pdig.0000198\u003c/span\u003e\u003cspan address=\"10.1371/journal.pdig.0000198\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaa\u0026szlig; L, Grab-Kroll C, Koerner J, \u0026Ouml;chsner W, Sch\u0026ouml;n M, Messerer D, et al. Artificial Intelligence and ChatGPT in Medical Education: A Cross-Sectional Questionnaire on students' Competence. J CME. 2024;14(1):2437293. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/28338073.2024.2437293\u003c/span\u003e\u003cspan address=\"10.1080/28338073.2024.2437293\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSim SM, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false-type multiple choice questions of a para-clinical multidisciplinary paper. Ann Acad Med Singap. 2006;35(2):67\u0026ndash;71.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHaladyna TM, Downing SM, Rodriguez MC. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Appl Meas Educ. 2002;15(3):309\u0026ndash;33. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1207/S15324818AME1503_5\u003c/span\u003e\u003cspan address=\"10.1207/S15324818AME1503_5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEbel RL, Frisbie DA. Essentials of Educational Measurement. 5th ed. Englewood Cliffs, NJ: Prentice-Hall; 1991.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim YH, Kim BH, Kim J, Jung B, Bae S. Item difficulty index, discrimination index, and reliability of the 26 health professions licensing examinations in 2022, Korea: a psychometric study. J Educ Eval Health Prof. 2023;20:31. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3352/jeehp.2023.20.31\u003c/span\u003e\u003cspan address=\"10.3352/jeehp.2023.20.31\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858\u0026ndash;65. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/postmj/qgae065\u003c/span\u003e\u003cspan address=\"10.1093/postmj/qgae065\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAttali Y, Fraenkel T. The Point-Biserial as a Discrimination Index for Distractors in Multiple-Choice Items: Deficiencies in Usage and an Alternative. J Educ Meas. 2000;37(1):77\u0026ndash;86. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.jstor.org/stable/1435063\u003c/span\u003e\u003cspan address=\"http://www.jstor.org/stable/1435063\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePreiksaitis C, Rose C. Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review. JMIR Med Educ. 2023;9:e48785. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/48785\u003c/span\u003e\u003cspan address=\"10.2196/48785\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, et al. A scoping review of artificial intelligence in medical education: BEME Guide 84. Med Teach. 2024;46(4):446\u0026ndash;70. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/0142159X.2024.2314198\u003c/span\u003e\u003cspan address=\"10.1080/0142159X.2024.2314198\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaw AK, So J, Lui CT, Choi YF, Cheung KH, Kei-Ching Hung K, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ. 2025;25(1):208. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s12909-025-06796-6\u003c/span\u003e\u003cspan address=\"10.1186/s12909-025-06796-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions-A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS One. 2023;18(8):e0290691. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pone.0290691\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0290691\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGriot M, Vanderdonckt J, Yuksel D, Hemptinne C. Multiple choice questions and large language models: a case study with fictional medical data. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2406.02394\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2406.02394\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen J, Tao BK, Park S, Bovill E. Can ChatGPT Fool the Match? Artificial Intelligence Personal Statements for Plastic Surgery Residency Applications: A Comparative Study. Plast Surg (Oakville Ont). 2024;22925503241264832. Advance online publication.
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/22925503241264832\u003c/span\u003e\u003cspan address=\"10.1177/22925503241264832\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNaseer MA, Nasir Y, Tabassum A, Ali S. ChatGPT-4 versus human generated final year MBBS multiple-choice questions \u0026ndash; A study from a medical college of Pakistan. J Shalamar Med Dent Coll. 2024;5(2):58\u0026ndash;64. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.53685/jshmdc.v5i2.253\u003c/span\u003e\u003cspan address=\"10.53685/jshmdc.v5i2.253\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAli F, Talat H. AI Integration in MCQ Development: Assessing Quality in Medical Education: A Systematic Review. Life Sci. 2024;5(3):413\u0026ndash;26. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.37185/LnS.1.1.643\u003c/span\u003e\u003cspan address=\"10.37185/LnS.1.1.643\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLindqwister AL, Hassanpour S, Levy J, Sin JM. AI-RADS: Successes and challenges of a novel artificial intelligence curriculum for radiologists across different delivery formats. Front Med Technol. 2023;4:1007708. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3389/fmedt.2022.1007708\u003c/span\u003e\u003cspan address=\"10.3389/fmedt.2022.1007708\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMeo SA, Al-Masri AA, Alotaibi M, Meo MZS, Meo MOS. 
ChatGPT Knowledge Evaluation in Basic and Clinical Medical Sciences: Multiple Choice Question Examination-Based Performance. Healthc (Basel). 2023;11(14):2046. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/healthcare11142046\u003c/span\u003e\u003cspan address=\"10.3390/healthcare11142046\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKıyak YS, Coşkun \u0026Ouml;, Budakoğlu İİ, Uluoğlu C. ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. Eur J Clin Pharmacol. 2024;80(5):729\u0026ndash;35. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00228-024-03649-x\u003c/span\u003e\u003cspan address=\"10.1007/s00228-024-03649-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCoşkun \u0026Ouml;, Kıyak YS, Budakoğlu İİ. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Med Teach. 2025;47(2):268\u0026ndash;74. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/0142159X.2024.2327477\u003c/span\u003e\u003cspan address=\"10.1080/0142159X.2024.2327477\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003eTable 1: Participants and Main Topics of Questions\u0026nbsp;\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Taba\" border=\"1\"\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth colspan=\"3\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eEmergency Medicine Residents (n\u0026thinsp;=\u0026thinsp;18)\u003c/div\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eMean Age (years)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e26.8\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eGender\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eFemale (n/%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e8/44\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eMale (n/%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e10/56\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"3\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eSeniority status\u003c/div\u003e\n\u003cdiv class=\"SimplePara\"\u003eof residents (n)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv 
class=\"SimplePara\"\u003eSenior (\u0026gt;\u0026thinsp;3 years)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e8\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eMiddle Senior (1-3 years)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e5\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eJunior (\u0026lt;\u0026thinsp;1 year)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e5\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eSubject Headings of the Questions (Q)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan class=\"Bold\"\u003eClinical Q / ChatGPT-4o Q\u003c/span\u003e\u003c/div\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan class=\"Bold\"\u003en\u0026thinsp;=\u0026thinsp;50/50\u003c/span\u003e\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eSudden Cardiac Death\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eTraumatic Shock\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eNon-Traumatic Shock And Anaphylaxis\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv 
class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eAcid-Base Disorders, Blood Gas Analysis And Fluid Electrolyte Disorders\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eAsthma And Chronic Obstructive Pulmonary Disease\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eRhythm Disorders\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eAntiarrhythmic, Antihypertensive And Positive Inotropic Drugs Used In Cardiac Rhythm Disorders And Defibrillation - Electrical Cardioversion\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eHyperbaric Oxygen Therapy\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eBasic And Advanced Cardiopulmonary Resuscitation And What To Do In Special Cases (Pregnancy, Drowning, Freezing, Post Cardiac Arrest Care)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv 
class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eInvasive And Non-Invasive Methods In Difficult Airway Management\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eBasic Parameters And Modes In Mechanical Ventilation\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eHemodynamic Monitoring, Cardiac Pacing And Defibrillation\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eVascular Management, Catheterization Techniques\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eChest Pain, Acute Coronary Syndrome (ACS) And Approach To Low-Probability ACS\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eApproach To The Patient With Syncope\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv 
class=\"SimplePara\"\u003eAcute Heart Failure And Valve Diseases\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eDeep Vein Thrombosis And Pulmonary Embolism\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eHypertensive Emergencies And Pulmonary Hypertension\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eAcute Arterial Obstructions, Aortic Aneurysm And Dissection\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eRespiratory Distress\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e3/3\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eTube Thoracostomy, Pericardiocentesis, Paracentesis Techniques In Hemothorax And Pneumothorax\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2/2\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" 
align=\"left\"\u003eTable 2: Average of the Indexes and Participants' Exam Scores\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Tabb\" border=\"1\"\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eClinician-designed questions (n\u0026thinsp;=\u0026thinsp;50)\u003c/div\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eChatGPT-4o questions (n\u0026thinsp;=\u0026thinsp;50)\u003c/div\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan class=\"Italic\"\u003ep\u003c/span\u003e\u003c/div\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eD_index\u003csup\u003e1\u003c/sup\u003e (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.172\u0026thinsp;\u0026plusmn;\u0026thinsp;0.23\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.196\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.634\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eP_index\u003csup\u003e2\u003c/sup\u003e (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.23\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.65\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan 
class=\"Bold\"\u003e0.02\u003c/span\u003e\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003ePBCC\u003csup\u003e3\u003c/sup\u003e (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.268\u0026thinsp;\u0026plusmn;\u0026thinsp;0.30\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.236\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.60\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eExam score (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e67.3\u0026thinsp;\u0026plusmn;\u0026thinsp;9.65\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e76.8\u0026thinsp;\u0026plusmn;\u0026thinsp;8.18\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan class=\"Bold\"\u003e0.003\u003c/span\u003e\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"4\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan class=\"Italic\"\u003eStudent's t-test was performed; P\u0026thinsp;\u0026lt;\u0026thinsp;0.05 was considered statistically significant\u003c/span\u003e\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cem\u003e1: Discrimination index of the item\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e2: Item difficulty index\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e3: Point-biserial correlation coefficient\u003c/em\u003e\u003c/p\u003e\n\u003cdiv 
class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;Table 3: Analysis of Questions According to Discrimination and Difficulty Indexes\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Tabc\" border=\"1\"\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth colspan=\"2\" align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eClinician-designed questions (n\u0026thinsp;=\u0026thinsp;50)\u003c/div\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eChatGPT-4o questions\u003c/div\u003e\n\u003cdiv class=\"SimplePara\"\u003e(n\u0026thinsp;=\u0026thinsp;50)\u003c/div\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e\u003cspan class=\"Italic\"\u003ep\u003c/span\u003e\u003c/div\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"5\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eD_index\u003csup\u003e1\u003c/sup\u003e\u003c/div\u003e\n\u003cdiv class=\"SimplePara\"\u003en (%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eHigh\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e16 (32%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e17 (34%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"5\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.817\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eGood\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv 
class=\"SimplePara\"\u003e0\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eModerate\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e12 (24%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e14 (28%)\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003ePoor\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eNo discrimination\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e22 (44%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e19 (38%)\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"5\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eP_index\u003csup\u003e2\u003c/sup\u003e\u003c/div\u003e\n\u003cdiv class=\"SimplePara\"\u003en(%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eVery easy\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e28\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e18\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"5\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e0.106\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eEasy\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e13\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e14\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eMedium\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e2\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e8\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eDifficult\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e6\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e6\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eVery difficult\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e1\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e4\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"2\" align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003ePBCC\u003csup\u003e3\u003c/sup\u003e\u003c/div\u003e\n\u003cdiv class=\"SimplePara\"\u003en(%)\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eYes\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e7\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e10\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"2\" align=\"left\"\u003e\n\u003cdiv 
class=\"SimplePara\"\u003e0.424\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003eNo\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e43\u003c/div\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cdiv class=\"SimplePara\"\u003e40\u003c/div\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cem\u003e1: Discrimination index of the item\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e2: Item difficulty index: the proportion of correct responses for each item, indicating its difficulty level\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e3: Point-biserial correlation coefficient between item score and total test score. A chi-square test was conducted based on the number of items with statistically significant scores.\u003c/em\u003e\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-medical-education","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"meed","sideBox":"Learn more about [BMC Medical Education](http://bmcmededuc.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/meed/default.aspx","title":"BMC Medical Education","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence, Multiple Choice Questions, Educational Measurement, Emergency Medicine, Psychometrics","lastPublishedDoi":"10.21203/rs.3.rs-6319788/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6319788/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground/aim\u003c/h2\u003e \u003cp\u003eThis study compared the effectiveness and psychometric quality of artificial intelligence (AI)-generated multiple-choice questions (MCQs), specifically from ChatGPT-4o, with clinician-designed MCQs in an emergency medicine residency program.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eEighteen emergency medicine residents participated, completing an examination of 100 questions\u0026mdash;50 AI-generated and 50 clinician-designed\u0026mdash;based on core emergency medicine topics. Psychometric analysis assessed item difficulty, discrimination, and reliability through the point-biserial correlation coefficient (PBCC).\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eResults showed no significant difference in discrimination indices between AI-generated and clinician-designed MCQs, indicating both question sets were similarly effective at differentiating between high and low performers. 
However, AI-generated MCQs were significantly more difficult (mean item difficulty index, 0.65 versus 0.76; p\u0026thinsp;=\u0026thinsp;0.02). Residents performed significantly better on AI-generated questions compared to clinician-designed ones (mean score, 76.8 versus 67.3; p\u0026thinsp;=\u0026thinsp;0.003). Both question sets demonstrated comparable reliability in assessing resident knowledge, as indicated by similar PBCC values.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eThis study highlights the potential for AI-generated MCQs to supplement clinician-designed assessments effectively, demonstrating comparable psychometric properties and reliability. However, the higher difficulty level of AI-generated questions suggests the necessity for expert review and oversight to ensure appropriateness and context accuracy. Further research with larger sample sizes and diverse medical settings is recommended to validate these findings and explore the broader implications of incorporating AI into medical education assessment strategies.\u003c/p\u003e","manuscriptTitle":"Comparison of AI-Generated and Clinician-Designed Multiple Choice Questions in Emergency Medicine Exam: A Psychometric Analysis","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-16 12:00:03","doi":"10.21203/rs.3.rs-6319788/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2025-05-19T04:39:17+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-17T04:59:40+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"140205190733585900183770138260172074039","date":"2025-05-17T01:48:25+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"230062888218235079827699349506465143371","date":"2025-05-15T10:37:00+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"202570953145555737380517968046158623700","date":"2025-05-14T20:58:02+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-14T12:54:52+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"301501385680311545077870932235300142396","date":"2025-05-14T11:32:25+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"336218355858741995058942443276145422980","date":"2025-05-14T08:09:04+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-05-14T07:21:52+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-05-12T06:44:41+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-04-04T05:58:38+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-04-04T05:55:32+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Education","date":"2025-03-27T10:43:44+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-medical-education","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"meed","sideBox":"Learn more about [BMC Medical Education](http://bmcmededuc.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/meed/default.aspx","title":"BMC Medical Education","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"cb6904fb-e3c5-4286-968c-ca6b60353361","owner":[],"postedDate":"May 16th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-07-07T16:10:17+00:00","versionOfRecord":{"articleIdentity":"rs-6319788","link":"https://doi.org/10.1186/s12909-025-07528-6","journal":{"identity":"bmc-medical-education","isVorOnly":false,"title":"BMC Medical Education"},"publishedOn":"2025-07-01 15:57:33","publishedOnDateReadable":"July 1st, 2025"},"versionCreatedAt":"2025-05-16 12:00:03","video":"","vorDoi":"10.1186/s12909-025-07528-6","vorDoiUrl":"https://doi.org/10.1186/s12909-025-07528-6","workflowStages":[]},"version":"v1","identity":"rs-6319788","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6319788","identity":"rs-6319788","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}