Psychometric Performance and Student Perceptions of AI- versus Human-Generated Multiple-Choice Questions: The AHEAD Randomized Controlled Trial

doi:10.21203/rs.3.rs-9187684/v1

2026 · doi:10.21203/rs.3.rs-9187684/v1

preprint OA: closed

Research Square · Research Article

Dheyaa Al-Najafi, Katherine D. Krause, Yundi Wang, Qi Kang Zuo, and 10 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9187684/v1
This work is licensed under a CC BY 4.0 License
Status: Under Revision
Version 1 posted 12

Abstract

Background
Developing high-quality multiple-choice examinations in medical education is time- and resource-intensive. Large language models (LLMs) offer a promising approach to accelerate question development; however, their utility for exam development remains underexplored.

Methods
The AHEAD Trial (AI vs Human Exam Assessment and Development) was a participant-blinded, parallel-group randomized controlled trial conducted among first-year medical students. Students were randomized to complete a 112-item case-based, single-best-answer mock examination composed of either AI-generated or human-generated multiple-choice questions (MCQs). Questions were developed using identical curricular objectives. AI-generated items were produced via a dual-model workflow (ChatGPT for generation; Google Gemini for validation); human-generated items were authored by senior medical students. Outcomes were evaluated using Van der Vleuten’s Assessment Utility Framework across feasibility, acceptability, reliability, validity, and educational impact. Primary analyses were conducted in the intention-to-treat (ITT) population using appropriate parametric or non-parametric tests, with effect sizes and 95% confidence intervals reported.

Results
A total of 258 students were randomized, with 127 allocated to the AI-generated exam arm and 131 to the human-generated exam arm. LLM-assisted MCQ development achieved a 5.6-fold efficiency gain compared with human authorship (4.2 ± 1.9 vs. 19.6 ± 7.5 minutes per item; p 0.05; effect sizes < 0.5). Human-generated items demonstrated slightly higher discrimination indices than AI-generated items, though the effect size was small, and distractor efficiency did not differ between protocols. Student performance was marginally higher on the human-generated exam, though this difference was not significant in the ITT analysis. Exploratory analyses identified theme-specific performance variation and potential gender performance differences on the AI-generated exam. Neither exam meaningfully changed students’ perceived preparedness.

Conclusions
LLMs can substantially accelerate MCQ development while producing formative assessments that are psychometrically comparable and acceptable to learners.
Although small differences persist, these findings support the integration of LLM-assisted item generation within a human-in-the-loop framework, combining AI efficiency with expert oversight to preserve psychometric quality and equity.

Trial registration
This study was retrospectively registered on ClinicalTrials.gov (Identifier NCT07481162; registered March 18, 2026). Prospective registration was not performed as the study was conducted as an embedded educational intervention within a voluntary formative examination setting. The study protocol and statistical analysis plan were prespecified prior to data analysis. The trial is reported in accordance with CONSORT 2025 guidelines.

Keywords: Artificial Intelligence; Medical Education; Multiple-Choice Questions; Large Language Models; ChatGPT; Formative Assessment; Student Perception; Exam Feasibility; Randomized Controlled Trial; Educational Technology

Background

The development of high-quality examination materials for medical students is a resource-intensive process, requiring that instructors balance clinical relevance, evolving guidelines, and psychometric criteria, while also avoiding common item-writing flaws [1, 2]. Developing a single high-quality multiple-choice question (MCQ) is rigorous, often requiring multiple hours of drafting and editing [3–5]. Consequently, cost estimates range from over US$200 (inflation-adjusted) per usable MCQ [3] to more than US$2,000 per MCQ [4], with licensure exam question banks requiring multi-million dollar investments [5]. Large language models (LLMs), such as OpenAI’s Chat Generative Pre-trained Transformer (ChatGPT), have demonstrated strong performance on medical knowledge benchmarks, including the United States Medical Licensing Examination (USMLE) [6]. Given LLMs’ medical exam performance and the high cost of question development, there is growing interest in using LLMs to streamline question generation.
Nevertheless, questions regarding the pedagogical utility of AI-generated exam questions remain [7]. As reviewed elsewhere [7], several recent studies have explored the use of LLMs to generate medical examination questions across training levels and specialties [7–18]. However, few studies [8–11] have directly compared the use of LLM-generated and human-generated examination questions in medical education, and none have used a randomized controlled trial (RCT) study design. Further, as originally conceptualized by van der Vleuten, evaluating the utility of a proposed assessment tool requires a comprehensive analysis of feasibility, reliability, validity, acceptability, and educational impact [19, 20]. In contrast, most studies on the utility of LLM-generated medical exams have focused on isolated components of assessment utility rather than evaluating the exams holistically. Additionally, most evaluations have relied primarily on subject-matter expert judgment [8, 10, 12–17]. While valuable, this does not account for students’ experience and perceptions of such exams. Further methodological limitations include small item sets [13, 17, 18], limited student samples [9, 11], and narrow disciplinary scope [8, 9, 12, 16–18].

The AHEAD Trial (AI vs Human Exam Assessment and Development) is a single-center, participant-blinded RCT designed to address these gaps by evaluating how medical students perform on and perceive AI- versus human-generated MCQs. We engaged a large student cohort to evaluate an extensive and rigorously constructed bank of MCQs across multiple core medical disciplines, enabling robust psychometric and educational analyses. Guided by the Assessment Utility Framework [19], the trial compares feasibility, reliability, validity, acceptability, and educational impact of exams generated by a dual-LLM pipeline (ChatGPT-4 with independent validation by Google Gemini) versus human authors.

Methods

Study design and oversight
This participant-blinded, parallel RCT compared the effect of LLM-generated versus human-generated MCQs in a mock final exam on MD students’ performance and perceptions. This trial received approval from the University of British Columbia (UBC) Behavioral Research Ethics Board (Ethics ID: H16-00044) and was conducted at the UBC Faculty of Medicine on December 8 and 9, 2024. The study protocol, including the statistical analysis plan and outcome measures, was pre-specified prior to data analysis and was not modified retrospectively. This study was retrospectively registered on ClinicalTrials.gov (Identifier: NCT07481162; registered March 18, 2026). The trial is reported in accordance with CONSORT 2025 guidelines [21].

Participants
First-year MD students (UBC MD 2028 cohort) enrolled in Foundations of Medical Practice I (MEDD 411) were recruited via class-wide email and announcements during review sessions. Participation was voluntary, uncompensated, and had no impact on academic standing; instead, the exam served as a preparatory opportunity aligned with summative objectives.

Inclusion and exclusion criteria
Eligible participants were first-year MD students enrolled in MEDD 411 who provided informed consent and completed the mock examination under study conditions. Exclusion criteria were defined a priori, were independent of study outcomes, and included failure to submit the exam, clearly invalid demographic information (i.e., implausible age or GPA values), and completion time under 35 minutes as a proxy for insufficient-effort responding. The rationale and validation of this time threshold are provided in Supplementary Methods 1.1.

Study intervention and blinding
Participants were randomized 1:1 to complete either an AI-generated (AI-gen) or human-generated (Human-gen) MCQ mock exam.
All consenting participants also completed identical pre- and post-exam surveys to collect baseline demographic and academic information and to assess perceptions of the exam; full surveys and outcome definitions are provided in Supplementary Methods 1.6. Randomization was performed by a third party with no prior knowledge of the trial, using a simple, non-blocked computer-generated sequence. The allocation sequence was concealed. Study investigators responsible for recruitment and data collection did not have access to the random allocation sequence. Participants were blinded to the source of the MCQs (AI-gen versus Human-gen). To maintain blinding, all questions were reviewed to remove any indicators of AI or human authorship, and both examination versions were delivered using identical exam links, introductions, and pre- and post-exam surveys. All survey and mock MCQ final exam responses were anonymized, and data were collected and securely managed using REDCap electronic data capture tools hosted at the UBC Faculty of Medicine [22, 23].

AI- versus Human-generated mock exams
Two independent teams of six second-year MD students each developed 112 case-based, single-best-answer MCQs aligned to the same MEDD 411 learning objectives, with comprehensive answer explanations provided for every option (Fig. 1). AI-generated items were produced using ChatGPT-4 and underwent iterative validation by ChatGPT-4 and Google Gemini v1.5 Flash, with revisions performed until items were accurate and aligned to objectives. Human-generated items were authored without AI assistance and validated via independent peer review. Full protocol details and model prompts are provided in Supplementary Methods 1.3. A randomly sampled subset of 30 paired AI-gen and Human-gen MCQs is publicly available, along with de-identified item-level discrimination indices and answer explanations (Supplementary Information, Section 6).
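
The simple (non-blocked) 1:1 allocation described under Study intervention and blinding can be illustrated with a minimal Python sketch. This is not the trial’s actual procedure: the function name, seed, and participant IDs are hypothetical, and the source does not say which software generated the sequence. The sketch only shows that simple randomization assigns each participant independently, so the two arms need not end up the same size.

```python
import random


def simple_randomize(participant_ids, seed=None):
    """Assign each participant to an arm by an independent fair draw.

    Hypothetical helper: no blocking or stratification, so the arm sizes
    are not forced to be equal.
    """
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    arms = ("AI-gen", "Human-gen")
    return {pid: rng.choice(arms) for pid in participant_ids}


# Illustrative run with 258 made-up participant IDs and an arbitrary seed.
allocation = simple_randomize(range(1, 259), seed=2024)
counts = {
    arm: sum(1 for assigned in allocation.values() if assigned == arm)
    for arm in ("AI-gen", "Human-gen")
}
print(counts)
```

Because every draw is independent, an unequal split such as the 127 versus 131 reported in the Results is an expected outcome of this design rather than a sign of a problem.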

Sample size calculation
A medium effect size was set (Cohen’s d = 0.5) to indicate potentially important differences in exam performance and students’ perceptions between AI-gen and Human-gen mock exams [24, 25]. This effect size is widely accepted as a benchmark for educational interventions [24, 25]. The power was set to 80% (β = 0.20) and the two-sided significance level to 0.05 (α = 0.05). Based on these parameters and using standard formulas for comparing two independent means, the required sample size was 100 students per group. There were no interim analyses or stopping guidelines.

Outcome Measures

1. Feasibility Measures
Researchers recorded MCQ generation and proofreading time. The primary outcome was the efficiency ratio (Human-gen time / AI-gen time) per matched learning objective. Mean time per MCQ and total exam generation time were reported as descriptive feasibility metrics to contextualize differences between protocols.

2. Acceptability Measures
We assessed students’ perceptions via a post-exam survey using 10-point Likert scales. These measures included students’ ratings of MCQ clarity, relevance, overall quality, difficulty, adequacy of the exam time, and the exam’s effectiveness in identifying knowledge gaps, assessing clinical understanding, and aiding in information retention for future practice.

3. Reliability Measures
Reliability was assessed using two complementary psychometric properties: the Discrimination Index (DI) and Distractor Efficiency (DE). DI quantifies how well each item differentiates high- from low-performing examinees. For each item, DI was calculated as $\mathrm{DI} = p_{\text{upper}} - p_{\text{lower}}$, where $p_{\text{upper}}$ and $p_{\text{lower}}$ are the proportions of correct responses among the upper and lower 27% of examinees, ranked by total exam score [26]. This item-level analysis captures the distribution of discriminatory power across each of the two exams.
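
As a hedged illustration of the DI definition above, the following Python sketch computes DI for one item from a 0/1 score matrix. It also computes Distractor Efficiency (DE), which the next paragraph defines as the share of distractors chosen by at least 5% of students. The data layout, function names, four-option format, and toy responses are assumptions for illustration; the authors ran their analysis in R and their code is not reproduced here.

```python
def discrimination_index(scores, item, fraction=0.27):
    """DI = p_upper - p_lower for one item.

    scores: one list of 0/1 item scores per examinee (hypothetical layout).
    """
    # Rank examinees by total score, highest first (ties keep input order).
    ranked = sorted(scores, key=sum, reverse=True)
    k = max(1, round(fraction * len(ranked)))  # size of each 27% group
    p_upper = sum(row[item] for row in ranked[:k]) / k
    p_lower = sum(row[item] for row in ranked[-k:]) / k
    return p_upper - p_lower


def distractor_efficiency(choices, key, options, threshold=0.05):
    """Share of distractors picked by at least `threshold` of examinees."""
    distractors = [opt for opt in options if opt != key]
    cutoff = threshold * len(choices)
    functional = sum(1 for d in distractors if choices.count(d) >= cutoff)
    return functional / len(distractors)


# Toy data (not from the trial): 10 examinees x 3 items.
toy_scores = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0],
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0],
]
di_per_item = [discrimination_index(toy_scores, i) for i in range(3)]

# Toy option choices for one item keyed "A": "D" is picked by only 1 of 40.
toy_choices = ["A"] * 24 + ["B"] * 10 + ["C"] * 5 + ["D"] * 1
de = distractor_efficiency(toy_choices, key="A", options="ABCD")
print(di_per_item, de)
```

How ties at the 27% boundary are broken changes group membership; the sketch simply keeps input order, which is one of several defensible choices.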
For a given item, DE is the proportion of distractors selected by at least 5% of students, quantifying the plausibility of incorrect answers. Higher DEs indicate a higher proportion of effective distractors.

4. Validity Measures
Validity was assessed using several complementary indicators (performance outcomes, curricular theme performance, and subgroup analyses) to evaluate whether each exam measured the intended constructs and generated meaningful scores. Student performance on the AI-gen and Human-gen exams was quantified using score distributions and estimates of the effect size of group assignment. The structural validity and content alignment of the two exams were assessed by evaluating student performance stratified by curricular theme. To evaluate the fairness of the exams, subgroup analyses were conducted based on academic background prior to medical school (medical-related or non-medical-related majors) and on gender.

5. Educational Impact
The educational impact of participating in either mock exam was evaluated by assessing changes in students’ self-rated readiness for their upcoming summative exam (pre- versus post-exam) on a 10-point Likert scale (1 = not at all prepared; 10 = extremely prepared).

Harms
No harms were anticipated or observed. Participation consisted of completing a voluntary formative mock examination that had no impact on course grades or academic standing. Although temporary exam-related stress may occur during testing, no adverse events or participant complaints were reported.

Statistical analysis
Analyses were conducted in R using RStudio (version 2024.09.1+394). Primary analyses followed the ITT principle, with per-protocol (PP) analyses performed as sensitivity analyses. Continuous variables were summarized as mean ± standard deviation and categorical variables as counts and percentages. Between-group comparisons used Welch’s t-tests for normally distributed outcomes and Wilcoxon rank-sum tests for non-normal distributions.
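
To make these comparisons concrete, the sketch below computes Welch’s t statistic with its Welch-Satterthwaite degrees of freedom, Cohen’s d, and a rank-biserial correlation in plain Python. The inputs are summary statistics reported later in the Results (ITT exam scores and the item-level DI comparison). Two points are assumptions rather than details given in the source: that Cohen’s d uses the pooled standard deviation, and that the reported W is the Mann-Whitney U statistic for the Human-gen items with 112 items per exam, so that r_rb = 2U / (n1 · n2) − 1.

```python
import math


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d using the pooled standard deviation (assumed variant)."""
    pooled = math.sqrt(
        ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    )
    return (m1 - m2) / pooled


def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from a Mann-Whitney U statistic."""
    return 2 * u / (n1 * n2) - 1


# ITT exam scores as reported in the Results: mean %, SD, group size.
human = (73.4, 12.3, 131)
ai = (71.0, 9.0, 127)

t_stat, df = welch_t(*human, *ai)
d = cohens_d(*human, *ai)

# Item-level DI comparison: W = 8214, assuming 112 items in each exam.
r_rb = rank_biserial(8214, 112, 112)

print(round(t_stat, 2), round(df, 1), round(d, 2), round(r_rb, 3))
```

Under these assumptions the sketch gives d of about 0.22 and r_rb of about 0.31, in line with the effect sizes the authors report. It does not reproduce their p-values, which depend on the raw data and on test options not described in the text.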
Effect sizes were reported using Cohen’s d for parametric comparisons and rank-biserial correlation ($r_{rb}$) for non-parametric comparisons, alongside two-sided p-values and 95% confidence intervals. Item-level DI and DE were compared using non-parametric methods, with uncertainty in mean estimates quantified by bootstrap resampling (5,000 iterations). Educational impact was evaluated using a linear mixed-effects repeated-measures model with fixed effects for time, group, and their interaction, and a random intercept for participants. Full analytic details are provided in Supplementary Methods 1.5.

Results

Sample characteristics
All 328 medical students enrolled in the class were eligible and invited to participate through email invitations and announcements during review sessions. Of these, 258 students (78.7%) provided informed consent and were randomized. The remaining 70 eligible students (21.3%) did not enroll; reasons for non-participation were not formally collected. Among randomized participants, 127 (49.2%) were allocated to the AI-gen exam arm and 131 (50.8%) to the Human-gen exam arm (Fig. 2). Primary analyses were conducted according to the intention-to-treat (ITT) principle and included all randomized participants (N = 258; AI-gen N = 127, Human-gen N = 131). Outcome data were available for all participants because completion of the mock examination and the pre- and post-exam surveys required responses to all items. Per-protocol analyses are reported as sensitivity analyses in Supplementary Results 2.1–2.4. Baseline demographics are summarized in Table 1. There were no statistically significant differences between the two groups in age, gender identity, undergraduate GPA, study hours, or AI familiarity.

Table 1. Baseline characteristics of the ITT population

Variable                                  AI-gen exam (N = 127)   Human-gen exam (N = 131)   p-value
Age, mean (SD)                            24.9 (6.1)              25.5 (7.7)                 0.490
Students’ academic level characteristics, mean (SD)
  GPA %                                   90.3 (14.1)             90.1 (13.3)                0.872
  Hours studying per week                 22.3 (17.0)             23.3 (16.4)                0.658
  Knowledge confidence*                   5.9 (1.5)               6.1 (1.7)                  0.323
  Preparation level for real exam*        5.7 (1.5)               5.8 (1.7)                  0.501
  Test-taking skills*                     6.6 (1.8)               6.7 (1.8)                  0.771
  Time management skills*                 7.0 (2.0)               6.8 (2.2)                  0.470
  Sufficiency of study resources*         6.5 (1.7)               6.6 (1.6)                  0.565
  Retention of information*               5.9 (1.7)               5.9 (1.8)                  0.978
  Flashcard usage*                        5.9 (3.0)               6.0 (2.8)                  0.685
Gender, n (%)                                                                                0.219
  Female                                  74 (58.3)               65 (49.6)
  Male                                    45 (35.4)               59 (45.0)
  Non-binary                              1 (0.8)                 0 (0.0)
  Prefer not to say                       7 (5.5)                 7 (5.3)
Major prior to MD school, n (%)                                                              0.194
  Health sciences                         101 (79.5)              115 (87.8)
  Non-health sciences                     17 (13.4)               11 (8.4)
  Other                                   9 (7.1)                 5 (3.8)
Familiarity with AI in education, n (%)                                                      0.571
  Very familiar                           9 (7.1)                 14 (10.7)
  Familiar                                44 (34.6)               36 (27.5)
  Neutral                                 19 (15.0)               17 (13.0)
  Somewhat familiar                       46 (36.2)               51 (38.9)
  Not familiar at all                     9 (7.1)                 13 (9.9)

* Denotes students’ self-assessments on a 10-point Likert scale.

1. Feasibility
The mean (SD) efficiency ratio was 5.6 (3.5). The distribution of per-item efficiency ratios is shown in Fig. 3. Generating 112 MCQs required 2,195 minutes using the Human-gen protocol compared with 467 minutes using the AI-gen protocol. The mean time per MCQ was 19.6 ± 7.5 minutes for the Human-gen protocol and 4.2 ± 1.9 minutes for the AI-gen protocol (p < 0.001; Supplementary Figure S2).

2. Acceptability
Students’ perceptions of the AI- and human-generated examinations are presented in Fig. 4. Using Welch t-tests, ratings were similar across most acceptability domains.
Four domains showed small differences favoring the human-generated examination: identifying knowledge gaps (AI 7.23 versus Human 7.80; p = 0.01), retention of information for future practice (AI 5.84 versus Human 6.34; p = 0.03), general preparedness for the real exam (AI 5.68 versus Human 6.12; p = 0.04), and understanding clinical concepts (AI 5.78 versus Human 6.18; p = 0.04). The remaining domains (overall difficulty, relevance to course material, enough time to complete exam, question quality, clarity of questions) were not statistically different. Detailed statistics for all acceptability domains, including mean differences (Human − AI), 95% confidence intervals, effect sizes, and p-values, are provided in Supplementary Table S3. Across all nine metrics, effect sizes were small (all Cohen’s |d| ≤ 0.32), and absolute mean differences ranged from 0.09 to 0.57 points on a 10-point scale. Importantly, these differences were below or at the threshold of the pre-specified effect size of 0.5, indicating no practically meaningful differences in perceived acceptability between examination formats.

3. Reliability
Human-gen MCQs demonstrated significantly higher item-level discrimination than AI-gen MCQs (mean DI: 0.25 ± 0.13 versus 0.19 ± 0.15; Mann–Whitney U test: W = 8214, p = 0.0001), corresponding to a small effect size ($r_{rb}$ = 0.310). Bootstrap resampling confirmed overlapping 95% confidence intervals (Human-gen: 0.22–0.27; AI-gen: 0.16–0.22; Fig. 5). There was no statistically significant difference in DE between protocols (mean DE: 39.8% ± 30.4% for Human-gen vs 32.6% ± 27.2% for AI-gen; Wilcoxon rank-sum test, W = 7182, p = 0.070; Fig. 6). The associated effect size was small ($r_{rb}$ = 0.135). Bootstrap resampling demonstrated overlapping 95% confidence intervals for mean DE (Human-gen: 34.3%–45.6%; AI-gen: 27.7%–37.7%; Fig. 5), supporting comparable distractor plausibility across protocols.

4. Validity Measures

4.1 Student Performance Outcomes
In the ITT population (N = 258), mean exam scores were 73.4% ± 12.3% for the Human-gen exam and 71.0% ± 9.0% for the AI-gen exam. This difference was not statistically significant (p = 0.083) and corresponded to a small effect size (Cohen’s d = 0.22), indicating substantial overlap in performance between groups. In the PP population, the Human-gen group demonstrated modestly higher mean scores than the AI-gen group (p = 0.005), though the effect size remained small (Cohen’s d = 0.36). Sensitivity analyses comparing ITT and PP results are presented in Supplementary Results 2.3. Taken together, these findings indicate that student performance was broadly comparable across AI-gen and Human-gen exams. Score distributions for both exams are shown in Supplementary Figure S5.

4.2 Subgroup Analyses
Subgroup analyses demonstrated no meaningful differences in performance by undergraduate academic background within either examination condition. Across both analytic populations (ITT and PP), differences in mean scores across background categories were small and non-significant, indicating comparable performance regardless of prior academic training (Supplementary Results Section 2.3.2 for PP and Section 3.1 for ITT).

Theme-level subgroup analyses demonstrated consistent patterns across both PP and ITT populations. Students performed modestly better on Human-gen items in several curricular themes (Heart Murmur, Upper Gastrointestinal Tract, Pregnancy, and Diabetes), whereas performance favored AI-gen items in Nutrient Absorption. Other themes showed no meaningful differences between examination formats. Although several theme-level comparisons reached statistical significance, standardized effect sizes were uniformly small, indicating limited practical divergence between exam formats. Full statistical details are reported in Supplementary Results Section 2.3.3 (PP) and Section 3.2 (ITT).

Gender-stratified analyses revealed a modest male performance advantage in the AI-gen exam that was not observed in the Human-gen exam. This pattern was consistent across analytic populations but remained small in magnitude and should therefore be interpreted as exploratory (Supplementary Results Section 2.3.4 for PP and Section 3.3 for ITT).

5. Educational impact
Students self-assessed their preparedness for the upcoming summative examination before and after completing the mock exams. A mixed-effects repeated-measures model (random intercept for participant) demonstrated no statistically significant main effects of time (pre- versus post-exam; F(1, 256) = 3.53, p = 0.062, partial η² = 0.01), group assignment (AI-gen versus Human-gen; F(1, 256) = 1.59, p = 0.208, partial η² = 0.006), or time × group interaction (F(1, 256) = 1.68, p = 0.196, partial η² = 0.007). Within-group comparisons demonstrated small increases in preparedness in both groups (AI-gen: mean difference = 0.056, standard error [SE] = 0.141, p = 0.691; Human-gen: mean difference = 0.305, SE = 0.133, p = 0.023). Although the change in the Human-gen group reached statistical significance, the magnitude of the effect was modest and remained well below the pre-specified effect size of 0.5, indicating limited practical significance. Sensitivity analyses using the per-protocol population yielded comparable patterns and are reported in Supplementary Results 2.4.

Discussion

This randomized trial provides the first RCT-level assessment of AI-generated versus human-generated MCQs for formative medical evaluation, examining feasibility, acceptability, reliability, validity, and educational impact. By applying the Assessment Utility Framework to AI-assisted item generation under exam conditions, these findings extend assessment utility theory into the context of LLM-supported assessment design.
Collectively, our findings indicate that LLMs deliver significant gains in feasibility while achieving psychometric performance comparable to Human-gen MCQs. However, modest trade-offs in discrimination, topic-specific score variability, and signals of potential demographic performance differences underscore the ongoing necessity of human oversight of AI-gen exam implementation.

The AI-gen protocol significantly accelerated question development without compromising exam acceptability for students. For a given learning objective, generating a Human-gen MCQ took on average 5.6 times as long as generating an AI-gen MCQ. This efficiency gain is consistent with the 5–10-fold acceleration conferred by AI in previous studies of MCQ generation for medical education [9, 10, 27]. Critically, the decrease in time required per question did not have a practically meaningful detrimental effect on the acceptability of the exam to learners. Student perceptions of the two exams were either statistically equivalent or favoured the Human-gen exam, albeit with a small effect size. This is concordant with prior research which demonstrated that, despite nuanced differences in acceptability metrics, AI-gen MCQs were assessed by both content experts [28] and students [11] to be of comparable overall quality to those written by experts. Together, these results indicate that AI-gen MCQs are a feasible alternative to Human-gen MCQs for formative assessments and that the two exams were comparably acceptable to students.

The DI values for both the AI- and Human-gen exams, while acceptable for a formative assessment, were suboptimal for high-stakes or formal summative assessments [29–31]. The DI for the Human-gen exam was significantly higher than that for the AI-gen exam, but the effect size of this difference was small, suggesting that the protocols had only a minor effect on the discriminative power of the exams.
The suppression of discrimination indices across both arms likely reflects the open-resource format of the mock exam, which may artificially inflate scores of lower-performing students. Moreover, the DE distributions for the Human- and AI-gen exams did not differ significantly, which indicates that the distractors produced using the AI-gen protocol were as plausible to students as those produced with the Human-gen protocol.

Score comparability provides evidence supporting the construct validity of the AI-generated exam. While Human-generated items produced modestly higher scores in the PP population, effect sizes were small and the difference attenuated in the ITT analysis. The substantial overlap in score distributions across analytic populations suggests that both formats measured similar underlying constructs without meaningful divergence in performance outcomes.

Content-level analysis showed theme-dependent performance differences between some AI-gen and Human-gen question themes. Notably, these differences should not be interpreted as a direct indicator of MCQ quality, but rather as a signal that students’ performance varied depending on the MCQ generation process (AI versus Human) for the same themes and objectives. This suggests a potential interaction between generation protocols and curricular content. Future studies should test whether prompt engineering can mitigate theme-dependent variation in AI-generated MCQs.

An exploratory subgroup analysis by gender identified a potential performance difference on the AI-gen exam, with male students outperforming female students with a small-to-moderate effect size ($r_{rb}$ = 0.26). This pattern was not observed on the Human-generated exam ($r_{rb}$ = 0.07). Although this analysis was exploratory and underpowered, the observed signal underscores the importance of ongoing human oversight and demographic monitoring as LLMs are integrated into medical education assessment pipelines.

This study has several limitations that should be considered when interpreting the results. First, the trial was conducted at a single center with only first-year MD students. The use of a mock examination format may limit the generalizability of these findings to high-stakes summative assessments. Both question-generation teams consisted of senior medical students rather than faculty experts, which could underestimate the psychometric advantage achievable by experienced human item writers. The open-resource format of the mock exam may have artificially suppressed the discrimination indices for both groups. Findings should be interpreted in the context of rapidly evolving LLM capabilities, as ongoing improvements in model performance and alignment may affect the generalizability of these results over time [32]. Finally, this study was retrospectively registered, which may introduce risk of reporting bias; however, all outcomes and analyses were pre-specified prior to data analysis to mitigate this risk.

Conclusions

The AHEAD Trial demonstrates that LLMs can dramatically accelerate MCQ development while producing formative assessments that are acceptable to learners and psychometrically comparable to human-generated questions. While human-generated items currently demonstrate modest advantages in discrimination, the performance gap was small in this study and may narrow further as LLMs continue to improve.

Abbreviations

AI, Artificial Intelligence
DI, Discrimination Index
LLM, Large Language Model
MCQ, Multiple-Choice Question
MD, Medical Doctor
RCT, Randomized Controlled Trial
SD, Standard Deviation
AHEAD, AI vs Human Exam Assessment and Development
GPT, Generative Pre-trained Transformer
CI, Confidence Interval
REDCap, Research Electronic Data Capture

Declarations

Ethics approval and consent to participate
This study was approved by the University of British Columbia Behavioral Research Ethics Board (Ethics ID: H16-00044).
All participants provided informed consent prior to enrollment. Participation was voluntary, uncompensated, and had no impact on academic standing.

Availability of data and materials
The datasets supporting the conclusions of this article are partially available in the Zenodo repository, https://zenodo.org/records/18284890, which contains a randomized subset of 30 paired AI-generated and human-generated multiple-choice questions for transparency and reproducibility. Additional de-identified participant-level data, analysis code, and full datasets supporting the findings of this study are available from the corresponding author upon reasonable request. Relevant summary data are included within the article and its supplementary files.

Competing interests
The authors declare that they have no competing interests.

Funding
This study received no external funding.

Consent for publication
Not applicable.

Authors’ contributions
Dheyaa Al-Najafi (DAN) conceived and designed the study, led the development of both AI-generated and human-generated multiple-choice question (MCQ) protocols, coordinated data collection, conducted the statistical analyses, interpreted the data, generated figures, and drafted the initial manuscript. Katherine D. Krause (KDK) contributed substantially to manuscript drafting and revision, advised on statistical methodology, assisted with figure generation and figure captions, and contributed to the overall structure and critical refinement of the manuscript. Yundi Wang (YW) contributed to the initial drafting of the manuscript and participated in critical revision and refinement of subsequent versions. Qi Kang Zuo (QKZ) contributed to the initial study design, participated in MCQ development and validation, and contributed to drafting and finalizing the manuscript.
Maya Koblanski (MK), Cameron Leong (CL), Emma Schmidt (ES), Muhammad Faran (MF), Vanay Verma (VV), Ravi Vyas (RV), Matthew Campbell (MC), and Jaehyun Hwang (JH) contributed to MCQ development and to writing the initial draft, and reviewed the manuscript for important intellectual content. Jiawen Deng (JD) advised on psychometric methodology and statistical interpretation and critically revised the manuscript. Anita Palepu (AP) provided senior supervision and substantial intellectual leadership throughout the study. She contributed to study conception and design, provided critical methodological and statistical guidance, advised on analytic strategy and interpretation of findings, and critically revised the manuscript for intellectual rigor, clarity, and educational relevance. All authors reviewed and approved the final manuscript and agree to be accountable for the work.

Acknowledgements
The authors thank the UBC MD Class of 2028 for their participation in this study. We also acknowledge the support of the UBC Student Learning Group (SLG) and the administrative and technical support provided by the REDCap platform at the University of British Columbia. We are especially grateful to Dr. Kevin Eva for his contributions to this work, including conceptual guidance on the application of the Assessment Utility Framework, input on study design and statistical methodology, and critical review of the manuscript.

References
1. Parekh P, Bahadoor V. The Utility of Multiple-Choice Assessment in Current Medical Education: A Critical Review. Cureus. 2024;16(5):e59778.
2. Royal KD, Hedgpeth MW, Jeon T, Colford CM. Automated Item Generation: The Future of Medical Education Assessment. EMJ Innov. 2018;88–93.
3. Case SM, Holtzman K, Ripkey DR. Developing an Item Pool for CBT: A Practical Comparison of Three Models of Item Writing. Acad Med. 2001;76(Supplement):S111–3.
4. Rudner LM. Implementing the Graduate Management Admission Test Computerized Adaptive Test. In: van der Linden W, Glas C, editors. Elements of Adaptive Testing [Internet]. New York, NY: Springer; 2009 [cited 2025 Jul 13]. pp. 151–65. (Statistics for Social and Behavioral Sciences). Available from: https://link.springer.com/chapter/10.1007/978-0-387-85461-8_8#chapter-info
5. Gierl MJ, Lai H, Turner SR. Using automatic item generation to create multiple-choice test items. Med Educ. 2012;46(8):757–65.
6. Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg BS, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discov Appl Sci. 2024;6(10):500.
7. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ [Internet]. 2024 Mar 29 [cited 2025 Jul 23];24(1). Available from: https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-024-05239-y
8. Mistry NP, Saeed H, Rafique S, Le T, Obaid H, Adams SJ. Large Language Models as Tools to Generate Radiology Board-Style Multiple-Choice Questions. Acad Radiol. 2024;31(9):3872–8.
9. Law AK, So J, Lui CT, Choi YF, Cheung KH, Kei-ching Hung K, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ [Internet]. 2025 Feb 8 [cited 2025 Jul 23];25(1). Available from: https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-025-06796-6
10. Cheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE. 2023;18(8):e0290691.
11. Elzayyat M, Mohammad JN, Zaqout S. Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education. Med Educ Online. 2025;30(1):2554678.
12. Camarata T, McCoy L, Rosenberg R, Temprine Grellinger KR, Brettschnieder K, Berman J. LLM-Generated multiple choice practice quizzes for preclinical medical students. Adv Physiol Educ. 2025;49(3):758–63.
13. Biswas S. Passing is Great: Can ChatGPT Conduct USMLE Exams? Ann Biomed Eng. 2023;51(9):1885–6.
14. Balu A, Prvulovic ST, Fernandez Perez C, Kim A, Donoho DA, Keating G. Evaluating the value of AI-generated questions for USMLE step 1 preparation: A study using ChatGPT-3.5. Med Teach. 2025;1–9.
15. Klang E, Portugez S, Gross R, Kassif Lerner R, Brenner A, Gilboa M, et al. Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ. 2023;23:772.
16. Ayub I, Hamann D, Hamann CR, Davis MJ. Exploring the Potential and Limitations of Chat Generative Pre-trained Transformer (ChatGPT) in Generating Board-Style Dermatology Questions: A Qualitative Analysis. Cureus. 2023;15(8):e43717.
17. Sevgi UT, Erol G, Doğruel Y, Sönmez OF, Tubbs RS, Güngor A. The role of an open artificial intelligence platform in modern neurosurgical education: a preliminary study. Neurosurg Rev. 2023;46(1):86.
18. Han Z, Battaglia F, Udaiyar A, Fooks A, Terlecky SR. An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. Med Teach. 2024;46(5):657–64.
19. Van Der Vleuten CPM. The assessment of professional competence: Developments, research and practical implications. Adv Health Sci Educ. 1996;1(1):41–67.
20. Colbert-Getz JM, Ryan M, Hennessey E, Lindeman B, Pitts B, Rutherford KA, et al. Measuring Assessment Quality With an Assessment Utility Rubric for Medical Education. MedEdPORTAL. 2017;10588.
21. Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMC Med. 2010;8:18.
22. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inf. 2009;42(2):377–81.
23. Harris PA, Taylor R, Minor BL, Elliott V, Fernandez M, O’Neal L, et al. The REDCap consortium: Building an international community of software platform partners. J Biomed Inf. 2019;95:103208.
24. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, N.J: L. Erlbaum Associates; 1988. p. 567.
25. Lakens D. Sample Size Justification. van Ravenzwaaij D, editor. Collabra Psychol. 2022;8(1):33267.
26. Ebel RL, Frisbie DA. Evaluating Test and Item Characteristics. Essentials of Educational Measurement. 5th ed. Englewood Cliffs, NJ: Prentice-Hall Inc.; 1991. pp. 220–40.
27. Zuckerman M, Flood R, Tan RJB, Kelp N, Ecker DJ, Menke J, et al. ChatGPT for assessment writing. Med Teach. 2023;45(11):1224–7.
28. Wu H, Zerner T, Lee D, Court-Kowalski S, Devitt P, Palmer E. GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality. Med Teach. 2025;1–14.
29. Rao C, Kishan Prasad H, Sajitha K, Permi H, Shetty J. Item analysis of multiple choice questions: Assessing an assessment tool in medical students. Int J Educ Psychol Res. 2016;2(4):201.
30. Hingorjo MR, Jaleel F. Analysis of One-Best MCQs: the Difficulty Index, Discrimination Index and Distractor Efficiency. J Pak Med Assoc. 2012;62(2).
31. AlKhatib HS, Brazeau G, Akour A, Almuhaissen SA. Evaluation of the effect of items’ format and type on psychometric properties of sixth year pharmacy students clinical clerkship assessment items. BMC Med Educ. 2020;20:190.
32. Qiu Z, Jiang A, Qi C, Gan W, Zhu L, Mou W, et al. Temporal evolution of large language models (LLMs) in oncology. J Transl Med. 2025;23(1):1219.

Additional Declarations
No competing interests reported.
Supplementary Files
SIAHEADTrial.docx

Editorial decision: Revision requested 24 Apr, 2026
Reviews received at journal 24 Apr, 2026
Reviews received at journal 20 Apr, 2026
Reviewers agreed at journal 10 Apr, 2026
Reviewers agreed at journal 10 Apr, 2026
Reviews received at journal 09 Apr, 2026
Reviewers agreed at journal 07 Apr, 2026
Reviewers invited by journal 28 Mar, 2026
Editor invited by journal 26 Mar, 2026
Editor assigned by journal 25 Mar, 2026
Submission checks completed at journal 25 Mar, 2026
First submitted to journal 21 Mar, 2026
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-9187684","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":615480181,"identity":"7d96ba2e-2821-4317-82ee-4f0cf95b03ec","order_by":0,"name":"Dheyaa Al-Najafi","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Dheyaa","middleName":"","lastName":"Al-Najafi","suffix":""},{"id":615480182,"identity":"d13ff6bc-b294-4b14-abf2-a4196bfbc6c3","order_by":1,"name":"Katherine D. Krause","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Katherine","middleName":"D.","lastName":"Krause","suffix":""},{"id":615480183,"identity":"2b1b625e-f4fe-49b1-b159-7f46e65f4276","order_by":2,"name":"Yundi Wang","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Yundi","middleName":"","lastName":"Wang","suffix":""},{"id":615480184,"identity":"af7b4d1d-9191-4265-9f39-63bc8c26eeda","order_by":3,"name":"Qi Kang Zuo","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Qi","middleName":"Kang","lastName":"Zuo","suffix":""},{"id":615480185,"identity":"d0b4fde6-c4cd-4863-93f5-a26d51b0c359","order_by":4,"name":"Maya Koblanski","email":"","orcid":"","institution":"University of British 
Columbia","correspondingAuthor":false,"prefix":"","firstName":"Maya","middleName":"","lastName":"Koblanski","suffix":""},{"id":615480186,"identity":"936ab2a5-f20e-4351-a75b-16df04bba082","order_by":5,"name":"Cameron J. Leong","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Cameron","middleName":"J.","lastName":"Leong","suffix":""},{"id":615480187,"identity":"7297036f-8ef1-4b40-a2f3-92a5b597403a","order_by":6,"name":"Emma Schmidt","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Emma","middleName":"","lastName":"Schmidt","suffix":""},{"id":615480188,"identity":"0ea4b50f-7ca7-477a-8167-32a516de9cf5","order_by":7,"name":"Muhammad Faran","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Muhammad","middleName":"","lastName":"Faran","suffix":""},{"id":615480189,"identity":"9e95422f-1e8c-4fa3-908f-76c326e21d7f","order_by":8,"name":"Vanay Verma","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Vanay","middleName":"","lastName":"Verma","suffix":""},{"id":615480190,"identity":"46bf2009-854f-493e-bf4f-eea463141146","order_by":9,"name":"Ravi Vyas","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Ravi","middleName":"","lastName":"Vyas","suffix":""},{"id":615480191,"identity":"1b1f4284-68e9-473b-8a17-51fd644024bf","order_by":10,"name":"Matthew Campbell","email":"","orcid":"","institution":"University of British Columbia","correspondingAuthor":false,"prefix":"","firstName":"Matthew","middleName":"","lastName":"Campbell","suffix":""},{"id":615480192,"identity":"f3c8d7c8-c347-47ff-be23-94844a2de5e1","order_by":11,"name":"Jaehyun Hwang","email":"","orcid":"","institution":"University of British 
Columbia","correspondingAuthor":false,"prefix":"","firstName":"Jaehyun","middleName":"","lastName":"Hwang","suffix":""},{"id":615480193,"identity":"d28b7a8d-804c-4e44-8534-c461b79cbcfb","order_by":12,"name":"Jiawen Deng","email":"","orcid":"","institution":"St. Michael's Hospital","correspondingAuthor":false,"prefix":"","firstName":"Jiawen","middleName":"","lastName":"Deng","suffix":""},{"id":615480194,"identity":"4df38651-bd1b-433f-a53c-dc2c64310a57","order_by":13,"name":"Anita Palepu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA5ElEQVRIiWNgGAWjYFACxgYwxQciPjAcIEpLI1gPG4g5gzgtUGtAWph5iNHCP7u5/cEPhjvybOy9x6Rt2+7IM7AffoBXi8Sdg42NPQzPDNt4zqVJ57Y9M2zgSTPAb82NxMYGHobDjG0SOWZALYcTGCQY8GuRB2pp/MNw2L5N/o2ZtCVYC/sHvFoMgFqagbYktknwmEkzgrXw4LfFEKhltozBs+Q2nrxky55zIE/lFODVIncj/cHHNxV3bPvZzx688aPsjjw/+/ENeLVAnXcASPBA2GxEqAcBJC2jYBSMglEwCtABALxTR03vHOBdAAAAAElFTkSuQmCC","orcid":"","institution":"University of British Columbia","correspondingAuthor":true,"prefix":"","firstName":"Anita","middleName":"","lastName":"Palepu","suffix":""}],"badges":[],"createdAt":"2026-03-21 19:08:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9187684/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9187684/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":106401806,"identity":"eb22fc5c-18b2-49c7-9bfd-3dc817ece146","added_by":"auto","created_at":"2026-04-08 09:09:46","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":385360,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of the protocols for Human- and AI-gen MCQs. 
The initial inputs for both protocols included a course learning objective, a researcher-formulated question stem, and a standardized list of question-making guidelines, which specified that MCQs should be single-best-answer, case-based, and moderately difficult, with five answer options and plausible distractors. * denotes input components that were identical between the two study arms. The revision process for Human-gen MCQs involved peer review by an independent student; the revision process for AI-gen MCQs included revisions by both ChatGPT-4 and Google Gemini.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/ed195e0c27536f3423b1a978.png"},{"id":106058038,"identity":"861bf5e3-b45d-4795-9527-37cc05564c06","added_by":"auto","created_at":"2026-04-03 02:15:47","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":617490,"visible":true,"origin":"","legend":"\u003cp\u003eCONSORT flow diagram showing participant randomization, allocation to intervention arms, and inclusion in the per-protocol analysis for the AHEAD Trial.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/d9a88b16eace7dcb9318cd75.png"},{"id":106094298,"identity":"14a697cf-cdeb-4720-88b5-57c027993f02","added_by":"auto","created_at":"2026-04-03 11:42:05","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":361380,"visible":true,"origin":"","legend":"\u003cp\u003eHistogram showing the distribution of efficiency ratios for each of the 112 matched MCQs across the two exam protocols. The efficiency ratio was calculated as the time required to generate a Human-generated MCQ divided by the time required to generate the corresponding AI-generated MCQ for the same learning objective. 
Values greater than 1 indicate improved time efficiency for the AI-generated protocol.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/c496848fa0acaa1325bc09bc.png"},{"id":106095021,"identity":"4bac3b82-9598-48bd-bbe3-5e9fefcfe539","added_by":"auto","created_at":"2026-04-03 11:43:58","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":135613,"visible":true,"origin":"","legend":"\u003cp\u003eStudent-reported perceptions of AI- and Human-gen exams in the ITT population. Bars represent mean scores on a 10-point Likert scale, with error bars indicating standard error (\u003cem\u003eN\u003c/em\u003e = 127 for AI-gen, \u003cem\u003eN\u003c/em\u003e = 131 for Human-gen). \u0026nbsp;Statistical comparisons were conducted using Welch’s t-test; “*” denote p \u0026lt; \u0026nbsp;0.05 and “ns” indicates non-significance (p ≥ 0.05).\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/aef21d3018781e70fca77fc6.png"},{"id":106094600,"identity":"379a39f3-7150-4297-b460-a1999d1b7fd6","added_by":"auto","created_at":"2026-04-03 11:42:58","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":281698,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003e(A)\u003c/strong\u003e Boxplot comparing discrimination indices (DI) for AI-gen and Human-gen items. Box midline represents the median value; box bounds represent the interquartile range (IQR, 25th–75th percentile). Whiskers extend to the most extreme data points within 1.5 × IQR of the quartiles. Data points lying outside this range are plotted individually. 
\u003cstrong\u003e(B)\u003c/strong\u003e Comparison of the DI distributions for AI-gen (blue) and Human-gen (grey) MCQs.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/9b45641501f3a22b72ce76dd.png"},{"id":106058040,"identity":"63b7f980-7eb8-4ad3-84fa-20801dac2c8e","added_by":"auto","created_at":"2026-04-03 02:15:47","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":209397,"visible":true,"origin":"","legend":"\u003cp\u003e(A) Boxplots comparing the DE values of the AI- and Human-gen exams in the ITT population. Box midline represents the median DE; the box limits indicate the IQR; and whiskers extend to 1.5 × IQR. Data points lying outside this range are plotted individually. (B)\u003cstrong\u003e \u003c/strong\u003eDistribution of DE values for AI-gen (blue) and human-gen (grey) MCQs. Dashed vertical lines indicate the mean DE for each exam (AI-gen in blue; human-gen in grey).\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/863e2abf36e5a7b6caa55e3b.png"},{"id":106406790,"identity":"e9471fb8-c057-4a35-bd49-6083d89bc2d6","added_by":"auto","created_at":"2026-04-08 09:34:06","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3226382,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/73d8a6b1-1985-4d35-bd15-67e0bac98233.pdf"},{"id":106058036,"identity":"e77d29c3-1e16-4ea6-864a-916709c89b1f","added_by":"auto","created_at":"2026-04-03 
02:15:47","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":910356,"visible":true,"origin":"","legend":"","description":"","filename":"SIAHEADTrial.docx","url":"https://assets-eu.researchsquare.com/files/rs-9187684/v1/2f974db3a4603801713d3c76.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Psychometric Performance and Student Perceptions of AI- versus Human-Generated Multiple-Choice Questions: The AHEAD Randomized Controlled Trial","fulltext":[{"header":"Background","content":"\u003cp\u003eThe development of high-quality examination materials for medical students is a resource-intensive process, requiring that instructors balance clinical relevance, evolving guidelines, and psychometric criteria, while also avoiding common item-writing flaws.\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e Developing a single high-quality multiple-choice question (MCQ) is rigorous, often requiring multiple hours of drafting and editing.\u003csup\u003e\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e Consequently, cost estimates range from over US\u003cspan\u003e$\u003c/span\u003e200 (inflation-adjusted) per usable MCQ\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e to more than US\u003cspan\u003e$\u003c/span\u003e2,000 per MCQ,\u003csup\u003e4\u003c/sup\u003e with licensure exam question banks requiring multi-million dollar investments.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eLarge language models (LLMs), such as OpenAI\u0026rsquo;s Chat Generative Pre-trained 
Transformer (ChatGPT), have demonstrated strong performance on medical knowledge benchmarks, including the United States Medical Licensing Examination (USMLE).\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e Given LLMs\u0026rsquo; medical exam performance and the high cost of question development, there is growing interest in using LLMs to streamline question generation. Nevertheless, questions regarding the pedagogical utility of AI-generated exam questions remain.\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eAs reviewed elsewhere,\u003csup\u003e7\u003c/sup\u003e several recent studies have explored the use of LLMs to generate medical examination questions across training levels and specialties.\u003csup\u003e\u003cspan additionalcitationids=\"CR8 CR9 CR10 CR11 CR12 CR13 CR14 CR15 CR16 CR17\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e However, few studies\u003csup\u003e\u003cspan additionalcitationids=\"CR9 CR10\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e have directly compared the use of LLM-generated and human-generated examination questions in medical education and none have used a randomized controlled trial (RCT) study design. 
Further, as originally conceptualized by van der Vleuten, evaluating the utility of a proposed assessment tool requires a comprehensive analysis of feasibility, reliability, validity, acceptability, and educational impact.\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e In contrast, most studies on the utility of LLM-generated medical exams have focused on isolated components of assessment utility rather than evaluating the exams holistically. Additionally, most evaluations have relied primarily on subject-matter expert judgment.\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan additionalcitationids=\"CR13 CR14 CR15 CR16\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e While valuable, this does not account for students\u0026rsquo; experience and perceptions of such exams. 
Further methodological limitations include small item sets,\u003csup\u003e13,17,18\u003c/sup\u003e limited student samples,\u003csup\u003e9,11\u003c/sup\u003e and narrow disciplinary scope.\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan additionalcitationids=\"CR17\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eThe \u003cb\u003eAHEAD Trial\u003c/b\u003e (\u003cb\u003eA\u003c/b\u003eI vs \u003cb\u003eH\u003c/b\u003euman \u003cb\u003eE\u003c/b\u003exam \u003cb\u003eA\u003c/b\u003essessment and \u003cb\u003eD\u003c/b\u003eevelopment) is a single-center, participant-blinded RCT designed to address these gaps by evaluating how medical students perform on and perceive AI- versus human-generated MCQs. We engaged a large student cohort to evaluate an extensive and rigorously constructed bank of MCQs across multiple core medical disciplines, enabling robust psychometric and educational analyses. Guided by the Assessment Utility Framework,\u003csup\u003e19\u003c/sup\u003e the trial compares feasibility, reliability, validity, acceptability, and educational impact of exams generated by a dual-LLM pipeline (ChatGPT-4 with independent validation by Google Gemini) versus human authors.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy design and oversight\u003c/h2\u003e \u003cp\u003eThis participant-blinded, parallel RCT compared the effect of LLM-generated versus human-generated MCQs in a mock final exam on MD students\u0026rsquo; performance and perceptions. 
This trial received approval from the University of British Columbia (UBC) Behavioral Research Ethics Board (Ethics ID: H16-00044) and was conducted at the UBC Faculty of Medicine on December 8 and 9, 2024. The study protocol, including the statistical analysis plan and outcome measures, was pre-specified \u003cem\u003ea priori\u003c/em\u003e, before data analysis, and was not modified retrospectively. This study was retrospectively registered on ClinicalTrials.gov (Identifier: NCT07481162; registered March 18, 2026). The trial is reported in accordance with CONSORT 2025 guidelines.\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eParticipants\u003c/h3\u003e\n\u003cp\u003e First-year MD students (UBC MD 2028 cohort) enrolled in Foundations of Medical Practice I (MEDD 411) were recruited via class-wide email and announcements during review sessions. Participation was voluntary, uncompensated, and had no impact on academic standing; instead, the exam served as a preparatory opportunity aligned with summative objectives.\u003c/p\u003e\n\u003ch3\u003eInclusion and exclusion criteria\u003c/h3\u003e\n\u003cp\u003eEligible participants were first-year MD students enrolled in MEDD 411 who provided informed consent and completed the mock examination under study conditions. Exclusion criteria were defined \u003cem\u003ea priori\u003c/em\u003e, were independent of study outcomes, and included failure to submit the exam, clearly invalid demographic information (\u003cem\u003ei.e.\u003c/em\u003e, implausible age or GPA values), and completion time under 35 minutes as a proxy for insufficient-effort responding. 
The rationale and validation of this time threshold are provided in \u003cb\u003eSupplementary Methods 1.1\u003c/b\u003e.\u003c/p\u003e\n\u003ch3\u003eStudy intervention and blinding\u003c/h3\u003e\n\u003cp\u003eParticipants were randomized 1:1 to complete either an AI-generated (AI-gen) or human-generated (Human-gen) MCQ mock exam. All consenting participants also completed identical pre- and post-exam surveys to collect baseline demographic and academic information and to assess perceptions of the exam; full surveys and outcome definitions are provided in \u003cb\u003eSupplementary Methods 1.6\u003c/b\u003e.\u003c/p\u003e \u003cp\u003eRandomization was performed by a third party with no prior knowledge of the trial, using a simple, non-blocked computer-generated sequence. The allocation sequence was concealed. Study investigators responsible for recruitment and data collection did not have access to the random allocation sequence. Participants were blinded to the source of the MCQs (AI-gen versus human-gen). To maintain blinding, all questions were reviewed to remove any indicators of AI or human authorship, and both examination versions were delivered using identical exam links, introductions, and pre- and post-exam surveys. 
All survey and mock MCQ final exam responses were anonymized, and data were collected and securely managed using REDCap electronic data capture tools hosted at the UBC Faculty of Medicine.\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n\u003ch3\u003eAI- versus Human-generated mock exams\u003c/h3\u003e\n\u003cp\u003eTwo independent teams of six second-year MD students each developed 112 case-based, single-best-answer MCQs aligned to the same MEDD 411 learning objectives, with comprehensive answer explanations provided for every option \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003cb\u003e).\u003c/b\u003e AI-generated items were produced using ChatGPT-4 and underwent iterative validation by ChatGPT-4 and Google Gemini v1.5 Flash, with revisions performed until items were accurate and aligned to objectives. Human-generated items were authored without AI assistance and validated via independent peer review. Full protocol details and model prompts are provided in \u003cb\u003eSupplementary Methods 1.3\u003c/b\u003e. 
A randomly sampled subset of 30 paired AI-gen and human-gen MCQs is publicly available, along with a de-identified item-level discrimination index and answer explanation (\u003cb\u003eSupplementary Information, Section 6)\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eSample size calculation\u003c/h2\u003e \u003cp\u003eA medium effect size was set (Cohen\u0026rsquo;s \u003cem\u003ed\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.5) to indicate potentially important differences in exam performance and students\u0026rsquo; perceptions between AI-gen and Human-gen mock exams.\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e This effect size is widely accepted as a benchmark for educational interventions.\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e The power was set to 80% (β\u0026thinsp;=\u0026thinsp;0.20) and the two-sided significance level to 0.05 (α\u0026thinsp;=\u0026thinsp;0.05). Based on these parameters and using standard formulas for comparing two independent means, the required sample size was 100 students per group. There were no interim analyses or stopping guidelines.\u003c/p\u003e \u003cp\u003e \u003cb\u003eOutcome Measures\u003c/b\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e1. Feasibility Measures\u003c/b\u003e \u003c/p\u003e \u003cp\u003eResearchers recorded MCQ generation and proofreading time. The primary outcome was the efficiency ratio (Human-gen time / AI-gen time) per matched learning objective. Mean time per MCQ and total exam generation time were reported as descriptive feasibility metrics to contextualize differences between protocols.\u003c/p\u003e \u003cp\u003e \u003cb\u003e2. 
Acceptability Measures\u003c/b\u003e \u003c/p\u003e \u003cp\u003eWe assessed students\u0026rsquo; perceptions via a post-exam survey using 10-point Likert scales. These measures include students\u0026rsquo; ratings of MCQ clarity, relevance, overall quality, difficulty, adequacy of the exam time, and the exam\u0026rsquo;s effectiveness in identifying knowledge gaps, assessing clinical understanding, and aiding in information retention for future practice.\u003c/p\u003e \u003cp\u003e \u003cb\u003e3. Reliability Measures\u003c/b\u003e \u003c/p\u003e \u003cp\u003eReliability was assessed using two complementary psychometric properties: the Discrimination Index (DI) and Distractor Efficiency (DE).\u003c/p\u003e \u003cp\u003eDI quantifies how well each item differentiates high- from low-performing examinees. For each item, DI was calculated using the formula: (DI\u0026thinsp;=\u0026thinsp;p\u003csub\u003eupper\u003c/sub\u003e \u0026ndash; p\u003csub\u003elower\u003c/sub\u003e), where p\u003csub\u003eupper\u003c/sub\u003e and p\u003csub\u003elower\u003c/sub\u003e represent the proportion of correct responses among the upper and lower 27% of examinees based on total exam score.\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e This item-level analysis captures the distribution of discriminatory power across each of the two exams. For a given item, DE is the proportion of distractors selected by at least 5% of students, quantifying the plausibility of incorrect answers. Higher DEs indicate a higher proportion of effective distractors.\u003c/p\u003e \u003cp\u003e \u003cb\u003e4. 
Validity Measures\u003c/b\u003e \u003c/p\u003e \u003cp\u003eValidity was assessed using several complementary indicators\u0026mdash;performance outcomes, curricular theme performance, subgroup analyses\u0026mdash;to evaluate whether each exam measured the intended constructs and generated meaningful scores.\u003c/p\u003e \u003cp\u003eStudent performance on the AI- and Human-gen exams was quantified using score distributions and estimates of the effect size of group assignment. The structural validity and content alignment of the two exams were assessed by evaluating student performance stratified by curricular theme. To evaluate the fairness of the exams, subgroup analyses were conducted based on academic background prior to medical school (medical-related or non-medical-related majors) and on gender.\u003c/p\u003e \u003cp\u003e \u003cb\u003e5. Educational Impact\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe educational impact of participating in either mock exam was evaluated by assessing changes in students\u0026rsquo; self-rated perception of readiness for their upcoming summative exam (pre- versus post-exam) via a 10-point Likert scale (1\u0026thinsp;=\u0026thinsp;not at all prepared; 10\u0026thinsp;=\u0026thinsp;extremely prepared).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eHarms:\u003c/h3\u003e\n\u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003eNo harms were anticipated or observed. Participation consisted of completing a voluntary formative mock examination that had no impact on course grades or academic standing. Although temporary exam-related stress may occur during testing, no adverse events or participant complaints were reported.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eAnalyses were conducted in R using RStudio (version 2024.09.1\u0026thinsp;+\u0026thinsp;394). 
Primary analyses followed the ITT principle, with PP analyses performed as sensitivity analyses. Continuous variables were summarized as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation and categorical variables as counts and percentages. Between-group comparisons used Welch\u0026rsquo;s t-tests for normally distributed outcomes and Wilcoxon rank-sum tests for non-normal distributions. Effect sizes were reported using Cohen\u0026rsquo;s \u003cem\u003ed\u003c/em\u003e for parametric comparisons and rank-biserial correlation (r\u003csub\u003erb\u003c/sub\u003e) for non-parametric comparisons, alongside two-sided \u003cem\u003ep\u003c/em\u003e-values and 95% confidence intervals. Item-level DI and DE were compared using non-parametric methods, with uncertainty in mean estimates quantified by using bootstrap resampling (5,000 iterations). Educational impact was evaluated using a linear mixed-effects repeated-measures model with fixed effects for time, group, and their interaction, and a random intercept for participants. Full analytic details are provided in \u003cb\u003eSupplementary Methods 1.5.\u003c/b\u003e\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eSample characteristics\u003c/h2\u003e \u003cp\u003eAll 328 medical students enrolled in the class were eligible and invited to participate through email invitations and announcements during review sessions. Of these, 258 students (78.7%) provided informed consent and were randomized. The remaining 70 eligible students (21.3%) did not enroll; reasons for non-participation were not formally collected.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eAmong randomized participants, 127 (49.2%) were allocated to the AI-gen exam arm and 131 (50.8%) to the human-gen exam arm \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cb\u003e)\u003c/b\u003e. 
Primary analyses were conducted according to the intention-to-treat (ITT) principle and included all randomized participants (N\u0026thinsp;=\u0026thinsp;258; AI-gen N\u0026thinsp;=\u0026thinsp;127, Human-gen N\u0026thinsp;=\u0026thinsp;131). Outcome data were available for all participants because completion of the mock examination and the pre- and post-exam surveys required responses to all items. Per-protocol analyses are reported as sensitivity analyses in \u003cb\u003eSupplementary Results 2.1\u0026ndash;2.4\u003c/b\u003e.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eBaseline demographics are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. There were no statistical differences between the two groups in terms of age, gender identity, undergraduate GPA, study hours, or AI familiarity.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBaseline Characteristics of the ITT population\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVariable\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAI-gen exam\u003c/p\u003e \u003cp\u003e(\u003cem\u003eN\u003c/em\u003e\u0026thinsp;=\u0026thinsp;127)\u003c/p\u003e \u003c/th\u003e \u003cth 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHuman-gen exam\u003c/p\u003e \u003cp\u003e (\u003cem\u003eN\u003c/em\u003e\u0026thinsp;=\u0026thinsp;131)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003ep\u003c/em\u003e-Value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAge, Mean (SD)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e24.9 (6.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e25.5 (7.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.490\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eStudents\u0026rsquo; academic level characteristics, Mean (SD)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPA %\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e90.3 (14.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e90.1 (13.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.872\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHours studying per week\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e22.3 (17.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e23.3 (16.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.658\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKnowledge confidence*\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5.9 (1.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.1 (1.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.323\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePreparation level for real exam*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5.7 (1.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5.8 (1.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.501\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTest-taking skills*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6.6 (1.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.7 (1.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.771\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTime management skills*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7.0 (2.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.8 (2.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.470\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSufficiency of study resources*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6.5 (1.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.6 (1.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003e0.565\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRetention of information*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5.9 (1.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5.9 (1.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.978\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFlashcard usage*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5.9 (3.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.0 (2.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.685\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGender (%)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e74 (58.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e65 (49.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.219\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e45 (35.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e59 
(45.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNon-binary\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (0.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0 (0.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrefer not to say\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7 (5.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7 (5.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eMajor prior to MD school\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHealth sciences\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e101 (79.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e115 (87.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.194\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNon-health sciences\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e17 (13.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e11 (8.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOther\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e9 (7.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5 (3.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eFamiliarity with AI in education\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVery familiar\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e9 (7.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e14 (10.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.571\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFamiliar\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e44 (34.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e36 (27.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNeutral\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e19 (15.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e17 (13.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSomewhat familiar\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e46 (36.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e51 (38.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNot familiar at all\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e9 (7.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e13 (9.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"4\"\u003e* Denotes students\u0026rsquo; self-assessments on a 10-point Likert scale\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e1. Feasibility\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe mean efficiency ratio (Human-gen time / AI-gen time) was 5.6 (SD 3.5). The distribution of per-item efficiency ratios is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Generating 112 MCQs required 2,195 minutes using the Human-gen protocol compared with 467 minutes using the AI-gen protocol. 
The mean time per MCQ was 19.6\u0026thinsp;\u0026plusmn;\u0026thinsp;7.5 minutes for the Human-gen protocol and 4.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.9 minutes for the AI-gen protocol (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001; \u003cb\u003eSupplementary Figure S2\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e2. Acceptability\u003c/b\u003e \u003c/p\u003e \u003cp\u003eStudents\u0026rsquo; perceptions of the AI- and human-generated examinations are presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e. Using Welch \u003cem\u003et\u003c/em\u003e-tests, ratings were similar across most acceptability domains. Four domains showed small differences favoring the human-generated examination: identifying knowledge gaps (AI 7.23 versus Human 7.80; \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.01), retention of information for future practice (AI 5.84 versus Human 6.34; \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.03), general preparedness for the real exam (AI 5.68 versus Human 6.12; \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.04), and understanding clinical concepts (AI 5.78 versus Human 6.18; \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.04). The remaining domains (overall difficulty, relevance to course material, enough time to complete exam, question quality, clarity of questions) were not statistically different. Detailed statistics for all acceptability domains, including mean differences (Human\u0026thinsp;\u0026minus;\u0026thinsp;AI), 95% confidence intervals, effect sizes, and \u003cem\u003ep\u003c/em\u003e-values, are provided in \u003cb\u003eSupplementary Table S3\u003c/b\u003e.\u003c/p\u003e \u003cp\u003eAcross all nine metrics, effect sizes were small (all Cohen\u0026rsquo;s |\u003cem\u003ed\u003c/em\u003e| \u0026le; 0.32), and absolute mean differences ranged from 0.09 to 0.57 points on a 10-point scale. 
Importantly, these differences were below or at the threshold of the pre-specified effect size of 0.5, indicating no practically meaningful differences in perceived acceptability between examination formats.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e3. Reliability\u003c/b\u003e \u003c/p\u003e \u003cp\u003eHuman-gen MCQs demonstrated significantly higher item-level discrimination than AI-gen MCQs (mean DI: 0.25\u0026thinsp;\u0026plusmn;\u0026thinsp;0.13 versus 0.19\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15; Wilcoxon rank-sum test: \u003cem\u003eW\u003c/em\u003e\u0026thinsp;=\u0026thinsp;8214, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.0001), corresponding to a small effect size (\u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003erb\u003c/em\u003e\u003c/sub\u003e = 0.310). Bootstrap resampling yielded 95% confidence intervals for mean DI that overlapped only at their boundary (Human-gen: 0.22\u0026ndash;0.27; AI-gen: 0.16\u0026ndash;0.22; Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThere was no statistically significant difference in DE between protocols (mean DE: 39.8% \u0026plusmn; 30.4% for Human-gen versus 32.6% \u0026plusmn; 27.2% for AI-gen; Wilcoxon rank-sum test, \u003cem\u003eW\u003c/em\u003e\u0026thinsp;=\u0026thinsp;7182, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.070; Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). The associated effect size was small (\u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003erb\u003c/em\u003e\u003c/sub\u003e = 0.135). 
Bootstrap resampling demonstrated overlapping 95% confidence intervals for mean DE (Human-gen: 34.3%\u0026ndash;45.6%; AI-gen: 27.7%\u0026ndash;37.7%; Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e), supporting comparable distractor plausibility across protocols.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e4. Validity Measures\u003c/b\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e4.1 Student Performance Outcomes\u003c/b\u003e \u003c/p\u003e \u003cp\u003eIn the ITT population (\u003cem\u003eN\u003c/em\u003e\u0026thinsp;=\u0026thinsp;258), mean exam scores were 73.4% \u0026plusmn; 12.3% for the Human-gen exam and 71.0% \u0026plusmn; 9.0% for the AI-gen exam. This difference was not statistically significant (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.083) and corresponded to a small effect size (Cohen\u0026rsquo;s \u003cem\u003ed\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.22), indicating substantial overlap in performance between groups. In the PP population, the Human-gen group demonstrated modestly higher mean scores than the AI-gen group (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.005), though the effect size remained small (Cohen\u0026rsquo;s \u003cem\u003ed\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.36). Sensitivity analyses comparing ITT and PP results are presented in the \u003cb\u003eSupplementary Results 2.3.\u003c/b\u003e\u003c/p\u003e \u003cp\u003eTaken together, these findings indicate that student performance was broadly comparable across AI-gen and Human-gen exams. 
Score distributions for both exams are shown in \u003cb\u003eSupplementary Figure S5\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003cb\u003e4.2 Subgroup Analyses\u003c/b\u003e:\u003c/p\u003e \u003cp\u003eSubgroup analyses demonstrated no meaningful differences in performance by undergraduate academic background within either examination condition. Across both analytic populations (ITT and PP), mean scores and effect sizes were small and non-significant across background categories, indicating comparable performance regardless of prior academic training (Supplementary Results Section \u003cb\u003e2.3.2\u003c/b\u003e for PP and Section \u003cb\u003e3.1\u003c/b\u003e for ITT).\u003c/p\u003e \u003cp\u003eTheme-level subgroup analyses demonstrated consistent patterns across both PP and ITT populations. Students performed modestly better on Human-gen items in several curricular themes (Heart Murmur, Upper Gastrointestinal Tract, Pregnancy, and Diabetes), whereas performance favored AI-gen items in Nutrient Absorption. Other themes showed no meaningful differences between examination formats. Although several theme-level comparisons reached statistical significance, standardized effect sizes were uniformly small, indicating limited practical divergence between exam formats. Full statistical details are reported in Supplementary Results Section \u003cb\u003e2.3.3\u003c/b\u003e (PP) and Section \u003cb\u003e3.2\u003c/b\u003e (ITT).\u003c/p\u003e \u003cp\u003eGender-stratified analyses revealed a modest male performance advantage in the AI-gen exam that was not observed in the Human-gen exam. This pattern was consistent across analytic populations but remained small in magnitude and should therefore be interpreted as exploratory (Supplementary Results Section \u003cb\u003e2.3.4\u003c/b\u003e for PP and Section \u003cb\u003e3.3\u003c/b\u003e for ITT).\u003c/p\u003e \u003cp\u003e \u003cb\u003e5. 
Educational impact\u003c/b\u003e \u003c/p\u003e \u003cp\u003eStudents self-assessed their preparedness for the upcoming summative examination before and after completing the mock exams. A mixed-effects repeated-measures model (random intercept for participant) demonstrated no statistically significant main effects of time (pre- versus post-exam; \u003cem\u003eF\u003c/em\u003e(1,256)\u0026thinsp;=\u0026thinsp;3.53, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.062, η\u0026sup2;p\u0026thinsp;=\u0026thinsp;0.01), group assignment (AI-gen versus Human-gen; \u003cem\u003eF\u003c/em\u003e(1,256)\u0026thinsp;=\u0026thinsp;1.59, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.208, η\u0026sup2;p\u0026thinsp;=\u0026thinsp;0.006), or time \u0026times; group interaction (\u003cem\u003eF\u003c/em\u003e(1,256)\u0026thinsp;=\u0026thinsp;1.68, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.196, η\u0026sup2;p\u0026thinsp;=\u0026thinsp;0.007).\u003c/p\u003e \u003cp\u003eWithin-group comparisons demonstrated small increases in preparedness in both groups (AI-gen: mean difference\u0026thinsp;=\u0026thinsp;0.056, SE [standard error]\u0026thinsp;=\u0026thinsp;0.141, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.691; Human-gen: mean difference\u0026thinsp;=\u0026thinsp;0.305, SE\u0026thinsp;=\u0026thinsp;0.133, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.023). Although the change in the Human-gen group reached statistical significance, the magnitude of the effect was modest and remained well below the pre-specified effect size of 0.5, indicating limited practical significance. 
Sensitivity analyses using the per-protocol population yielded comparable patterns and are reported in the \u003cb\u003eSupplementary Results 2.4\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis randomized trial provides the first RCT-level assessment of AI-generated versus human-generated MCQs for formative medical evaluation, examining feasibility, acceptability, reliability, validity, and educational impact. By applying the Assessment Utility Framework to AI-assisted item generation under exam conditions, this study extends assessment utility theory into the context of LLM-supported assessment design. Collectively, our findings indicate that LLMs deliver significant gains in feasibility while achieving psychometric performance comparable to Human-gen MCQs. However, modest trade-offs in discrimination, topic-specific score variability, and signals of potential demographic performance differences underscore the ongoing necessity of human oversight of AI-gen exam implementation.\u003c/p\u003e \u003cp\u003eThe AI-gen protocol significantly accelerated question development without compromising exam acceptability for students. For a given learning objective, generating a Human-gen MCQ took on average 5.6 times as long as generating an AI-gen MCQ. This efficiency gain is consistent with the 5\u0026ndash;10-fold acceleration conferred by AI in previous studies of MCQ generation for medical education.\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e Critically, the decrease in time required per question did not have a practically meaningful detrimental effect on the acceptability of the exam to learners. 
Student perceptions of the two exams either did not differ significantly or favoured the Human-gen exam, albeit with a small effect size. This is concordant with prior research that demonstrated that, despite nuanced differences in acceptability metrics, AI-gen MCQs were assessed by both content experts\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e and students\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e to be of comparable overall quality to those written by experts. Together, these results indicate that AI-gen MCQs are a feasible alternative to Human-gen MCQs for formative assessments and that the two exams were broadly equally acceptable to students.\u003c/p\u003e \u003cp\u003eThe DI values for both the AI- and Human-gen exams, while acceptable for a formative assessment, were suboptimal for high-stakes or formal summative assessments.\u003csup\u003e\u003cspan additionalcitationids=\"CR30\" citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e The DI for the Human-gen exam was significantly higher than that for the AI-gen exam, but the effect size of this difference was small, suggesting that the protocols had only a minor effect on the discriminative power of the exams. The suppression of discrimination indices across both arms likely reflects the open-resource format of the mock exam, which may artificially inflate scores of lower-performing students. 
Moreover, the DE distributions for the Human- and AI-gen exams did not differ significantly, which suggests that the distractors produced using the AI-gen protocol were about as plausible to students as those produced with the Human-gen protocol.\u003c/p\u003e \u003cp\u003eScore comparability provides evidence supporting the construct validity of the AI-generated exam. While Human-generated items produced modestly higher scores in the PP population, effect sizes were small and the difference attenuated in the ITT analysis. The substantial overlap in score distributions across analytic populations suggests that both formats measured similar underlying constructs without meaningful divergence in performance outcomes.\u003c/p\u003e \u003cp\u003eContent-level analysis showed performance differences between AI-gen and Human-gen questions in some curricular themes. Notably, these differences should not be interpreted as a direct indicator of MCQ quality, but rather as a signal that students\u0026rsquo; performance varied depending on the MCQ generation process (AI versus Human) for the same themes and objectives. This suggests a potential interaction between generation protocols and curricular content. Future studies should test whether prompt engineering can mitigate theme-dependent variation in AI-generated MCQs.\u003c/p\u003e \u003cp\u003eAn exploratory subgroup analysis by gender identified a potential performance difference on the AI-gen exam, with male students outperforming female students with a small-to-moderate effect size (\u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003erb\u003c/em\u003e\u003c/sub\u003e = 0.26). This pattern was not observed on the Human-generated exam (\u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003erb\u003c/em\u003e\u003c/sub\u003e = 0.07). 
Although this analysis was exploratory and underpowered, the observed signal underscores the importance of ongoing human oversight and demographic monitoring as LLMs are integrated into medical education assessment pipelines.\u003c/p\u003e \u003cp\u003eThis study has several limitations that should be considered when interpreting the results. First, the trial was conducted at a single center with only first-year MD students. The use of a mock examination format may limit the generalizability of these findings to high-stakes summative assessments. Both question-generation teams consisted of senior medical students rather than faculty experts, which could underestimate the psychometric advantage achievable by experienced human item writers. The open-resource format of the mock exam may have artificially suppressed the discrimination indices for both groups. Findings should be interpreted in the context of rapidly evolving LLM capabilities, as ongoing improvements in model performance and alignment may affect the generalizability of these results over time.\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e Finally, this study was retrospectively registered, which may introduce risk of reporting bias; however, all outcomes and analyses were pre-specified prior to data analysis to mitigate this risk.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThe AHEAD Trial demonstrates that LLMs can dramatically accelerate MCQ development while producing formative assessments that are acceptable to learners and psychometrically comparable to human-generated questions. 
While human-generated items currently demonstrate modest advantages in discrimination, the performance gap was small in this study and may narrow further as LLMs continue to improve.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAI, Artificial Intelligence\u003c/p\u003e\n\u003cp\u003eDI, Discrimination Index\u003c/p\u003e\n\u003cp\u003eLLM, Large Language Model\u003c/p\u003e\n\u003cp\u003eMCQ, Multiple-Choice Question\u003c/p\u003e\n\u003cp\u003eMD, Medical Doctor\u003c/p\u003e\n\u003cp\u003eRCT, Randomized Controlled Trial\u003c/p\u003e\n\u003cp\u003eSD, Standard Deviation\u003c/p\u003e\n\u003cp\u003eAHEAD, \u003cstrong\u003eA\u003c/strong\u003eI vs \u003cstrong\u003eH\u003c/strong\u003euman \u003cstrong\u003eE\u003c/strong\u003exam \u003cstrong\u003eA\u003c/strong\u003essessment and \u003cstrong\u003eD\u003c/strong\u003eevelopment\u003c/p\u003e\n\u003cp\u003eGPT, Generative Pre-trained Transformer\u003c/p\u003e\n\u003cp\u003eCI, Confidence Interval\u003c/p\u003e\n\u003cp\u003eREDCap, Research Electronic Data Capture\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eEthics approval and consent to participate\u003c/h2\u003e\n\u003cp\u003eThis study was approved by the University of British Columbia Behavioral Research Ethics Board (Ethics ID: H16-00044). All participants provided informed consent prior to enrollment. Participation was voluntary, uncompensated, and had no impact on academic standing.\u003c/p\u003e\n\u003ch2\u003eAvailability of data and materials\u003c/h2\u003e\n\u003cp\u003eThe datasets supporting the conclusions of this article are partially available in the Zenodo repository, https://zenodo.org/records/18284890, which contains a randomized subset of 30 paired AI-generated and human-generated multiple-choice questions for transparency and reproducibility. 
Additional de-identified participant-level data, analysis code, and full datasets supporting the findings of this study are available from the corresponding author upon reasonable request. Relevant summary data are included within the article and its supplementary files.\u003c/p\u003e\n\u003ch2\u003eCompeting interests\u003c/h2\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003ch2\u003eFunding\u003c/h2\u003e\n\u003cp\u003eThis study received no external funding.\u003c/p\u003e\n\u003ch2\u003eConsent for publication\u003c/h2\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003ch2\u003eAuthors\u0026rsquo; contributions\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eDheyaa Al-Najafi (DAN)\u003c/strong\u003e conceived and designed the study, led the development of both AI-generated and human-generated multiple-choice question (MCQ) protocols, coordinated data collection, conducted the statistical analyses, interpreted the data, generated figures, and drafted the initial manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eKatherine D. 
Krause (KDK)\u003c/strong\u003e contributed substantially to manuscript drafting and revision, advised on statistical methodology, assisted with figure generation and figure captions, and contributed to the overall structure and critical refinement of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eYundi Wang (YW)\u003c/strong\u003e contributed to the initial drafting of the manuscript and participated in critical revision and refinement of subsequent versions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eQi Kang Zuo (QKZ)\u003c/strong\u003e contributed to the initial study design, participated in MCQ development and validation, and contributed to drafting and finalizing the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMaya Koblanski (MK), Cameron Leong (CL), Emma Schmidt (ES), Muhammad Faran (MF), Vanay Verma (VV), Ravi Vyas (RV), Matthew Campbell (MC), and Jaehyun Hwang (JH)\u003c/strong\u003e contributed to MCQ development and the writing of the initial draft, and reviewed the manuscript for important intellectual content.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJiawen Deng (JD)\u003c/strong\u003e advised on psychometric methodology and statistical interpretation and critically revised the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAnita Palepu (AP)\u003c/strong\u003e provided senior supervision and substantial intellectual leadership throughout the study. 
She contributed to study conception and design, provided critical methodological and statistical guidance, advised on analytic strategy and interpretation of findings, and critically revised the manuscript for intellectual rigor, clarity, and educational relevance.\u003c/p\u003e\n\u003cp\u003eAll authors reviewed and approved the final manuscript and agree to be accountable for the work.\u003c/p\u003e\n\u003ch2\u003eAcknowledgements\u003c/h2\u003e\n\u003cp\u003eThe authors thank the UBC MD Class of 2028 for their participation in this study. 
We also acknowledge the support of the UBC Student Learning Group (SLG) and the administrative and technical support provided by the REDCap platform at the University of British Columbia.\u003c/p\u003e\n\u003cp\u003eWe are especially grateful to Dr. Kevin Eva for his contributions to this work, including conceptual guidance on the application of the Assessment Utility Framework, input on study design and statistical methodology, and critical review of the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eParekh P, Bahadoor V. The Utility of Multiple-Choice Assessment in Current Medical Education: A Critical Review. Cureus. 2024;16(5):e59778.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoyal KD, Hedgpeth MW, Jeon T, Colford CM. Automated Item Generation: The Future of Medical Education Assessment. EMJ Innov. 2018;88\u0026ndash;93.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCase SM, Holtzman K, Ripkey DR. Developing an Item Pool for CBT: A Practical Comparison of Three Models of Item Writing. Acad Med. 2001;76(Supplement):S111\u0026ndash;3.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRudner LM. Implementing the Graduate Management Admission Test Computerized Adaptive Test. In: van der Linden W, Glas C, editors. Elements of Adaptive Testing [Internet]. New York, NY: Springer; 2009 [cited 2025 Jul 13]. pp. 151\u0026ndash;65. (Statistics for Social and Behavioral Sciences). 
Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://link.springer.com/chapter/\u003c/span\u003e\u003cspan address=\"https://link.springer.com/chapter/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/978-0-387-85461-8_8#chapter-info\u003c/span\u003e\u003cspan address=\"10.1007/978-0-387-85461-8_8#chapter-info\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGierl MJ, Lai H, Turner SR. Using automatic item generation to create multiple-choice test items. Med Educ. 2012;46(8):757\u0026ndash;65.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrin D, Sorin V, Konen E, Nadkarni G, Glicksberg BS, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discov Appl Sci. 2024;6(10):500.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArtsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ [Internet]. 2024 Mar 29 [cited 2025 Jul 23];24(1). Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://bmcmededuc.biomedcentral.com/articles/\u003c/span\u003e\u003cspan address=\"https://bmcmededuc.biomedcentral.com/articles/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12909-024-05239-y\u003c/span\u003e\u003cspan address=\"10.1186/s12909-024-05239-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMistry NP, Saeed H, Rafique S, Le T, Obaid H, Adams SJ. 
Large Language Models as Tools to Generate Radiology Board-Style Multiple-Choice Questions. Acad Radiol. 2024;31(9):3872\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaw AK, So J, Lui CT, Choi YF, Cheung KH, Kei-ching Hung K et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ [Internet]. 2025 Feb 8 [cited 2025 Jul 23];25(1). Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://bmcmededuc.biomedcentral.com/articles/\u003c/span\u003e\u003cspan address=\"https://bmcmededuc.biomedcentral.com/articles/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12909-025-06796-6\u003c/span\u003e\u003cspan address=\"10.1186/s12909-025-06796-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions\u0026mdash;A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE. 2023;18(8):e0290691.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eElzayyat M, Mohammad JN, Zaqout S. Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education. Med Educ Online. 2025;30(1):2554678.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCamarata T, McCoy L, Rosenberg R, Temprine Grellinger KR, Brettschnieder K, Berman J. LLM-Generated multiple choice practice quizzes for preclinical medical students. Adv Physiol Educ. 2025;49(3):758\u0026ndash;63.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBiswas S. 
Passing is Great: Can ChatGPT Conduct USMLE Exams? Ann Biomed Eng. 2023;51(9):1885\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBalu A, Prvulovic ST, Fernandez Perez C, Kim A, Donoho DA, Keating G. Evaluating the value of AI-generated questions for USMLE step 1 preparation: A study using ChatGPT-3.5. Med Teach. 2025;1\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKlang E, Portugez S, Gross R, Kassif Lerner R, Brenner A, Gilboa M, et al. Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ. 2023;23:772.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAyub I, Hamann D, Hamann CR, Davis MJ. Exploring the Potential and Limitations of Chat Generative Pre-trained Transformer (ChatGPT) in Generating Board-Style Dermatology Questions: A Qualitative Analysis. Cureus. 2023;15(8):e43717.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSevgi UT, Erol G, Doğruel Y, S\u0026ouml;nmez OF, Tubbs RS, G\u0026uuml;ngor A. The role of an open artificial intelligence platform in modern neurosurgical education: a preliminary study. Neurosurg Rev. 2023;46(1):86.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan Z, Battaglia F, Udaiyar A, Fooks A, Terlecky SR. An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. Med Teach. 2024;46(5):657\u0026ndash;64.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Der Vleuten CPM. The assessment of professional competence: Developments, research and practical implications. Adv Health Sci Educ. 1996;1(1):41\u0026ndash;67.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eColbert-Getz JM, Ryan M, Hennessey E, Lindeman B, Pitts B, Rutherford KA et al. Measuring Assessment Quality With an Assessment Utility Rubric for Medical Education. MedEdPORTAL. 
2017;10588.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMC Med. 2010;8:18.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)\u0026mdash;A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inf. 2009;42(2):377\u0026ndash;81.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarris PA, Taylor R, Minor BL, Elliott V, Fernandez M, O\u0026rsquo;Neal L, et al. The REDCap consortium: Building an international community of software platform partners. J Biomed Inf. 2019;95:103208.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, N.J: L. Erlbaum Associates; 1988. p. 567.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLakens D. Sample Size Justification. van Ravenzwaaij D, editor. Collabra Psychol. 2022;8(1):33267.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEbel RL, Frisbie DA. Evaluating Test and Item Characteristics. Essentials of Educational Measurement. 5th ed. Englewood Cliffs, NJ: Prentice-Hall Inc.; 1991. pp. 220\u0026ndash;40.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZuckerman M, Flood R, Tan RJB, Kelp N, Ecker DJ, Menke J, et al. ChatGPT for assessment writing. Med Teach. 2023;45(11):1224\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu H, Zerner T, Lee D, Court-Kowalski S, Devitt P, Palmer E. GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality. Med Teach. 2025;1\u0026ndash;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRao C, Kishan Prasad H, Sajitha K, Permi H, Shetty J. 
Item analysis of multiple choice questions: Assessing an assessment tool in medical students. Int J Educ Psychol Res. 2016;2(4):201.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHingorjo MR, Jaleel F. Analysis of One-Best MCQs: the Difficulty Index, Discrimination Index and Distractor Efficiency. J Pak Med Assoc. 2012;62(2).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlKhatib HS, Brazeau G, Akour A, Almuhaissen SA. Evaluation of the effect of items\u0026rsquo; format and type on psychometric properties of sixth year pharmacy students clinical clerkship assessment items. BMC Med Educ. 2020;20:190.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQiu Z, Jiang A, Qi C, Gan W, Zhu L, Mou W, et al. Temporal evolution of large language models (LLMs) in oncology. J Transl Med. 2025;23(1):1219.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-education","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"meed","sideBox":"Learn more about [BMC Medical Education](http://bmcmededuc.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/meed/default.aspx","title":"BMC Medical Education","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence, Medical Education, Multiple-Choice Questions, Large Language 
Models, ChatGPT, Formative Assessment, Student Perception, Exam Feasibility, Randomized Controlled Trial, Educational Technology","lastPublishedDoi":"10.21203/rs.3.rs-9187684/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9187684/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eDeveloping high-quality multiple-choice examinations in medical education is time- and resource-intensive. Large language models (LLMs) offer a promising approach to accelerate question development; however, their utility for exam development remains underexplored.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eThe AHEAD Trial (\u003cb\u003eA\u003c/b\u003eI vs \u003cb\u003eH\u003c/b\u003euman \u003cb\u003eE\u003c/b\u003exam \u003cb\u003eA\u003c/b\u003essessment and \u003cb\u003eD\u003c/b\u003eevelopment) was a participant-blinded, parallel-group randomized controlled trial conducted among first-year medical students. Students were randomized to complete a 112-item case-based, single-best-answer mock examination composed of either AI-generated or human-generated multiple-choice questions (MCQs). Questions were developed using identical curricular objectives. AI-generated items were produced via a dual-model workflow (ChatGPT for generation; Google Gemini for validation); human-generated items were authored by senior medical students. Outcomes were evaluated using Van der Vleuten\u0026rsquo;s Assessment Utility Framework across feasibility, acceptability, reliability, validity, and educational impact. 
Primary analyses were conducted in the intention-to-treat (ITT) population using appropriate parametric or non-parametric tests, with effect sizes and 95% confidence intervals reported.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eA total of 258 students were randomized, with 127 allocated to the AI-generated exam arm and 131 to the human-generated exam arm. LLM-assisted MCQ development achieved a 5.6-fold efficiency gain compared with human authorship (4.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.9 vs. 19.6\u0026thinsp;\u0026plusmn;\u0026thinsp;7.5 minutes per item; \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.0001). Student perceptions of exam acceptability\u0026mdash;including clarity, difficulty, relevance, and educational value\u0026mdash;were comparable between AI-generated and human-generated exams (all \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026gt;\u0026thinsp;0.05; effect sizes\u0026thinsp;\u0026lt;\u0026thinsp;0.5). Human-generated items demonstrated slightly higher discrimination indices than AI-generated items, though the effect size was small, and distractor efficiency did not differ between protocols. Student performance was marginally higher on the human-generated exam, though this difference was not significant in the ITT analysis. Exploratory analyses identified theme-specific performance variation and potential gender performance differences on the AI-generated exam. Neither exam meaningfully changed students\u0026rsquo; perceived preparedness.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eLLMs can substantially accelerate MCQ development while producing formative assessments that are psychometrically comparable and acceptable to learners. 
Although small differences persist, these findings support the integration of LLM-assisted item generation within a human-in-the-loop framework, combining AI efficiency with expert oversight to preserve psychometric quality and equity.\u003c/p\u003e\u003ch2\u003eTrial registration\u003c/h2\u003e \u003cp\u003eThis study was retrospectively registered on ClinicalTrials.gov (Identifier NCT07481162 registered March 18, 2026). Prospective registration was not performed as the study was conducted as an embedded educational intervention within a voluntary formative examination setting. The study protocol and statistical analysis plan were prespecified prior to data analysis. The trial is reported in accordance with CONSORT 2025 guidelines.\u003c/p\u003e","manuscriptTitle":"Psychometric Performance and Student Perceptions of AI- versus Human-Generated Multiple-Choice Questions: The AHEAD Randomized Controlled Trial","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-03 02:15:42","doi":"10.21203/rs.3.rs-9187684/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2026-04-24T09:06:31+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-24T06:43:06+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-20T20:05:13+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"190571688650171623981670409467260785941","date":"2026-04-10T08:19:54+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"73814195523860013388485265664839623713","date":"2026-04-10T05:18:31+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-09T10:45:11+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"52765885871740946278283658413951636592","date":"2026-04-07T10:23:55+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-28T19:47:24+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-03-26T11:26:46+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-03-25T11:02:42+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-03-25T11:01:46+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Education","date":"2026-03-21T18:54:42+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-education","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"meed","sideBox":"Learn more about [BMC Medical Education](http://bmcmededuc.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/meed/default.aspx","title":"BMC Medical Education","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC 
Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"7fe53aa5-885e-4222-9f92-2a472d7db3c0","owner":[],"postedDate":"April 3rd, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2026-04-24T09:39:24+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-03 02:15:42","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9187684","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9187684","identity":"rs-9187684","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
