Quality assurance and validity of AI-generated Single Best Answer questions

doi:10.21203/rs.3.rs-5666975/v1

2025 · doi:10.21203/rs.3.rs-5666975/v1

preprint OA: closed


Research Article

Ayla Ahmed, Ellen Kerr, Andrew O'Malley

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5666975/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 25 Feb, 2025; the published version is in BMC Medical Education.

Abstract

Background

Recent advancements in generative artificial intelligence have opened new avenues in educational methodologies, particularly in medical education. This study seeks to assess whether generative AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the Covid era due to the prevalence of open-book examinations, and to augment the pool of formative assessment opportunities available to students.
While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself.

Summary of Work

This research utilized a commercially available AI large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions, adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) of the Scottish Graduate-Entry Medicine (ScotGEM) program. The AI-generated questions underwent quality-assurance screening to ensure compliance with the stipulated guidelines and LOs. A subset of these questions was then incorporated into an examination format alongside an equal number of human-authored questions and subsequently undertaken by a cohort of medical students. The performance of both AI-generated and human-authored questions was evaluated, focusing on facility and discrimination index as key metrics.

Summary of Results

The screening process revealed that 69% of AI-generated SBAs were fit for inclusion in the examinations with little or no modification required. Modifications, when necessary, were predominantly due to the inclusion of "all of the above" options, usage of American English spellings, and non-alphabetized answer choices. 31% of questions were rejected for inclusion in the examinations, due to factual inaccuracies and non-alignment with students’ learning. When included in an examination, post hoc statistical analysis indicated no significant difference in performance between the AI- and human-authored questions in terms of facility and discrimination index.

Discussion and Conclusion

The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs. However, a robust quality assurance process is necessary to ensure that erroneous questions are identified and rejected.
The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for a more reliable generation of curriculum-aligned questions. LLMs show significant potential in supplementing traditional methods of question generation in medical education. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education.

Figures: Figure 1, Figure 2, Figure 3

Introduction

The practice of active learning has been shown to increase examination performance among learners (1). Falling under this category of study, retrieval practice has been shown to be an effective strategy to enhance meaningful learning (2). More specifically, retrieval practice using single best answer (SBA) questions can greatly improve purposeful learning (3). SBAs are renowned for their objectivity, efficiency, and ability to briefly encompass a wide range of knowledge (4). They cultivate the development of critical thinking skills and the ability to make clinical decisions, not merely memorization (5). Case-based SBAs in particular facilitate the integration of theoretical knowledge with practical implementation, helping with the preparation for professional practice (6). However, in the field of medical education, SBAs available for formative use are scarce due to the challenges involved in producing them.

Best-practice guidelines have been established in the United Kingdom by the Medical Schools Council Assessment Alliance (MSCAA) to help examiners produce SBAs of consistently high quality (7). A high-quality SBA should include a stem, a lead-in and five options for candidates to choose from, comprising one correct answer and four distractors. Guidelines also advise on question tagging (to assist with examination blueprinting) and the use of abbreviations and reference ranges.
For example, the stem, which contains the information needed to answer the question, must be in the present tense and in third-person narration. The lead-in question must avoid negative phrasing, such as “which is the least likely?”, and should be possible to answer without looking at the options. The five options should all be plausible answers, homogeneous in length in relation to each other and presented in alphabetical order. Abbreviation guidelines include advice about chemical compounds and units of time/measurement. The guidelines also contain a detailed style guide relating to, for example, the use of apostrophes, notation of bacteria, capital letters and abbreviations (19). These guidelines have been produced to foster consistency in the style and quality of SBAs across the United Kingdom ahead of the roll-out of a national licensing examination, similar to what already exists in the United States in the United States Medical Licensing Examination (USMLE).

Producing an SBA of appropriate difficulty is also a challenge that examiners face. Questions should be challenging enough to discriminate between those who understand the material well and those who do not; however, they should not be so difficult that they are discouraging or unaligned to the teaching provided (8). To distinguish between higher- and lower-performing students, good SBAs should allow markedly better performance from those who tend to score highly on exams than from those who score poorly (9), a concept known as ‘discrimination’. Moderately difficult items tend to demonstrate good discrimination, while very difficult and very easy SBAs are more likely to show no discrimination or negative discrimination, whereby overall weaker students tend to do better than stronger students (10). Creating SBAs requires medical knowledge, conceptual integration, and avoiding pitfalls (8).
Candidates with good examination technique (also known as “testwise” candidates) can exploit these pitfalls to occasionally answer a question correctly without possessing the underlying knowledge (11). These pitfalls can include mutually exclusive distractors (where two mutually exclusive responses are correct) and the use of absolute terms such as always, never, and all (12). “Irrelevant difficulty” describes questions that are made difficult for reasons unrelated to the aim of the assessment (11). This can arise from negatively phrased, long, and overly complicated questions (11). Additionally, constructing each SBA is very time-consuming for medical educators (13). Even when all of these requirements are met, reusing questions from year to year can threaten the validity, efficacy, and test security of exams because students memorise the answers (14). This disposability further perpetuates the scarcity of SBAs. Although constructing one’s own SBAs can be effective, this exercise is unlikely to be met with enthusiasm because it is unfamiliar and perceived as an inefficient use of time (15).

Unique challenges were introduced during the Covid-19 pandemic in 2020 and affected several subsequent academic years. To ensure the safety of students and staff during examinations in the pandemic, most medical schools opted for online open-book examinations (16). This decision resulted in vast numbers of SBAs essentially entering the public domain, reducing the number of questions available for use in subsequent years.

Recent advancements in generative AI offer potential solutions to the challenges associated with producing large numbers of SBA questions. Generative Pre-Trained Transformer (GPT) is a language model developed by OpenAI that powers ChatGPT, a chatbot app designed to generate human-like text responses based on the information it receives from a user (17).
Because they can interpret and generate human-like text, OpenAI’s large language models (LLMs) offer potential solutions for healthcare education (18). LLMs are trained to predict the next words in a sequence based on the preceding words and their context (19). Therefore, LLMs can generate a novel sequence of words if trained on a sufficiently large amount of text data (19). This model has already passed the United States Medical Licensing Examination (USMLE), so it is reasonable to hypothesize that the LLM could potentially write the exam itself (19). While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself.

Materials & Methods

Question Generation

GPT-4 via ChatGPT, a commercially available AI large language model (LLM), was used to generate 220 single best answer (SBA) questions. A prompt (Textbox 1) was developed which incorporated abridged guidance from the Medical Schools Council Assessment Alliance (MSCAA) Style Guide, which was developed to define best practice in the style and format of SBA questions for the applied knowledge test (AKT) of the General Medical Council’s (GMC) upcoming Medical Licensing Assessment (MLA). Also included in the prompt was an Intended Learning Outcome (ILO) from the case-based learning component of the Scottish graduate entry medical programme (ScotGEM) curriculum, which provided GPT-4 with the necessary context to generate the SBA. This prompt was presented to independent instances of GPT-4 until 220 SBAs had been generated; each SBA was recorded in preparation for quality assurance checks before potential inclusion in an examination.

Textbox 1: The prompt used to generate 220 SBA questions, based on abridged MSCAA guidance and Learning Outcomes that were addressed during the course.
A good single best answer (SBA) question for medical students should have the following components: 1) A stem, which ensures the question is clinically relevant without the use of names for patients, bad practice/errors, setting of care (unless it influences decisions about correct answer), or any extraneous details, 2) A lead-in, which poses a specific question in which students can arrive at the correct answer without seeing the options and avoids negative phrasing or focus around bad practice, and 3) Five potential answers. There are some rules for the five potential answers: only one option should be correct; be relevant to the stem and lead-in; be plausible and realistic; the options must be listed in alphabetical order; neither "all of the above" or "none of the above" should be listed as options; be homogenous in content; there should always be five options. I will provide you with a learning outcome. You will write three good SBA questions for that learning outcome. You will also generate explanations for the correct answers to the questions. [The relevant ILO was inserted here] Quality Assurance The AI-generated SBAs underwent standard quality-assurance screening to ensure compliance with the stipulated guidelines and ILOs according to our standard assessment process. Each SBA was sent to the member of staff responsible for the ILO that was used in its generation. Staff were instructed to assess each question for suitability, alignment and quality. Questions were categorised as either acceptable, modifiable, or rejected. The reason for modification/rejection was recorded and categorised. 
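Several of the rules in the prompt are mechanical (exactly five options, alphabetical order, no "all/none of the above", no negatively phrased lead-in) and could in principle be pre-checked automatically before questions reach staff. The following is a minimal sketch of such a pre-screen; it is not the screening process used in this study, which relied on staff review, and the `SBA` record and `house_style_issues` function are hypothetical names introduced here for illustration only.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SBA:
    """Hypothetical container for one generated question."""
    stem: str
    lead_in: str
    options: List[str]


# Options that the prompt explicitly forbids.
BANNED_OPTIONS = {"all of the above", "none of the above"}

# Crude heuristic for negative phrasing; it may over-flag, so a human
# reviewer should still confirm every hit.
NEGATIVE_LEAD_IN = re.compile(r"\b(?:not|except|least)\b", re.IGNORECASE)


def house_style_issues(question: SBA) -> List[str]:
    """Return a list of mechanical house-style problems (empty if none found)."""
    issues = []
    lowered = [option.strip().lower() for option in question.options]

    if len(lowered) != 5:
        issues.append("expected exactly five options")
    if any(option in BANNED_OPTIONS for option in lowered):
        issues.append('contains "all/none of the above"')
    if lowered != sorted(lowered):
        issues.append("options are not in alphabetical order")
    if NEGATIVE_LEAD_IN.search(question.lead_in):
        issues.append("lead-in is negatively phrased")
    return issues


if __name__ == "__main__":
    example = SBA(
        stem="A 65-year-old patient is admitted to the hospital with an acute confusional state.",
        lead_in="Which of the following is NOT a recommended management option for this patient?",
        options=[
            "Administering antipsychotic medication",
            "Ensuring adequate hydration and nutrition",
            "Providing reality orientation",
            "Using bed alarms",
            "Using physical restraints",
        ],
    )
    print(house_style_issues(example))
```

A check like this can only catch formatting problems; factual accuracy, alignment with the learning outcome and appropriate difficulty still require the judgement of the responsible member of staff.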
Examination

A subset of the questions identified as either acceptable or modifiable (n=50) was selected and used alongside an equal number of human-authored questions (n=50) to construct two formative SBA examinations, each of 50 items (one for Year 1 and one for Year 2 of the ScotGEM programme, with 25 AI-authored questions and 25 human-authored questions each), which were subsequently undertaken by medical students. Both examinations were delivered online via the Speedwell eSystem platform within a set time, following the usual process for the delivery of formative examinations. Students used their own devices to complete the formative examinations and could do so in a location of their choosing within the given time. Students were encouraged to complete the formative examination as a closed-book exercise to better prepare for their in-person summative closed-book examinations and to give a better indication of their learning. The order of the questions, both AI-generated and human-authored, was randomised in both examinations so that neither type was grouped together. After the opportunity to undertake the formative examinations had closed, results and marking keys were released to students on the next working day. A feedback session on the overall performance in the formative examinations was also provided to both year groups in the week that results were released.

Post-Hoc Item Analysis

For each question, facility was calculated. Facility indicates the proportion of student responses that were correct and is therefore occasionally referred to as "difficulty". A value of 0 means that no students answered the question correctly, while a value of 1 means that all students answered the question correctly. Facility was calculated by summing the marks (1 or 0) awarded to each student for the question and dividing by the number of candidates (20). For each question, the discrimination index (DI) was also calculated.
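The two item statistics can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' analysis code: it takes facility as the mean of the 0/1 marks and the discrimination index as the difference in facility between the upper and lower 27% of students ranked by overall score, which is the grouping this section goes on to describe. The function names, the rounding of the group size and the handling of tied total scores (ties keep their input order) are assumptions made here.

```python
from typing import Sequence


def facility(item_marks: Sequence[int]) -> float:
    """Proportion of candidates who answered the item correctly (marks are 1 or 0)."""
    return sum(item_marks) / len(item_marks)


def discrimination_index(
    item_marks: Sequence[int],
    total_scores: Sequence[float],
    fraction: float = 0.27,
) -> float:
    """Facility in the top-scoring group minus facility in the bottom-scoring group.

    item_marks[i] and total_scores[i] must refer to the same student.
    Assumption: the group size is round(n * fraction), with a minimum of 1.
    """
    n = len(item_marks)
    group_size = max(1, round(n * fraction))

    # Rank students by overall examination score, lowest first.
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    lower_group = [item_marks[i] for i in ranked[:group_size]]
    upper_group = [item_marks[i] for i in ranked[-group_size:]]

    return facility(upper_group) - facility(lower_group)


if __name__ == "__main__":
    # Invented data for ten students: marks on one item and overall exam scores.
    marks = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
    totals = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(facility(marks))
    print(discrimination_index(marks, totals))
```

With the invented data above, six of ten students are correct, so facility is 0.6; the three strongest students all answer correctly and one of the three weakest does, giving a positive discrimination index of about 0.67.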
DI was calculated by subtracting the facility of the worst-performing 27% of students from the facility of the best-performing 27% of students. These groups are defined by students' overall examination performance. DI therefore enables assessors to discern whether a given question is effective at separating the best- and worst-performing students. A positive DI means that more of the best-performing students chose the correct answer than of the worst-performing students. A DI of 0 means the best-performing and worst-performing students did equally well (or badly) at that question. A negative DI means that more of the worst-performing students selected the correct answer than those in the best-performing group. Items with a negative DI could indicate a problem with the question, such as a technical error (e.g. an incorrect answer is labelled as the correct answer in the assessment software) or an issue with alignment between the items and the students' learning (20).

The performance of both AI-generated and human-authored questions was evaluated by comparing the facility (F) and DI scores of human- and AI-authored SBAs. A t-test for each measurement (F and DI) was conducted between the AI- and human-authored questions to ascertain whether any significant difference existed between them.

Ethical Considerations

Ethical approval was awarded on 19 Oct 2024 by the School of Medicine Ethics Committee at University of St Andrews (Reference number MD17293). Since this study does not involve patients, a clinical trial number was not applicable. All students received information about the study before attempting the exam, and a consent form was required to be completed. Students’ exam responses were only used in this study if they provided consent. As there was a dependent relationship between the researchers (i.e.
teachers/assessors on the ScotGEM programme) and the students, it was made clear that withholding consent would not disadvantage the student and that they would be able to attempt the exams as normal. Students’ individual responses were not anonymised, so that they could be given feedback after the conclusion of the exam; however, the identities of the students were not presented to the research team for the purposes of the post-hoc analysis. Participants were provided with their individual exam feedback (privately) and the general findings of this study (during a whole-class briefing session). Students’ responses were stored securely on University cloud storage (OneDrive) and were only accessible by the research team.

Results

The total number of participants (i.e. students who undertook the exam and consented to their data being included in this study) was 142, comprising 84 from Year 1 and 58 from Year 2.

Quality Assurance

Of the 220 SBA questions generated by GPT-4, 49 (22.3%) were usable without any amendments whatsoever, 103 (46.8%) required minor modifications to correct issues of style, content or alignment, and 68 (30.9%) were rejected because they were either unsalvageable or would have required prohibitively extensive amendments to enable their inclusion in an examination. The reasons for rejection or modification of a question were categorized into: “beyond student knowledge”, “improper house style”, “not sensible”, and “other”, which included items that were too simple, duplicates, or items that were not aligned to the provided learning outcome. These findings are included in Figure 1.

Beyond Student Knowledge

These questions did not align with student learning. This included information that was either not taught in a lecture, taught in teaching that took place after the exam, or not in the medical school curriculum at all.
A rejected example is the question below, which concerns the respiratory system, a topic that had not been taught at this point:

A 25-year-old man presents to his GP with a fever and a productive cough. A chest x-ray reveals consolidation in the right lower lobe. Which of the following is the most likely immune response to this infection?
Activation of B cells to produce antibodies
Activation of cytotoxic T cells
Activation of natural killer cells
Phagocytosis by neutrophils
Release of interferons by infected cells

House Style

These questions failed to abide by the format required for medical school questions, as outlined by the MSCAA Style Guide (7). Although these guidelines were incorporated into the prompt given to the model, it still occasionally made mistakes. These mistakes included: incorrect wording that does not affect the answer, unnecessary addition of information, an option being “all/none of the above”, a “NOT” question and Americanised spelling. Often, these questions abided by the other guidelines and were, therefore, easily modifiable and eligible for acceptance. A modifiable example includes a “NOT” in the question:

A 65-year-old patient is admitted to the hospital with an acute confusional state. Which of the following is NOT a recommended management option for this patient?
Administering antipsychotic medication
Ensuring adequate hydration and nutrition
Providing reality orientation
Using bed alarms
Using physical restraints

Not Sensible

This category encompasses all the questions that are inherently confusing for the student to answer. This includes questions that do not make sense, are factually incorrect, have multiple correct/similar options, do not pass the “cover test” (the ability to arrive at the correct answer without looking at the options), have incorrect answers, contain incorrect terminology that does affect the answer, are extremely vague, or have missing/incorrect crucial information (e.g.
reference ranges). Some of these questions were modified to ensure guideline adherence, while others were rejected entirely. A modifiable example is the question below, which lacks the reference ranges for PaCO2 and pH that are both required to answer it correctly:

A 60-year-old woman with a history of chronic obstructive pulmonary disease (COPD) presents to her primary care physician with worsening shortness of breath. Her arterial blood gas shows a PaCO2 of 60 mmHg and a pH of 7.30. What is the primary mechanism by which CO2 is transported in the blood?
As bicarbonate ions
As carbamino compounds
Bound to albumin
Bound to haemoglobin
Dissolved in plasma

Other

Too Simple: A few of the questions produced, although correct and fully adherent to the guidelines, were too simple for medical school level. This category also includes questions in which the correct answer was mentioned somewhere in the question. A rejected example includes a question that is too easy:

A 45-year-old female presents to her GP with fatigue, pallor, and shortness of breath. Her blood tests show a low haemoglobin level. What is the most likely diagnosis?
Anaemia
Asthma
Chronic obstructive pulmonary disease
Pneumonia
Tuberculosis

Repeat: This category covers questions that are so similar they are essentially repeats. Each learning outcome was meant to produce three different questions; in these cases, one learning outcome produced two or more very similar (almost identical) questions.

Does not align to the LO: A few questions did not test the LO given. A rejected example is the question below, which asked for the specific first-line treatment for type 2 diabetes although the learning outcome was “Be aware of UK medicine legislation and principles of safe, effective, and sustainable prescribing”. This question was also incorrect, as there is no UK medicines legislation that determines the first-line choice of treatments:

A 45-year-old man with a history of hypertension presents to his GP with a new diagnosis of type 2 diabetes mellitus.
According to UK medicine legislation, which of the following is the most appropriate first-line treatment for this patient?
Gliclazide
Glimepiride
Metformin
Pioglitazone
Sitagliptin

Post-hoc Item Analysis

Facility

There was no statistically significant difference in facility between AI- and human-authored questions (p = 0.176). However, descriptive statistics suggest that students found the AI-authored questions easier than human-authored ones.

Discrimination Index

There was no statistically significant difference in discrimination index between AI- and human-authored questions (p = 0.175). However, because facility was slightly higher in AI-authored questions (0.70 vs 0.64), they were less discriminating.

Discussion

The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs, showing significant potential in supplementing traditional methods of question generation in medical education. While 69% of questions were usable with no or minor modification, 31% of questions were not suitable for inclusion; these findings highlight the necessity of a systematic quality assurance process to ensure only high-quality items proceed into students’ examinations. Issues primarily relate to formatting/style, absent constructive alignment and inappropriate level of difficulty. When quality-assured AI-authored questions are used in examinations, descriptive statistics suggest that AI-generated questions are slightly easier and less discriminating than human-authored questions, although not to a statistically significant degree.

Although there is a paucity of literature in this emerging area, the findings of this study broadly align with early reports elsewhere in the literature. Although there is broad agreement that models can generate questions that are often indistinguishable from human-written ones (21–25), there is also trepidation regarding the quality of the AI-generated questions.
Although this study did not detect a statistical difference in discrimination index between AI- and human-generated questions, other reports in the literature suggest that such a difference does exist, in that AI-generated questions may have lower discriminatory power than human-written questions (23, 26). In addition to concerns around quality, there are also emerging reports in the literature regarding outdated terminology, age- and gender-specific inaccuracies, and geographical insensitivities being detected in AI-generated examination questions (22). Similar issues relating to representation have also been detected when creating other types of content involving patients or clinical scenarios (27–29), and also when using generative AI to assist practitioners with clinical reasoning (30).

These findings suggest that complete replacement of human-authored questions is not feasible. However, there is considerable potential for the use of this technology to assist humans. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education. Even when AI-generated questions do not satisfy the high standards demanded by universities and regulators, they can still serve to inspire new ideas for human authors. Part of the process of producing questions is the creative work of devising a stem, a question, and five options. Even if a question is entirely rejected and re-written (not only modified), the initial ideas can be of great help. In this way, the AI can essentially aid in solving writer’s block.

Due to the infancy and fast-moving capabilities of generative AI tools, there are some limitations associated with this study that could be overcome as the technology develops. Possible approaches to refining our method include using more specific Learning Outcomes when inputting our prompt into the LLM.
A common complaint of students is the vagueness of the LOs, which can complicate determining which facts are important to focus on. Greater specificity could be beneficial in a course such as medicine, where the volume of content is extremely large. An alternative, or addition, to this could be to provide the LLM with actual teaching materials or lecture recordings. This could produce questions that are better aligned with students’ learning. In terms of the LLM used, the rapid advancement of AI may produce more sophisticated models that could be used instead. Additionally, we could use our own question banks to train our own model. The simplicity of a specialized model could be used for scaling up the use of this technology. This technology could also offer adaptive difficulty, where questions are adjusted in difficulty based on the student’s performance, ensuring appropriate levels of challenge. With sufficient trial and error, a fully trained model could be released to the public for student and teacher use.

While this study focused on the development of SBAs, other forms of assessment used in medical teaching could also be evaluated. These include Short Written Answers (SWAs), Very Short Answer Questions (VSAQs), and Objective Structured Clinical Examinations (OSCEs). While these, and the SBAs, can be used in the production of formative questions, there is also a possibility that this technology could be used in summative assessment. Focus groups of both students and staff could help to highlight the direction this research should take. Exploring what students and staff thought of the study could reveal where the findings should be applied.

Conclusion

The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs.
However, the necessity of a quality assurance process to fine-tune formatting and curriculum alignment is evident. When quality-assured AI-authored questions are used in exams, they do not perform any differently to human-authored questions. The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for a more reliable generation of curriculum-aligned questions.

Declarations

Author Contribution

AO coordinated the project and wrote the manuscript. AA generated the questions. EK coordinated the data collection.

References

1. Freeman S, Eddy SL, McDonough M, Smith MK, Okoroafor N, Jordt H, et al. Active learning increases student performance in science, engineering, and mathematics. Proc Natl Acad Sci U S A. 2014;111(23):8410-5.
2. Karpicke JD, Blunt JR. Retrieval practice produces more learning than elaborative studying with concept mapping. Science. 2011;331(6018):772-5.
3. Smith MA, Karpicke JD. Retrieval practice with short-answer, multiple-choice, and hybrid tests. Memory. 2014;22(7):784-802.
4. Mujeeb AM, Pardeshi ML, Ghongane BB. Comparative assessment of multiple choice questions versus short essay questions in pharmacology examinations. Indian J Med Sci. 2010;64(3):118-24.
5. Bassett MH. Teaching Critical Thinking without (Much) Writing: Multiple‐Choice and Metacognition. Teaching Theology & Religion. 2016;19(1):20-40.
6. Khan MU, Aljarallah BM. Evaluation of Modified Essay Questions (MEQ) and Multiple Choice Questions (MCQ) as a tool for Assessing the Cognitive Skills of Undergraduate Medical Students. Int J Health Sci (Qassim). 2011;5(1):39-43.
7. Medical Schools Council Assessment Alliance. Medical Schools Council Applied Knowledge Test Style Guide. July 2022.
8. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354.
9. Kumar D, Jaipurkar R, Shekhar A, Sikri G, Srinivas V. Item analysis of multiple choice questions: A quality assurance test for an assessment tool. Med J Armed Forces India. 2021;77(Suppl 1):S85-S89.
10. Sim SM, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false-type multiple choice questions of a para-clinical multidisciplinary paper. Ann Acad Med Singap. 2006;35(2):67-71.
11. Coughlin PA, Featherstone CR. How to Write a High Quality Multiple Choice Question (MCQ): A Guide for Clinicians. Eur J Vasc Endovasc Surg. 2017;54(5):654-8.
12. Rush BR, Rankin DC, White BJ. The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. BMC Med Educ. 2016;16(1):250.
13. Gilardi F, Alizadeh M, Kubli M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc Natl Acad Sci U S A. 2023;120(30):e2305016120.
14. Leo J, Kurdi G, Matentzoglu N, Parsia B, Sattler U, Forge S, et al. Ontology-Based Generation of Medical, Multi-term MCQs. International Journal of Artificial Intelligence in Education. 2019;29(2):145-88.
15. Palmer E, Devitt P. Constructing multiple choice questions as a method for learning. Ann Acad Med Singap. 2006;35(9):604-8.
16. Monaghan AM. Medical Teaching and Assessment in the Era of COVID-19. Journal of Medical Education and Curricular Development. 2020;7:238212052096525.
17. Plevris V, Papazafeiropoulos G, Jiménez Rios A. Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI. 2023;4(4):949-69.
18. Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492.
19. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepano C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
20. Kelley TL. The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology. 1939;30(1):17-24.
21. Bedi S, Fleming SL, Chiang C-C, Morse K, Kumar A, Patel B, et al. QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams. medRxiv. 2024.
22. Klang E, Portugez S, Gross R, Kassif Lerner R, Brenner A, Gilboa M, et al. Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ. 2023;23(1):772.
23. Laupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Acad Med. 2024;99(5):508-12.
24. Zuckerman M, Flood R, Tan RJB, Kelp N, Ecker DJ, Menke J, et al. ChatGPT for assessment writing. Med Teach. 2023;45(11):1224-7.
25. Kiyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858-65.
26. Coskun O, Kiyak YS, Budakoglu II. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Med Teach. 2024:1-7.
27. O'Malley A, Veenhuizen M, Ahmed A. Ensuring Appropriate Representation in Artificial Intelligence-Generated Medical Imagery: Protocol for a Methodological Approach to Address Skin Tone Bias. JMIR AI. 2024;3:e58275.
28. Fan BE, Chow M, Winkler S. Artificial Intelligence-Generated Facial Images for Medical Education. Medical Science Educator. 2023.
29. Ali R, Tang OY, Connolly ID, Abdulrazeq HF, Mirza FN, Lim RK, et al. Demographic Representation in 3 Leading Artificial Intelligence Text-to-Image Generators. JAMA Surgery. 2024;159(1):87-95.
30. M'Gadzah SAT, O'Malley A. Does a complex prompt alter the diagnostic accuracy of common ophthalmological conditions by GPT-4? Journal of Medical Internet Research. 2024.
31. Medical Schools Council (2022). Medical Schools Applied Knowledge Test Style Guide. Version 2.
Additional Declarations No competing interests reported. Cite Share Download PDF Status: Published Journal Publication published 25 Feb, 2025 Read the published version in BMC Medical Education → Version 1 posted Editorial decision: Revision requested 26 Dec, 2024 Editor assigned by journal 22 Dec, 2024 Submission checks completed at journal 22 Dec, 2024 First submitted to journal 18 Dec, 2024 You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5666975","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":394494115,"identity":"c22fb525-0044-4e4a-bfd6-01537b03fe41","order_by":0,"name":"Ayla Ahmed","email":"","orcid":"","institution":"University of St Andrews","correspondingAuthor":false,"prefix":"","firstName":"Ayla","middleName":"","lastName":"Ahmed","suffix":""},{"id":394494116,"identity":"6a5191e9-0330-4ff4-b146-36db80981251","order_by":1,"name":"Ellen Kerr","email":"","orcid":"","institution":"University of St 
Andrews","correspondingAuthor":false,"prefix":"","firstName":"Ellen","middleName":"","lastName":"Kerr","suffix":""},{"id":394494117,"identity":"c9cdaffc-96f2-409a-9d4f-3681cfd9758f","order_by":2,"name":"Andrew O'Malley","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA4klEQVRIiWNgGAWjYHACxgcfIAwDhFADfi3MhjMgqonXwibNQ5IW/v7DB6Rt/vxh4J/dvO3Dx7Y6Bv72A2ySM/BokbiRlmCc22bAIHHnWPHMmW2HGSTOJLBJbsDnrBs8Bsm5DQb1DTdyjJl5tx0AijCwST7Ao0P+/BmDwxZ/DBjkQVr+bqsDMghoMTiQY9jMwGbAYADSwriNGchgwO8wwxtpyYy9bcYgRjFj77/DPIZnEpst8Xlf7vzh4z9+/JFjkLuRvJnhx5k6Obnjhw/e7MGjBQPwEI7IUTAKRsEoGAUEAQBCgktTrrNcQQAAAABJRU5ErkJggg==","orcid":"","institution":"University of St Andrews","correspondingAuthor":true,"prefix":"","firstName":"Andrew","middleName":"","lastName":"O'Malley","suffix":""}],"badges":[],"createdAt":"2024-12-18 07:23:21","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5666975/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5666975/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s12909-025-06881-w","type":"published","date":"2025-02-25T15:57:05+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":75489260,"identity":"f8421dca-6eb5-4ec3-b409-de728e25ab12","added_by":"auto","created_at":"2025-02-05 06:54:12","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":140345,"visible":true,"origin":"","legend":"Outcomes of the quality assurance assessment of the 220 AI-generated SBA questions.","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-5666975/v1/c811928f5a7f19efc9e97a1b.jpeg"},{"id":75489263,"identity":"8cd81a56-f934-48cf-918b-4d1b021eeae6","added_by":"auto","created_at":"2025-02-05 06:54:13","extension":"jpg","order_by":2,"title":"Figure 
2","display":"","copyAsset":false,"role":"figure","size":28340,"visible":true,"origin":"","legend":"\u003cp\u003eUnnumbered image in the Result section.\u003c/p\u003e","description":"","filename":"unfig1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5666975/v1/695f213814f9d82e5abdd3d9.jpg"},{"id":75489264,"identity":"b5605f5b-3d95-4cd2-8b3a-79daad002e05","added_by":"auto","created_at":"2025-02-05 06:54:14","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":29561,"visible":true,"origin":"","legend":"\u003cp\u003eUnnumbered image in the Result section.\u003c/p\u003e","description":"","filename":"unfig2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5666975/v1/a717015ce3a10216e62947fd.jpg"},{"id":77622875,"identity":"e320ffd3-edff-46a6-8876-1d8332c9eeaa","added_by":"auto","created_at":"2025-03-03 16:10:44","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":657458,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5666975/v1/79e7d0cf-6e7a-47fe-90a8-ba376daeb3a0.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Quality assurance and validity of AI-generated Single Best Answer questions","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe practice of active learning has been shown to increase examination performance among learners (1). Falling under this category of study, retrieval practice has been proven to be an effective strategy to enhance meaningful learning (2). More specifically, retrieval practice using single best answer (SBA) questions can greatly improve purposeful learning (3). SBAs are renowned for their objectivity, efficiency, and ability to briefly encompass a wide range of knowledge (4). 
They cultivate the development of critical thinking skills and the ability to make clinical decisions, not merely memorization (5). Particularly with the use of case-based SBAs these questions facilitate the integration of theoretical knowledge with practical implementation, helping with the preparation for professional practice (6). However, in the field of medical education, the quantity of SBAs available for formative use are scarce due to the challenges involved in producing them.\u003c/p\u003e\n\u003cp\u003eBest-practice guidelines have been established in the United Kingdom by the Medical Schools Councill Assessment Alliance (MSCAA) to help examiners to produce SBAs that are of consistent high-quality (7). A high-quality SBA should include a stem, lead-in and five options for candidates to choose from, comprising one correct answer and four distractors. Guidelines also advise on question tagging (to assist with examination blueprinting) and the use of abbreviations and reference ranges. For example, the stem—which contains the information needed to answer the question—must be in present tense and in third person narration. The lead-in question must avoid negative phrasing—such as “which is the least likely?” —and should be possible to answer without looking at the options. The five options should all be plausible answers, homogenous in length in relation to each other and presented in alphabetical order. Abbreviation guidelines include advise about chemical compounds and units of time/measurement. The guidelines also contain a detailed style guide relating to, for example, the use of apostrophes, notation of bacteria, capital letters, abbreviations etcetera (19). 
These guidelines have been produced to foster consistency in the style and quality of SBAs across the United Kingdom ahead of the roll-out of national licencing examination, similar to what already exists in the United States in the Medical Licencing Examination (USMLE).\u003c/p\u003e\n\u003cp\u003eProducing an SBA of appropriate difficulty is also a challenge that examiners face. Questions should be challenging enough to discriminate between those who understand the material well and those who do not, however they should not be so difficult that they are discouraging or unaligned to the teaching provided (8). To distinguish between higher and lower performing students, good SBAs should allow markedly better performance from those who tend score highly on exams than those who score poorly (9), in a concept known as ‘discrimination’. Moderately difficult items tend to demonstrate good discrimination, while very difficult and very easy SBA are more likely to show no discrimination or negative discrimination, whereby overall weaker students tend to do better than stronger students (10).\u003c/p\u003e\n\u003cp\u003eCreating SBAs requires medical knowledge, conceptual integration, and avoiding pitfalls (8). Pitfalls can be identified by candidates with good examination technique (also known as “testwise” candidates) to occasionally correctly answer a question without possessing the underlying knowledge (11). These pitfalls can include mutually exclusive distractors—where two mutually exclusive responses are correct—and the use of absolute terms such as: always, never, and all (12). “Irrelevant difficulty” describes questions that are made difficult for reasons that are unrelated to the aim of the assessment (11). This can arise from negatively phrased, long, and overly complicated questions (11). Additionally, the process of constructing every SBA is a very time-consuming for medical educators (13). 
Even with all of these factors being met, once a question is created—due to answer memorisation—reusing questions from year to year can threaten the validity, efficacy, and test security of exams (14). This disposability further perpetuates the scarcity of SBAs. Although constructing ones owns SBAs can be effective, this exercise is unlikely to be met with enthusiasm due to it being unfamiliar and a perceived inefficient use of time (15).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eUnique challenges were introduced during the Covid-19 pandemic in 2020 and affected several subsequent academic years. In order to ensure the safety of students and staff during examinations during the pandemic, most medical schools opted for online open-book examinations (16). This decision resulted in vast numbers of SBAs essentially entering the public domain and reducing the number of questions available for use in subsequent years.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eRecent advancements in generative AI offer potential solutions to the challenges associated with producing large numbers of SBA questions. Generative Pre-Trained Transformer (GPT) is a language model developed by OpenAI that powers ChatGPT, a chatbot app, which is designed to generate human-like text responses based on the information it receives from a user (17). Due to their human-like text understanding and generation, OpenAI’s large language models (LLM’s) offer potential solutions to healthcare education (18). LLMs are trained to predict a sequence of words based on the words, and their context, that come before them (19). Therefore, LLMs can generate a novel sequence of words if trained on a sufficiently large amount of text data (19). So far, this model has already been able to successfully pass the United States Medical Licensing Examination (USMLE), so it is reasonable to hypothesize that the LLM could potentially write the exam itself (19). 
While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself.\u0026nbsp;\u003c/p\u003e"},{"header":"Materials \u0026 Methods","content":"\u003ch2\u003eQuestion Generation\u003c/h2\u003e\n\u003cp\u003eGPT-4 via ChatGPT, a commercially available AI large language model (LLM), was used to generate 220 single best answer (SBA) questions. A prompt (Textbox 1) was developed which incorporated abridged guidance from the Medical Schools Council Assessment Alliance (MSCAA) Style Guide, which was developed to define best practice in the style and format of single best answer (SBA) questions for the applied knowledge test (AKT) of the General Medical Council’s (GMC) upcoming Medical Licencing Examination (MLA). Also included in the prompt was an Intended Learning Outcome (ILO) from the case-based learning component of the Scottish graduate entry medical programme (ScotGEM) curriculum, which provided GPT 4 with the necessary context to generate the SBA.\u0026nbsp;This prompt was presented to independent instances of GPT-4 until two hundred SBAs had been generated; each SBA was recorded in preparation for quality assurance checks before potential inclusion in an examination.\u003c/p\u003e\n\u003cp\u003eTextbox 1: The prompt used to generate 220 SBA questions, based on abridged MSCAA guidance and Learning Outcomes that were addressed during the course.\u003c/p\u003e\n\u003cp\u003eA good single best answer (SBA) question for medical\u0026nbsp;students should have the following components: 1) A\u0026nbsp;stem, which ensures the question is clinically relevant\u0026nbsp;without the use of names for patients, bad\u0026nbsp;practice/errors, setting of care (unless it influences\u0026nbsp;decisions about correct answer), or any extraneous\u0026nbsp;details, 2) A lead-in, which poses a specific question in\u0026nbsp;which students can arrive at the 
correct answer without\u0026nbsp;seeing the options and avoids negative phrasing or focus\u0026nbsp;around bad practice, and 3) Five potential answers.\u003c/p\u003e\n\u003cp\u003eThere are some rules for the five potential answers: only\u0026nbsp;one option should be correct; be relevant to the stem\u0026nbsp;and lead-in; be plausible and realistic; the options must be\u0026nbsp;listed in alphabetical order; neither \"all of the above\" or\u0026nbsp;\"none of the above\" should be listed as options; be\u0026nbsp;homogenous in content; there should always be five\u0026nbsp;options.\u003c/p\u003e\n\u003cp\u003eI will provide you with a learning outcome. You will write\u0026nbsp;three good SBA questions for that learning outcome.\u0026nbsp;You\u0026nbsp;will also generate explanations for the correct answers to\u0026nbsp;the questions.\u003c/p\u003e\n\u003cp\u003e[The relevant ILO was inserted here]\u003c/p\u003e\n\u003ch2\u003eQuality Assurance\u003c/h2\u003e\n\u003cp\u003eThe AI-generated SBAs underwent standard quality-assurance screening to ensure compliance with the stipulated guidelines and ILOs according to our standard assessment process. Each SBA was sent to the member of\u0026nbsp;staff responsible for the ILO that was used in its\u0026nbsp;generation. Staff were instructed to assess each question for suitability, alignment and quality. Questions were categorised as either\u0026nbsp;acceptable, modifiable, or rejected. 
The reason for modification/rejection was recorded and categorised.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003eExamination\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eA subset of questions that were identified as either acceptable or modifiable were selected (n=50) and used alongside an equal number of human-authored questions (n=50) to construct two formative SBA examinations each of 50 items (one for Year 1 and one for Year 2 of the ScotGEM programme, with 25 AI-authored questions and 25 human-authored questions each), which were subsequently undertaken by medical students.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eBoth examinations were delivered online via the Speedwell eSystem platform within a set time, following the usual process for the delivery of formative examinations. Students used their own devices to complete the formative examinations and could do so in a location of their choosing within the given time. Students were encouraged to complete the formative examination as a closed book exercise to better prepare for their in-person summative closed book examinations and give a better indication of their learning. The order of the questions, both AI generated and human-authored, were randomised in both examinations so that neither were grouped together. After the opportunity to undertake the formative examinations had closed, results and marking keys were released to students on the next working day. A feedback session on the overall performance in the formative examinations was also provided to both year groups in the week that results were released.\u003c/p\u003e\n\u003ch2\u003ePost-Hoc Item Analysis\u003c/h2\u003e\n\u003cp\u003eFor each question facility was\u0026nbsp;calculated. Facility indicates\u0026nbsp;the\u0026nbsp;proportion of student responses that were correct and is therefore occasionally referred to as \"difficulty\". 
A value of 0 means that no students answered the question correctly, while a value of 1 means that all students answered the question correctly. This was done by taking the sum of the actual marks (1 or 0) for each student and dividing this by the number of candidates. (20).\u003c/p\u003e\n\u003cp\u003eFor each question discrimination index (DI) was also calculated. This was done by subtracting the facility score calculated from the worst-performing 27% of students from the facility score of the best-performing 27% of students. These groups are categorized based on students' overall examination performance.\u0026nbsp;DI therefore enables assessors to discern whether a given question is effective at separating out the best- and worst- performing students.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA\u0026nbsp;positive\u0026nbsp;DI means that more of the best-performing students chose the correct answer than in the worst-performing students. A\u0026nbsp;DI of\u0026nbsp;0means the best-performing and worst-performing students did equally well (or badly) at that question. A negative DI\u0026nbsp;means that more of the worst-performing students selected the correct answer than those in the best-performing group. Items with a negative DI could indicate a problem with the question, such as a technical error (e.g. 
an incorrect answer is labelled as the correct answer in the assessment software) or an issue with alignment between the items and the students' learning (20).\u003c/p\u003e\n\u003cp\u003eThe performance of both AI-generated and human-authored questions was evaluated by comparing the F and DI scores of human- and AI-authored SBAs, and t-tests for each measurement (F and DI) was conducted between the AI- vs human- authored questions to ascertain if any significant difference existed between the questions.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003eEthical Considerations\u003c/h2\u003e\n\u003cp\u003eEthical approval was awarded on 19 Oct 2024 by the School of Medicine Ethics Committee at University of St Andrews (Reference number MD17293). Since this study does not involve patients a clinical trial number was not applicable. All students received information about the study before attempting the exam, and a consent form was required to be completed. Students’ exam responses were only used in this study if they provided consent. As there was a dependent relationship between the researcher (i.e. teachers/assessors on the ScotGEM programme) and the students, it was made clear that withholding consent would not disadvantage the student and that they would be able to attempt the exams as normal.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eStudents’ individual responses were not anonymised to provide them with feedback after the conclusion of the exam; however, the identities of the students were not presented to the research team for the purposes of the post-hoc analysis. Participants were provided with their induvial exam feedback (privately) and the general findings of this study (during a whole-class briefing session). Students’ responses and were stored securely on University cloud storage (OneDrive) and only accessible by the research team.\u0026nbsp;\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eThe total number of participants (i.e. 
students who undertook the exam and consented to their data being included in this study) was 142, comprising 84 from Year 1 and 58 from Year 2.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003eQuality Assurance\u003c/h2\u003e\n\u003cp\u003eOf the 220 SBA questions generated by GPT-4, 49 (22.2%) were usable without any amendments whatsoever, 103 (46.8%) required minor modifications to correct issues of style, content or alignment, and 68 (30.9%) were rejected because they were either unsalvageable or would have required prohibitively extensive amendments to enable their inclusion in an examination. The reasons for rejection or modification of a question were categorized into: \u0026ldquo;beyond student knowledge\u0026rdquo;, \u0026ldquo;improper house style\u0026rdquo;, \u0026ldquo;not sensible\u0026rdquo;, and \u0026ldquo;other\u0026rdquo;, which included items that were too simple, duplicates, or not items that were not aligned to the provided learning outcome. These findings are included in Figure 1.\u003c/p\u003e\n\u003ch3\u003eBeyond Student Knowledge\u003c/h3\u003e\n\u003cp\u003eThese questions did not align with student learning. This included information that was either not taught in a lecture, taught in teaching that took place after the exam, or not in the medical school curriculum at all. A rejected example includes the question below that mentions the respiratory system, which had not been taught at this point:\u003c/p\u003e\n\u003cp\u003eA 25-year-old man presents to his GP with a fever and a productive cough. 
A chest x-ray reveals consolidation in the right lower lobe.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhich of the following is the most likely immune response to this infection?\u0026nbsp;\u003c/p\u003e\n\u003col style=\"list-style-type: upper-alpha;\"\u003e\n \u003cli\u003eActivation of B cells to produce antibodies \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eActivation of cytotoxic T cells \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eActivation of natural killer cells \u0026nbsp;\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003ePhagocytosis by neutrophils\u003c/strong\u003e\u003c/li\u003e\n \u003cli\u003eRelease of interferons by infected cells\u0026nbsp;\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch3\u003eHouse Style\u003c/h3\u003e\n\u003cp\u003eThese questions involved failure to abide to the format required for medical school questions as outlined by the MSCAA Style Guide(7). Although these guidelines were incorporated into the prompt inputted into the AI, occasionally mistakes were still made by the model. These mistakes included: incorrect wording that does not affect the answer, unnecessary addition of information, an option being \u0026ldquo;all/none of the above\u0026rdquo;, a \u0026ldquo;NOT\u0026rdquo; question and Americanised spelling. Often, these questions abided to the other guidelines and were, therefore, easily modifiable and eligible for acceptance. A modifiable example includes a \u0026ldquo;NOT\u0026rdquo; in the question:\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA 65-year-old patient is admitted to the hospital with an acute confusional state.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhich of the following is NOT a recommended management option for this patient? 
\u0026nbsp;\u003c/p\u003e\n\u003col style=\"list-style-type: upper-alpha;\"\u003e\n \u003cli\u003eAdministering antipsychotic medication\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eEnsuring adequate hydration and nutrition\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eProviding reality orientation\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eUsing bed alarms\u0026nbsp;\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eUsing physical restraints\u003c/strong\u003e\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch3\u003eNot Sensible\u003c/h3\u003e\n\u003cp\u003eThis category encompasses all the questions that are inherently confusing for the student to answer. This includes questions that do not make sense, are factually incorrect, have multiple correct/similar options, do not pass the \u0026ldquo;cover test\u0026rdquo; which defines the ability to arrive at the correct answer without looking at the options, have incorrect answers, contain incorrect terminology that does affect the answer, are extremely vague, or have missing/incorrect crucial information (e.g. reference ranges). Sometimes questions were modified to ensure guideline adherence, but some were also rejected entirely. A modifiable example includes the lack of reference ranges for PaCO2 and pH, both required to correctly answer the question:\u003c/p\u003e\n\u003cp\u003eA 60-year-old woman with a history of chronic obstructive pulmonary disease (COPD) presents to her primary care physician with worsening shortness of breath. Her arterial blood gas shows a PaCO2 of 60 mmHg and a pH of 7.30.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhat is the primary mechanism by which CO2 is transported in the blood? 
\u0026nbsp;\u003c/p\u003e\n\u003col style=\"list-style-type: upper-alpha;\"\u003e\n \u003cli\u003e\u003cstrong\u003eAs bicarbonate ions\u0026nbsp;\u003c/strong\u003e\u003c/li\u003e\n \u003cli\u003eAs carbamino compounds\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eBound to albumin \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eBound to haemoglobin \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eDissolved in plasma \u0026nbsp;\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eOther\u003c/p\u003e\n\u003cp\u003eToo Simple: A few questions produced\u0026mdash;although correct and fully adhered to the guidelines\u0026mdash;were too simple for the medical school level. This also entails questions in which the correct answer was mentioned somewhere in the question. A rejected example includes a question that is too easy:\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA 45-year-old female presents to her GP with fatigue, pallor, and shortness of breath. Her blood tests show a low haemoglobin level.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhat is the most likely diagnosis?\u0026nbsp;\u003c/p\u003e\n\u003col style=\"list-style-type: upper-alpha;\"\u003e\n \u003cli\u003e\u003cstrong\u003eAnaemia\u0026nbsp;\u003c/strong\u003e\u003c/li\u003e\n \u003cli\u003eAsthma \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eChronic obstructive pulmonary disease \u0026nbsp;\u003c/li\u003e\n \u003cli\u003ePneumonia \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eTuberculosis\u0026nbsp;\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eRepeat: This category entails questions that are so similar they are essentially repeats. One learning outcome is meant to produce 3 different questions. In these questions, one learning outcome produces 2 or more very similar\u0026mdash;borderline exact\u0026mdash;questions.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eDoes not align to the LO: A few questions did not test the LO given. 
A rejected example is this question that asked for specific first-line treatment for Type 2 diabetes although the learning outcome was \u0026ldquo;Be aware of UK medicine legislation and principles of safe, effective, and sustainable prescribing\u0026rdquo;. This question was also incorrect as there is no UK Medicines Legislation that determines first line choice of treatments:\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA 45-year-old man with a history of hypertension presents to his GP with a new diagnosis of type 2 diabetes mellitus. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAccording to UK medicine legislation, which of the following is the most appropriate first-line treatment for this patient? \u0026nbsp;\u003c/p\u003e\n\u003col style=\"list-style-type: upper-alpha;\"\u003e\n \u003cli\u003eGliclazide \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eGlimepiride \u0026nbsp;\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eMetformin\u0026nbsp;\u003c/strong\u003e\u003c/li\u003e\n \u003cli\u003ePioglitazone \u0026nbsp;\u003c/li\u003e\n \u003cli\u003eSitagliptin\u0026nbsp;\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch2\u003ePost-hoc Item Analysis\u0026nbsp;\u003c/h2\u003e\n\u003ch3\u003eFacility\u003c/h3\u003e\n\u003cp\u003eThere was no statistically significant difference in facility between AI- and human-authored questions (p = 0.176). However, descriptive statistics suggest that students found the AI-authored questions easier than human-authored ones.\u003c/p\u003e\n\u003ch3\u003eDiscrimination Index\u003c/h3\u003e\n\u003cp\u003eThere was no statistically significant difference in discrimination index between AI- and human-authored questions (p = 0.175). 
However, because facility was slightly higher in AI-authored questions (0.70 vs 0.64), they were less discriminating.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003e The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs, showing significant potential in supplementing traditional methods of question generation in medical education. While 69% of questions were usable with no or minor modification, 31% of questions were not suitable for inclusion; these findings highlight the necessity of a systematic quality assurance process to ensure only high-quality items proceed into students\u0026rsquo; examinations. Issues primarily relate to formatting/style, absent constructive alignment and inappropriate level of difficulty. When quality-assured AI-authored questions are used in examinations, descriptive statistics suggest that AI-generated questions are slightly easier and less discriminating that human-authored questions, although not to a statistically significant degree.\u003c/p\u003e \u003cp\u003eAlthough there is a paucity of literature in this emerging area, the findings of this study broadly align with early reports elsewhere in the literature. There is broad agreement that models can generate questions that are often indistinguishable from human-written ones (\u003cspan additionalcitationids=\"CR22 CR23 CR24\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e), there is also trepidation regarding the quality of the AI-generated questions. 
Although the this study did not detect a statistical difference in discrimination index between AI- and human- generated questions, other reports in the literature suggest this difference does exist in that AI-generated questions may have lower discriminatory power compared to human-written questions (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn addition to concerns around quality, there are also emerging reports in the literature regarding outdated terminology, age- and gender- specific inaccuracies, and geographically insensitivities being detected in AI-generated examination questions (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). Similar issues relating to representation have also been detected when creating other types of content involving patients or clinical scenarios (\u003cspan additionalcitationids=\"CR28\" citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e), and also when using generative AI to assist practitioners with clinical reasoning (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThese findings suggest that complete replacement of human-authored questions is not feasible. However, there is considerable potential for the use of this technology to assist humans. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education. Even when AI-generated questions do not satisfy the high-standards demanded by Universities and regulators, they can still serve to inspire new ideas for human authors. 
Part of the process of producing questions is the creative task of devising a stem, a question, and five options. Even if a question is entirely rejected and re-written, rather than merely modified, the initial ideas can be of great help. In this way, the AI can help to overcome writer\u0026rsquo;s block.\u003c/p\u003e \u003cp\u003eBecause generative AI tools are in their infancy and their capabilities are developing rapidly, this study has some limitations that could be overcome as the technology matures. One possible refinement to our method is to include more specific Learning Outcomes in the prompt given to the LLM. A common complaint of students is the vagueness of the LOs, which can make it difficult to determine which facts are important to focus on. Greater specificity could be beneficial in a course such as medicine, where the volume of content is extremely large. An alternative, or addition, to this could be to provide the LLM with actual teaching materials or lecture recordings. This could produce questions that are better aligned with students\u0026rsquo; learning.\u003c/p\u003e \u003cp\u003eIn terms of the LLM used, the rapid advancement of AI could produce more sophisticated models that could be used in place of the one used here. Additionally, we could use our own question banks to train our own model. The simplicity of a specialized model could help in scaling up the use of this technology. This technology could also offer adaptive difficulty, where questions are adjusted in difficulty based on the student\u0026rsquo;s performance, ensuring appropriate levels of challenge. With sufficient trial and error, a fully trained model could be released to the public for student and teacher use.\u003c/p\u003e \u003cp\u003eWhile this study focused on the development of SBAs, other forms of assessment used in medical teaching can be evaluated. 
These include Short Written Answers (SWAs), Very Short Answer Questions (VSAQs), and Objective Structured Clinical Examinations (OSCEs). While these, like SBAs, can be used to produce formative questions, there is also a possibility that this technology could be used in summative assessment.\u003c/p\u003e \u003cp\u003eFocus groups of both students and staff could help to identify the direction that future research should take. Exploring the perspectives of students and staff on this study could reveal where its findings would be best applied.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThe outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs. However, the necessity of a quality assurance process to fine-tune formatting and curriculum alignment is evident. When quality-assured AI-authored questions are used in exams, they do not perform significantly differently from human-authored questions. The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for a more reliable generation of curriculum-aligned questions.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAO coordinated the project and wrote the manuscript. AA generated the questions. EK coordinated the data collection.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eFreeman S, Eddy SL, McDonough M, Smith MK, Okoroafor N, Jordt H, et al. Active learning increases student performance in science, engineering, and mathematics. Proc Natl Acad Sci U S A. 2014;111(23):8410-5.\u003c/li\u003e\n\u003cli\u003eKarpicke JD, Blunt JR. Retrieval practice produces more learning than elaborative studying with concept mapping. Science. 2011;331(6018):772-5.\u003c/li\u003e\n\u003cli\u003eSmith MA, Karpicke JD. 
Retrieval practice with short-answer, multiple-choice, and hybrid tests. Memory. 2014;22(7):784-802.\u003c/li\u003e\n\u003cli\u003eMujeeb AM, Pardeshi ML, Ghongane BB. Comparative assessment of multiple choice questions versus short essay questions in pharmacology examinations. Indian J Med Sci. 2010;64(3):118-24.\u003c/li\u003e\n\u003cli\u003eBassett MH. Teaching Critical Thinking without (Much) Writing: Multiple-Choice and Metacognition. Teaching Theology \u0026amp; Religion. 2016;19(1):20-40.\u003c/li\u003e\n\u003cli\u003eKhan MU, Aljarallah BM. Evaluation of Modified Essay Questions (MEQ) and Multiple Choice Questions (MCQ) as a tool for Assessing the Cognitive Skills of Undergraduate Medical Students. Int J Health Sci (Qassim). 2011;5(1):39-43.\u003c/li\u003e\n\u003cli\u003eMedical Schools Council Assessment Alliance. Medical Schools Council Applied Knowledge Test Style Guide. July 2022.\u003c/li\u003e\n\u003cli\u003eArtsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354.\u003c/li\u003e\n\u003cli\u003eKumar D, Jaipurkar R, Shekhar A, Sikri G, Srinivas V. Item analysis of multiple choice questions: A quality assurance test for an assessment tool. Med J Armed Forces India. 2021;77(Suppl 1):S85-S9.\u003c/li\u003e\n\u003cli\u003eSim SM, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false-type multiple choice questions of a para-clinical multidisciplinary paper. Ann Acad Med Singap. 2006;35(2):67-71.\u003c/li\u003e\n\u003cli\u003eCoughlin PA, Featherstone CR. How to Write a High Quality Multiple Choice Question (MCQ): A Guide for Clinicians. Eur J Vasc Endovasc Surg. 2017;54(5):654-8.\u003c/li\u003e\n\u003cli\u003eRush BR, Rankin DC, White BJ. The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. BMC Med Educ. 
2016;16(1):250.\u003c/li\u003e\n\u003cli\u003eGilardi F, Alizadeh M, Kubli M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc Natl Acad Sci U S A. 2023;120(30):e2305016120.\u003c/li\u003e\n\u003cli\u003eLeo J, Kurdi G, Matentzoglu N, Parsia B, Sattler U, Forge S, et al. Ontology-Based Generation of Medical, Multi-term MCQs. International Journal of Artificial Intelligence in Education. 2019;29(2):145-88.\u003c/li\u003e\n\u003cli\u003ePalmer E, Devitt P. Constructing multiple choice questions as a method for learning. Ann Acad Med Singap. 2006;35(9):604-8.\u003c/li\u003e\n\u003cli\u003eMonaghan AM. Medical Teaching and Assessment in the Era of COVID-19. Journal of Medical Education and Curricular Development. 2020;7:238212052096525.\u003c/li\u003e\n\u003cli\u003ePlevris V, Papazafeiropoulos G, Jim\u0026eacute;nez Rios A. Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI. 2023;4(4):949-69.\u003c/li\u003e\n\u003cli\u003eBrin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492.\u003c/li\u003e\n\u003cli\u003eKung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepano C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.\u003c/li\u003e\n\u003cli\u003eKelley TL. The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology. 1939;30(1):17-24.\u003c/li\u003e\n\u003cli\u003eBedi S, Fleming SL, Chiang C-C, Morse K, Kumar A, Patel B, et al. QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams. medRxiv. 2024.\u003c/li\u003e\n\u003cli\u003eKlang E, Portugez S, Gross R, Kassif Lerner R, Brenner A, Gilboa M, et al. 
Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ. 2023;23(1):772.\u003c/li\u003e\n\u003cli\u003eLaupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Acad Med. 2024;99(5):508-12.\u003c/li\u003e\n\u003cli\u003eZuckerman M, Flood R, Tan RJB, Kelp N, Ecker DJ, Menke J, et al. ChatGPT for assessment writing. Med Teach. 2023;45(11):1224-7.\u003c/li\u003e\n\u003cli\u003eKiyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858-65.\u003c/li\u003e\n\u003cli\u003eCoskun O, Kiyak YS, Budakoglu II. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Med Teach. 2024:1-7.\u003c/li\u003e\n\u003cli\u003eO\u0026apos;Malley A, Veenhuizen M, Ahmed A. Ensuring Appropriate Representation in Artificial Intelligence-Generated Medical Imagery: Protocol for a Methodological Approach to Address Skin Tone Bias. JMIR AI. 2024;3:e58275.\u003c/li\u003e\n\u003cli\u003eFan BE, Chow M, Winkler S. Artificial Intelligence-Generated Facial Images for Medical Education. Medical Science Educator. 2023.\u003c/li\u003e\n\u003cli\u003eAli R, Tang OY, Connolly ID, Abdulrazeq HF, Mirza FN, Lim RK, et al. Demographic Representation in 3 Leading Artificial Intelligence Text-to-Image Generators. JAMA Surgery. 2024;159(1):87-95.\u003c/li\u003e\n\u003cli\u003eM\u0026apos;Gadzah SAT, O\u0026apos;Malley A. Does a complex prompt alter the diagnostic accuracy of common ophthalmological conditions by GPT-4? Journal of Medical Internet Research. 2024.\u003c/li\u003e\n\u003cli\u003eMedical Schools Council (2022). \u003cem\u003eMedical Schools Applied Knowledge Test Style Guide\u003c/em\u003e. 
Version 2.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-education","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"meed","sideBox":"Learn more about [BMC Medical Education](http://bmcmededuc.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/meed/default.aspx","title":"BMC Medical Education","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-5666975/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5666975/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eBackground\u003c/p\u003e\n\u003cp\u003eRecent advancements in generative artificial intelligence have opened new avenues in educational methodologies, particularly in medical education. This study seeks to assess whether generative AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the Covid-era due to the prevalence of open-book examinations, and to augment the pool of formative assessment opportunities available to students. 
While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself.\u003c/p\u003e\n\u003cp\u003eSummary of Work\u003c/p\u003e\n\u003cp\u003eThis research utilized a commercially available AI large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions, adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) of the Scottish Graduate-Entry Medicine (ScotGEM) program. The AI-generated questions underwent quality-assurance screening to ensure compliance with the stipulated guidelines and LOs. A subset of these questions was then incorporated into an examination format alongside an equal number of human-authored questions and subsequently undertaken by a cohort of medical students. The performance of both AI-generated and human-authored questions was evaluated, focusing on facility and discrimination index as key metrics.\u003c/p\u003e\n\u003cp\u003eSummary of Results\u003c/p\u003e\n\u003cp\u003eThe screening process revealed that 69% of AI-generated SBAs were fit for inclusion in the examinations with little or no modifications required. Modifications, when necessary, were predominantly due to reasons such as the inclusion of \"all of the above\" options, usage of American English spellings, and non-alphabetized answer choices. 31% of questions were rejected for inclusion in the examinations, due to factual inaccuracies and non-alignment with students’ learning. 
When included in an examination, post hoc statistical analysis indicated no significant difference in performance between the AI- and human-authored questions in terms of facility and discrimination index.\u003c/p\u003e\n\u003cp\u003eDiscussion and Conclusion\u003c/p\u003e\n\u003cp\u003eThe outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs. However, a robust quality assurance process is necessary to ensure that erroneous questions are identified and rejected. The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for a more reliable generation of curriculum-aligned questions. LLMs show significant potential in supplementing traditional methods of question generation in medical education. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education.\u003c/p\u003e","manuscriptTitle":"Quality assurance and validity of AI-generated Single Best Answer questions","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-02-05 06:53:38","doi":"10.21203/rs.3.rs-5666975/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-12-26T06:35:40+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-12-23T02:28:08+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-12-23T02:27:59+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Education","date":"2024-12-18T07:11:32+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-education","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"meed","sideBox":"Learn more about [BMC Medical 
Education](http://bmcmededuc.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/meed/default.aspx","title":"BMC Medical Education","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"dfe65e3e-69c7-4a13-a52b-6c474fd36f3d","owner":[],"postedDate":"February 5th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-03-03T16:07:13+00:00","versionOfRecord":{"articleIdentity":"rs-5666975","link":"https://doi.org/10.1186/s12909-025-06881-w","journal":{"identity":"bmc-medical-education","isVorOnly":false,"title":"BMC Medical Education"},"publishedOn":"2025-02-25 15:57:05","publishedOnDateReadable":"February 25th, 2025"},"versionCreatedAt":"2025-02-05 06:53:38","video":"","vorDoi":"10.1186/s12909-025-06881-w","vorDoiUrl":"https://doi.org/10.1186/s12909-025-06881-w","workflowStages":[]},"version":"v1","identity":"rs-5666975","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5666975","identity":"rs-5666975","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
