GPT-4 versus human authors in clinically complex MCQ creation: a blinded analysis of item quality | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article GPT-4 versus human authors in clinically complex MCQ creation: a blinded analysis of item quality Hannah Wu, Toby Zerner, Daniel Lee, Stefan Court-Kowalski, Peter Devitt, and 1 more This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-4831476/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract MCQs are a popular assessment format in medical education. Creating clinically complex MCQs can be a time-consuming task for subject matter experts. Large language models such as GPT-4, a type of generative artificial intelligence (AI), are a potential tool for MCQ design. Clinically complex human-generated MCQs, at both novice and expert level, were compared with AI MCQs. A generic prompt for GPT-4 was engineered, which included item-writing guidance, example MCQs, and key learning points. A standardised scoring system was developed for a consensus panel to objectively evaluate each item, blinded to the author, on categories including content validity, scope, item anatomy, cognitive skill level, item-writing flaws (IWFs), feedback comprehensiveness, veracity, adequacy of clinical reasoning, and global impression of fitness for use. Analysis showed that all groups (novice, expert, and AI) were able generate items within scope. Expert items performed better than Novice items in all categories. Expert items performed better than AI in content validity, feedback veracity and clinical reasoning. They also tended to test higher order cognitive skills. There was no difference in the global impressions of Expert and AI items, which suggests they may be comparable overall. With adequate prompt engineering, GPT-4 can produce MCQs testing clinically complex concepts for medical assessment. The quality of AI outputs is comparable to experts, however human validation is necessary to ensure content validity. The AI-generated explanatory feedback was adequate in veracity and clinical reasoning, which may serve as an educational tool for learners. Multiple choice question higher order cognitive skills Bloom’s taxonomy artificial intelligence prompt engineering ChatGPT INTRODUCTION The multiple choice question (MCQ) is ubiquitous in medical education, able to efficiently test a broad range of content and, if well-constructed, can assess higher-order cognitive skills at least as effectively as open-ended formats (Schuwirth & Van Der Vleuten, 2003; Palmer et al., 2010; Hift, 2014). Constructing high-quality, context-rich MCQs typically requires a significant investment of time and expertise from subject matter experts, posing a perpetual challenge for institutions to develop and maintain a comprehensive bank of up-to-date exam questions. Content shortages often force faculties to reuse items verbatim in subsequent assessments, which can influence students’ learning strategy toward rote memorisation over conceptual understanding (Papinczak et al., 2012; Jobs et al., 2013). The use of lower-quality items with item-writing flaws compromise examination validity (Haladyna & Downing, 1989; Pham et al., 2018) and may disproportionately disadvantage higher-performing students (Tarrant & Ware, 2008). Even with the input of subject matter experts, maintaining consistently high quality, fidelity, novelty, and volume of assessment items is problematic for faculties. Standard workflows for generating assessment material are inherently resource-intensive and widely variable in output quality, and evidence-based interventions to enhance item construction - such as targeted faculty training (Naeem et al., 2012; Gupta et al., 2020) or peer review (Abozaid et al., 2017; Smeby et al., 2019) - only compound the resource requirement. These limitations have prompted exploration of alternative avenues for item generation at scale, including leveraging student authors (Shah et al., 2019; Pham et al., 2023) and, more recently, artificial intelligence. A Large Language Model (LLM) is a form of artificial intelligence (AI) comprising a neural network trained via machine learning against a vast corpus of text to perform Natural Language Processing tasks including content generation, summarisation, translation, classification, dynamic dialogue, and others. This versatile technology has been met with cautious optimism by educators eager to explore its potential to revolutionise learning, teaching, and assessment but cognisant of its limitations and pitfalls (Abd-Alrazaq et al., 2023; Benítez et al., 2024). For example, the tendency of LLMs to “hallucinate” - that is, to invent inaccurate or unfounded information and silently integrate it into its output - is unacceptable in the safety-critical field of healthcare (Giuffrè et al., 2024). The availability of publicly accessible interfaces, most notably OpenAI’s ChatGPT (Chat Generative Pre-trained Transformer), has precipitated an explosion of interest in the capabilities and operationalisation of these systems; among such uses, LLM-aided generation of MCQ items has emerged as an appealing avenue to address the challenges of assessment content development for medical faculties. However, the formal evidence base remains nascent and the ideal process for reliably and efficiently producing high-quality medical content with generative AI remains unknown. Although early research is accumulating regarding the basic feasibility of using ChatGPT to generate medical MCQs, methodologies and study quality are highly heterogeneous and reports of resulting MCQ item quality are mixed (Artsi et al., 2024). Structural characteristics of AI-generated items, encompassing cognitive skill level, flaws, and item anatomy, have been only sparsely evaluated. Klang et al. (2023) used ChatGPT-4 to generate a 210-item MCQ examination across internal medicine, surgery, obstetrics/gynaecology, psychiatry, and paediatrics, which was reviewed by specialist clinicians in each field blinded to the AI-generated nature of the items. Of the generated items, 15% were judged to require revision for structural flaws, ranging from 30% of the surgical items to 0% of the psychiatry items, and one item (0.5%) was deemed completely invalid. This appears to be an improvement from the previous version of ChatGPT. Using ChatGPT3.5, 60% of 40 dermatology items generated were judged unsuitable for examination use by two non-blinded board-certified dermatologists against criteria of accuracy, complexity, and clarity (Ayub et al., 2023), and 68% of 60 immunology items were either incorrect, significantly misleading, or only appropriate for use with substantial modification (Ngo et al., 2024). Evaluations of the psychometric performance of AI-generated items are even more scant: work at Gazi University in Turkey has included a study in which two ChatGPT-generated hypertension MCQs demonstrated acceptable discrimination (point-biserial correlations of 0.41 and 0.39) in a medical student examination (Kıyak et al., 2024), and another in which 15 general medical MCQs were generated, of which six (40%) demonstrated acceptable discrimination (point-biserial correlation >0.30) and ten (66%) were judged unsuitable for use (Coşkun et al., 2024). An examination of the appended items in Coskun et al. (2024) and Ngo et al. (2024) show items generated at low clinical complexity levels, testing knowledge and recall. There is a paucity of data on the ability of AI to generate high complexity clinical material emulating the advanced clinical reasoning expected of medical graduates. Comparative data evaluating AI-generated materials versus those authored by humans is also minimal. Cheung et al. (2023) performed a blinded expert analysis with fifty each of ChatGPT-generated versus human-generated MCQs, finding comparable average ratings of question appropriateness, clarity, relevance, discriminative power, and suitability for graduate-level examination. However, when compared head-to-head, human-generated items outperformed AI items on each scale in the majority of instances (60% of the time for cumulative rating score), with AI items showing wider variability on each metric. Among the factors that may explain this wide variability in output quality, besides the rapidly advancing technology that underpins the LLM functionality, is a lack of standardisation in input. The output quality of an LLM is critically dependent upon the ‘user prompt’, the specific instruction provided to direct its behaviour and generate desired outputs (Giray, 2023; Heston & Khun, 2023). ‘Prompt engineering’ refers to the iterative process by which users of LLMs critically evaluate and craft their prompts to optimise responses. This process may involve serially refining prompts based on outputs; adding contextualising information such as target audience, key learning points, desired level of complexity, etc; and providing example content for the model to emulate. Specific reference to action verbs from Bloom’s taxonomy has also been recommended (Jacobsen & Weber, 2023). It is perhaps partly explanatory to note that the higher-quality outputs reported in the literature were generated using prompts that included a full example examination (Klang et al., 2023) or relevant chapters from two authoritative textbooks (Cheung et al., 2023), although lower-quality outputs were also derived from credible source material such as Continuing Medical Education articles from a top-tier specialty journal (Ayub et al., 2023). A thorough understanding of prompt engineering may help educators and curriculum designers produce quality relevant material, and as such it is likely a necessary skill for effectively leveraging this technology (Heston, 2023; Meskó, 2023). From this survey of the available evidence, it is clear there is much to learn about the inherent capabilities, influencing factors, and optimisation processes of generative AI in producing medical MCQ materials. Nonetheless, the potential for AI to reduce the resource burden of item generation for medical faculties holds significant appeal, if its operationalisation can be optimised and its output validated. Aims Evaluate and compare the structural qualities of clinically complex MCQs authored by humans (both novice and expert constructors) versus Generative Pre-trained Transformer 4 (GPT-4)-generated items; and Evaluate the comprehensiveness and articulated clinical reasoning of explanatory feedback on items generated by humans (both novice and expert) and GPT-4. Research questions: What is the structural quality of clinically complex MCQs generated by GPT-4? Can GPT-4 articulate accurate explanatory feedback for clinically complex MCQs? What points in the workflow of AI-assisted MCQ generation still necessitate human input? METHODS This mixed-methods study was performed to critically evaluate the structural quality of AI-generated MCQs in comparison to human-generated items. AI items were prospectively generated, while human-authored items were retrospectively sourced from an existing content bank, as further detailed below. Item structure and test blueprint construction A single-best-answer MCQ format was employed. A complete item included a detailed contextual stem, a question, five options (with correct answer indicated), and explanatory text articulating the logic of correct versus incorrect options. As explained below, the explanatory text was evaluated separately from the other elements due to it being not universally included in standard MCQs. A test blueprint was constructed to emulate a standard medical school examination at the level of the graduating student. Content areas included Medicine, Surgery, Paediatrics, Obstetrics, Gynaecology, Psychiatry, Population Health, and General Practice. 125 items were included, comprised of 40 from three sources, Novices, Experts, and AI. A surplus of 5 Expert human-generated items were included in the scoring process, as they were intended for future use in a mock examination where only satisfactory items would be included, and an element of redundancy was required for that purpose. An excerpt of the test blueprint is given in Appendix 1 . Development of a standardised scoring system A standardised scoring rubric was developed to facilitate consistent evaluation of human and AI-generated MCQs. This incorporated elements such as content validity (encompassing factual accuracy, fidelity, and realism), scope, correct item anatomy, specific item-writing flaws, and cognitive skill level. This rubric drew on established frameworks, including modified Bloom’s taxonomy and item-writing guidelines (Haladyna et al., 2002), and is presented as Tables 1-3 . A global impression criterion was included as a proxy for whether the item was considered fit for use in a summative examination for graduating medical students. A separate secondary evaluation was undertaken to assess the quality of each item’s explanatory feedback text for comprehensiveness, veracity, and articulation of clinical reasoning. Table 1. Standardised scoring rubric of all MCQs CORE ITEM ELEMENT Score key Content validity: The item has content validity, being factually accurate and realistic to clinical practice Entirely does not meet criteria Mostly does not meet criteria Mostly meets criteria Entirely meets criteria Within scope: The item tests concepts that are within scope for the target audience of a graduating medical student Entirely does not meet criteria Mostly does not meet criteria Mostly meets criteria Entirely meets criteria Item anatomy: The anatomy of the item is correct and complete Entirely does not meet criteria Mostly does not meet criteria Mostly meets criteria Entirely meets criteria Item-writing flaws (IWF): How many item-writing flaws are present? What type of flaws are present?* ● Content ● Style ● Formatting ● Stem ● Options ● Numeric count ● Type of IWF also documented Cognitive skill level: What is the cognitive skill level of the item? Using a modified Bloom’s taxonomy: Level I: Remembering Level II: Understanding Level III: Applying, analyzing, evaluating, and creating Global impression (structural): Global impression of the stem, question, and options: This item is fit for use in a summative examination for graduating med student No (unsalvageable) No (major further editing) Yes (minor further editing) Yes (no further editing) SCORING OF EXPLANATORY TEXT Score key Feedback comprehensiveness: The feedback was appropriately comprehensive, addressing the correct option and distractors Entirely does not meet criteria Mostly does not meet criteria Mostly meets criteria Entirely meets criteria Feedback veracity and clinical reasoning: The science and clinical reasoning in the feedback was satisfactory 1. Entirely does not meet criteria 2. Mostly does not meet criteria 3. Mostly meets criteria 4. Entirely meets criteria Global impression (overall): This item, including its written feedback, is fit for use in a summative examination for graduating med student No (unsalvageable) No (major further editing) Yes (minor further editing) Yes (no further editing) * Referencing item-writing guidelines as laid out by (Haladyna et al., 2002). Table 2. Modified Bloom’s taxonomy Level Cognitive domains Level I Remember (identifying and retrieving information) Level II Understand (interpreting and summarizing information) Level III Apply, analyze, evaluate, and create (implementing, organizing, and critiquing information) Table 3. Examples of item-writing guidelines (adapted from Haladyna et al., 2002). ● Content concerns : Use novel material to test higher level learning. Paraphrase textbook language or language used during instruction when used in a test item to avoid testing for recall. ● Formatting concerns : Format the item vertically instead of horizontally ● Style concerns : Use correct grammar, punctuation, capitalisation, and spelling ● Writing the stem : Include the central idea in the stem instead of the choices; word the stem positively, avoid negatives such as NOT or EXCEPT. ● Writing the choices : Place choices in a logical or numerical order; keep the length of choices about equal; avoid All-of-the-above; avoid giving clues to the right answer, such as pairs or triplets of options that clue the test-taker to the correct choice; make all distractors plausible. Human-generated MCQs – Novice and Expert A total of 85 human-generated MCQs were sourced from an existing Australian commercial medical education provider (eMedici2 Pty Ltd, Adelaide, Australia; https://emedici.com ). This content bank is derived from submissions by medical students and junior doctors, which pass through a pipeline of peer review, expert clinician review, and editorial approval prior to acceptance. Human authors are provided detailed written item-writing guidelines referencing style and item anatomy, at the time of item submission. Only items tagged as testing higher-order cognitive skills were included in this study. Each item was otherwise randomly selected from the content bank based on its recorded topic by two authors who were otherwise blinded to the content of the item. Of the 85 human items, 40 were written by a non-expert and had not passed through a peer review or other editorial process, and as such were deemed at ‘Novice’ level of authorship, while 45 had been edited and/or approved by subject matter experts and were thus deemed ‘Expert’ level. Subject areas were matched between groups. AI-generated MCQs GPT-4 was used in this study (model number: gpt-4-0125-preview) based on favourable reported performance against the Massive Multi-task Language Understanding benchmarks (OpenAI, 2023). A programmed script was used in GPT-4 with the prompt and key learning points - created as below - to generate outputs in an unsupervised fashion such that all authors were blinded to the GPT-4 outputs. Prompt engineering Construction of a tailored prompt for GPT-4 took place across three reference group meetings by the six-panel team (the authors) who have broad educational, item-writing, clinical, and technical expertise. The aim was to develop a generic prompt template that maximised the potential of GPT-4 to produce structurally sound items testing higher order cognitive skills and could be easily adapted for a wide range of learning points or item topics with minimum subsequent human effort. The prompt was engineered incrementally, with each output assessed subjectively until the quality was deemed to be at the ceiling point. The prompt template included: Information on the setting and the target audience of the MCQ; The inclusions and exclusions in the clinical stem to meet basic item anatomy requirements; Advice on avoidance of specific item-writing flaws, with instructions sourced from a full taxonomy of item-writing guidelines by Haladyna et al. (2002), as outlined in Table 3; Instruction on the number of question options and distractors; Instruction to produce explanatory feedback including clinical reasoning for the answer and distractors of the MCQ; Instruction to include references to recent peer-reviewed articles; Five examples of peer-reviewed, high quality MCQs covering a range of medical topics; and A key learning point of the intended MCQ in the form of a factual statement, which included the question topic (in accordance with the test blueprint – Appendix 1 ). Variability of GPT-4 outputs Among the input variables of the GPT-4 interface is ‘temperature’, which broadly determines the level of variety in subsequently generated text. This parameter ranges from 0 to 2, with a lower value resulting in more consistent outputs. Preliminary investigations have yet to identify the ideal temperature for medical MCQ generation (Agarwal et al., 2024), and it is likely to vary in different settings. To maximise reproducibility, we used a temperature of 0.0. To confirm the predictability of outputs at temperature 0.0, six learning points were used to generate three consecutive outputs without interval prompt modification. These 18 items were evaluated by a consensus panel of five authors against the scoring rubric, then independently reviewed by another author. The results of this variability testing are given in Appendix 2 . References The veracity of the references generated by GPT-4 to support the generated feedback, were also evaluated from the 18 ‘variability testing’ items. These were evaluated against the criteria: ‘The references included were real, relevant to the MCQ, formatted, and peer-reviewed.’ The items were scored on a scale of 1 (entirely does not meet criteria), 2 (mostly does not meet criteria), 3 (mostly meets criteria), to 4 (entirely meets criteria). Specific inaccuracies in the references were documented. Item appraisal - consensus panel scoring All items, AI and human-generated, were pooled and then evaluated in random order by a consensus panel of five authors blinded to the origin of the item using the prespecified scoring rubric. One duplicate item was identified and excluded. Examples of a novice, expert, and AI-generated item used in this study are presented in Appendix 3 . The panel, by majority vote, also recorded their prediction for whether the item was authored by a novice, expert, or GPT-4. Ethics approval This project received approval from the University of Adelaide Human Research Ethics Committee (HREC-2023-285). Data analysis There was no identifiable data involved in this study. Mean scores for measures of item quality between author types were compared using ANOVA with post-hoc Bonferroni or Tamhane tests as appropriate (the latter was used where the largest variance was at double or higher compared to the lowest). The distribution of the global impression scores (including and excluding feedback) were tallied and presented as a percentage of items. The distribution of item-writing flaws and cognitive skill level were tallied by author type. Summary descriptions are provided for the frequency of correct answer identification and placement, and assessments of quality of referencing. A p value <0.05 was considered significant. RESULTS Item quality evaluation - individual structural characteristics The scoring of items based on their source (Novice, Expert, AI) is summarised in Table 4 . One duplicate Novice item was identified and removed prior to data analysis. For human-authored MCQs, Expert items outperformed Novice items in all categories except for appropriateness of scope and the comprehensiveness of explanatory text, for which no difference was observed between any groups. Expert items modestly outperformed AI items in mean content validity scores (3.98 vs 3.73, p<0.001), cognitive skill level (2.58 vs 2.25, p<0.05), and clinical reasoning in feedback text (3.96 vs 3.65, p<0.001), while no significant difference was observed in item anatomy, scope, number of item-writing flaws, or feedback comprehensiveness. AI items modestly outperformed Novice items in item-writing flaws (0.80 vs 1.33, p<0.05); no significant difference was observed for all other parameters. Table 4: Item quality scores by author group. Expert (n=45) Novice (n=39) AI (n=40) Group differences CORE ITEM ELEMENTS Content validity (1-4) 3.98 (0.15) 3.33 (0.90) 3.73 (0.60) EN**, EA** Within scope (1-4) 3.89 (0.38) 3.92 (0.35) 4 (0) n/s Item anatomy (1-4) 4 (0) 3.79 (0.41) 3.90 (0.30) EN* Item writing flaws (count) 0.82 (0.91) 1.33 (1.24) 0.80 (1.04) EN*, NA* Cognitive skill level (1-3) 2.58 (0.54) 2.26 (0.64) 2.25 (0.54) EN*, EA* EXPLANATORY TEXT Comprehensiveness (1-4) 3.96 (0.20) 3.79 (0.52) 3.85 (0.36) n/s Veracity and clinical reasoning (1-4) 3.96 (0.21) 3.41 (0.85) 3.65 (0.58) EN**, EA** The nature of any significant difference is given in the final column. ‘EN’ denotes a statistically significant difference between Expert and Novice items, ‘EA’ denotes a difference between Expert and AI items, and ‘NA’ denotes a difference between Novice and AI items. All data are presented as mean (SD). Asterisks denote P<0.05 (*) or P<0.001(*). ‘n/s’ denotes no significant difference. Item quality evaluation - global impressions Global impression scores of structural quality for each item are summarised in Table 5 . Overall, excluding the explanatory text section, Expert items were most frequently deemed fit for purpose with or without minor edits (95%), followed by AI items (86%), then Novice items (61%). Only Novice items included some deemed unsalvageable (13%). Including the explanatory text section resulted in Expert items deemed most fit for purpose with or without minor edits (95%), followed by AI items (85%), then Novice items (59%). Table 5: Summary of global impression scores for item structure, by group. Score counts are given as n (%), mean scores are given as mean (SD). Asterisks (**) denote a difference versus the other groups with P<0.001. GLOBAL IMPRESSION SCORE 1 2 3 4 Mean score Fit for purpose No No Yes Yes – Edits required Unsalvageable Major Minor None – EXCLUDING EXPLANATORY TEXT SECTION Expert (n=45) 0 (0%) 2 (4%) 19 (42%) 24 (53%) 3.49 (0.59) Novice (n=39) 5 (13%) 10 (26%) 13 (33%) 11 (28%) 2.77 (1.01)** AI (n=40) 0 (0%) 6 (15%) 13 (33%) 21 (53%) 3.38 (0.74) INCLUDING EXPLANATORY TEXT SECTION Expert (n=45) 0 (0%) 2 (4%) 19 (42%) 24 (53%) 3.49 (0.59) Novice (n=39) 5 (13%) 11 (28%) 14 (36%) 9 (23%) 2.69 (0.98)** AI (n=40) 0 (0%) 6 (15%) 14 (35%) 20 (50%) 3.35 (0.74) When averaged within groups, mean scores for global impression whether including or excluding explanatory text were similar between Expert and AI items. However, AI items outperformed Novice items in global scores both including and excluding explanatory text (3.38 vs 2.77, p<0.001; and 3.35 vs 2.69, p<0.001; respectively). Item-writing flaws (IWFs) The number and type of IWFs per item across author groups is presented in Table 6 . Though the overall rate of IWFs differed only slightly between Expert, Novice, and AI items (0.8 vs 1.3 vs 0.8, respectively, p<0.05), Novice items were least likely to have zero IWFs and most likely to have 3 or more. Most IWFs belonged to the category ‘Writing the choices’ across all author groups, using the item-writing guidelines outlined in Table 3 . Table 6: Summary of item-writing flaws between groups. All data are presented as counts unless otherwise specified. Expert (n=45) Novice (n=39) AI (n=40) IWFs per item , count distribution 0 21 10 21 1 13 16 10 2 9 7 6 3 2 3 2 4 0 2 1 5 0 1 0 IWFs per item , mean (SD) 0.8 (0.91) 1.3 (1.24) 0.8 (1.04) IWFs by type , count distribution Content concerns 8 14 10 Style concerns 0 5 0 Formatting concerns 0 0 0 Writing the stem 11 14 2 Writing the choices 18 19 30 Cognitive skill level Table 7 summarises the distribution of cognitive skill level across author groups. The majority of Expert items were at cognitive skill level III, compared to level II when for Novice or AI items. Novice items were most likely to be assigned a cognitive skill level of I. Table 7: Distribution of assigned cognitive skill levels vis a vis Bloom’s modified taxonomy. Data presented as n (%). COGNITIVE SKILL LEVEL I II III Expert (n=45) 1 (2%) 17 (38%) 27 (60%) Novice (n=39) 4 (10%) 21 (54%) 14 (36%) AI (n=40) 2 (5%) 26 (65%) 12 (30%) Correct option veracity and placement The correct answer was appropriately indicated (i.e. the option indicated to be correct by the author was corroborated by panel consensus) in 100% of Expert, 90% of Novice, and 85% of AI-generated items. Significant differences were observed in the distribution of the correct option placement, as shown in Table 8 . Both AI-generated (45%, p=0.028) and Expert-authored (33%, p=0.002) items disproportionately positioned the correct answer as option C versus the Novice-authored items. Table 8: Distribution of correct option position. All data are presented as percentages. Option position of correct answer A B C D E Expert (n=45) 9 20 33 29 7 Novice (n=39) 15 33 13 26 13 AI (n=40) 8 15 45 25 8 References The references generated by GPT-4 varied in quality. Of the 18 variability testing items’ 52 references that were scored ( Appendix 2 ), the average score against the stated criteria was 3.06 out of 4 (range 2-4), with six items scoring 2, five items scoring 3, and seven deemed perfect with a score of 4. The most common flaws were an incorrect DOI, the reference being an old edition of a guideline, or incorrect details, as summarised in Table 9 . Table 9. Summary of types of errors in the references generated by GPT-4 Flaw in reference Count Incorrect DOI 7 Incorrect URL 2 Old edition 5 Incorrect details 5 Not peer reviewed 3 Non-existent 2 DISCUSSION This study has added to the growing evidence base demonstrating that pre-trained generative AI systems can produce medical MCQs of broadly comparable structural quality to expert item-writers, with several important caveats relevant to educators interested in leveraging this technology. MCQs remain ubiquitous in medical education, and the resource-intensiveness of generating high quality items has prompted the exploration of new avenues of content generation. We present the most comprehensive evaluation to date of clinical medical MCQs generated by GPT-4. Taken together, these results suggest significant efficiency gains are feasible in MCQ creation via the conscientious integration of generative AI but underscore the ongoing necessity of human expert review as part of such a workflow to ensure quality, veracity, and fitness-for-purpose of AI-generated materials. Granular appraisal of the intrinsic properties of MCQ items in this study yielded a rough hierarchy of quality from Novice-authored items (written by medical students or junior doctors and otherwise unedited), through AI-generated items (produced by GPT-4), to Expert items (written, edited, and/or approved by subject matter experts and experienced item-writers), with each successive group matching or outperforming the former on particular metrics. Regarding human authors, 39% of Novice items fell into a global impression category signifying either outright unsuitability or a need for major editing to achieve fitness for purpose, while 95% of Expert items were deemed fit for purpose requiring minor editing at most. This aligns with previously demonstrated distinctions between student-authored and expert-authored MCQs (Pham et al., 2023) and highlights that engaging medical trainees to author MCQs, while known to be beneficial for learning (Touissi et al., 2022), nonetheless necessitates a careful review process to guide item development to a usable standard. The interposition of AI capability between Novice and Expert human authors in our analysis is a novel finding, though perhaps intuitive to educators in light of the constellation of medical knowledge, clinical experience, and pedagogical training required to generate high-quality MCQs with practical relevance and verisimilitude. Existing comparative data is sparse but corroborative: the only other such study as yet found that expert authors tended to outperform AI on assessments of quality when MCQs on the same subject matter were compared head-to-head, having achieved comparable quality scores in aggregate (Cheung et al., 2023). As less direct context, GPT-4 has also been shown to summarise medical information into clinical synopses (Van Veen et al., 2024) or patient information sheets (Currie et al., 2023; Lockie & Choi, 2024; Verran, 2024) to standards that variably exceed, meet, or fall short of those by expert clinicians. Conversely, while AI has not been compared to student or trainee authors on the specific task of MCQ generation prior to our study, GPT-4’s ability to interpret clinical MCQ material has been repeatedly shown to rival or exceed that of medical students and trainees - cohorts corresponding to the Novice group in our dataset - in the profusion of reports in which it achieved passing grades in qualifying examinations around the world (Abbas et al., 2024; Knoedler et al., 2024; Maitland et al., 2024; Meyer et al., 2024; Rojas et al., 2024; Tanaka et al., 2024). Taken together, the available evidence loosely suggests that GPT-4’s proficiency in handling clinical MCQ materials falls somewhere toward the upper end of a range between students/trainees and experts. Focussing on the AI-generated items in this study, multiple results are important to highlight in evaluating the applicability of this technology to medical MCQ creation. AI items demonstrated high quality in aggregate, with 85% rated as fit for use upfront with minor edits at most, 95% deemed to test higher-order cognitive skills (modified Bloom’s level II or III), and excellent average scores across content validity (3.7/4), scope (4/4), and item anatomy (3.9/4). Though prior non-comparative studies have reported widely variable AI-generated MCQ quality (Artsi et al., 2024), our results support the view that expert-standard items at a clinically complex content level are within the capability of GPT-4. This high-quality output is likely closely dependent upon the input of a detailed prompt invoking best-practice item-writing principles, the inclusion of high quality example MCQs, and a well-articulated learning point, which was a feature of this study and strongly endorsed in literature (Meskó, 2023). Development of such a prompt - especially one that encompasses local requirements for structure, content, and style - necessitates human input at the outset of an AI-based workflow. A generic prompt that relies on the encoded clinical knowledge in the LLM minimises the human input required at subsequent stages of MCQ design. In this study, the provision of a factually verified key learning point by the human superseded the need to provide GPT-4 with any lengthy reference texts. However, it is also crucial to consider not just the average quality across an entire exam, but outputs on a per-item basis in a safety-critical field like healthcare where high minimum standards of quality and veracity are necessary for all individual assessment items (Giuffrè et al., 2024). In this regard, it must be emphasised that 1 in 7 AI-generated items were deemed unfit for use without major edits (versus 1 in 25 by human experts), 1 in 7 indicated an erroneous correct answer, and almost half had correct answers default positioned as option C. Any of these issues would severely compromise assessment validity if left unchecked, and their presence indicates the vital necessity of incorporating human expert review prior to deployment of AI-generated items. Some of these issues may arise from technical limitations of GPT-4 capability, but it is also very likely that the AI engine is recapitulating the shortcomings of its training data: for example, a pervasive ‘middle bias’ is known to exist in MCQ assessments (Attali & Bar‐Hillel, 2003), in which correct options are disproportionately clustered in middle positions (e.g. option C in an A-E structure), and it is likely that this tendency has been encoded into the engine via its training data – along with other more pernicious biases of which educators must be cognisant (Zack et al., 2024). Though detailed technical inquiry into LLM clinical reasoning was beyond our scope (readers are directed to Liévin et al. (2024), this study demonstrated GPT-4’s capability to generate cogent and cohesive explanatory text that, while marginally outperformed by human experts on ratings of veracity and articulated reasoning, ultimately equalled experts and exceeded novices on global impression. The complete fabrication of references by GPT-4 in this study also echoes known issues with LLM hallucination. This contrasts to prior literature in which 32-76% of generated MCQ explanations were deemed valid (Agarwal et al., 2023; Choi, 2023; Ngo et al., 2024), and may reflect interval improvement from previous versions of the GPT engine. Explanatory feedback to MCQs may enhance the acquisition or consolidation of contextualised knowledge and clinical reasoning for learners, functioning as a ‘virtual teaching assistant’. GPT-4 represents a highly efficient potential avenue to generate such material. LIMITATIONS ● This mixed-methods study used a combination of pre-existing (Novice and Expert) and prospectively generated (AI) items, which introduces a lack of standardisation between instructions given to human authors and the GPT-4 prompt that may have influenced the respective item quality. ● GPT-4 ‘temperature’: Preliminary investigations have yet to identify the ideal temperature for medical MCQ generation (Agarwal et al., 2024), and it is likely to vary in different settings. A temperature of 0.0 was selected in this study to maximise external validity and reproducibility but may potentially have constrained output quality. ● Cognitive level and IWFs were considered contributory to item quality, based on standard item-writing guidance, however impact on discriminatory power is empirically inconsistent (Tarrant & Ware, 2008; Caldwell & Pate, 2013; Ali & Ruit, 2015; Pais et al., 2016; Rush et al., 2016; Pham et al., 2018). ● This study explicitly focussed on items requiring higher-order cognitive processing. GPT-4 was recently shown to make the most errors with lower-order reasoning in a set of psychiatric MCQs (Herrmann-Werner et al., 2024), and the output quality in this study is therefore not generalisable to the creation of medical MCQs targeting these lower-order processes. FUTURE DIRECTIONS The results of an evaluation of the psychometric properties of AI-generated MCQs will be reported by the authors in a future article. Further areas of study include exploring the quality of outputs using other LLMs, exploring the role of clinical images in AI-assisted design of MCQs, and evaluating the educational value of interacting with AI-generated explanatory feedback on student learning. More sophisticated prompt engineering with automatic provision of reference texts could also be explored, as well as investigating fine-tuning processes of LLMs with large volumes of high quality MCQs. CONCLUSION In summary, this study suggests that while human experts most reliably produce superior-quality complex clinical MCQs, GPT-4 is capable of producing comparable items in most instances and outperforms human novices in this task. An AI-integrated workflow for creating such items would still necessitate direct human expert input at multiple points, including prompt engineering in accordance with item-writing guidelines and local pedagogy, formulation of curriculum-specific key learning points, and validation of subsequent outputs with editing as required. Declarations Project contributions HW and SCK conceived the project idea. HW and EP designed the study. DL constructed the ethics application, which HW contributed to. HW, DL, and SCK conducted the literature review. All authors contributed to the design of the standardised scoring rubric, prompt engineering, and to data collection. HW, PD, and TZ prepared the items for scoring. EP conducted the statistical analysis. HW and SCK drafted the manuscript, which all authors then contributed to and provided critical review. The tables were prepared by both HW and SCK. Acknowledgements None. Declaration of interest statement Authors HW, TZ, SCK, and PD are co-directors of eMedici, a commercial medical education platform. DL and EP have no conflicts of interest to declare. The authors did not receive support from any organization for the submitted work. References Abbas, A., Rehman, M. S., & Rehman, S. S. (2024). Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus , 16 (3). https://doi.org/10.7759/cureus.55991 Abd-Alrazaq, A., AlSaad, R., Alhuwail, D., Ahmed, A., Healy, P. M., Latifi, S., Aziz, S., Damseh, R., Alrazak, S. A., & Sheikh, J. (2023). Large language models in medical education: opportunities, challenges, and future directions. JMIR Medical Education , 9 (1), e48291. https://doi.org/10.2196/48291 Abozaid, H., Park, Y. S., & Tekian, A. (2017). Peer review improves psychometric characteristics of multiple choice questions. Medical teacher , 39 (sup1), S50-S54. https://doi.org/10.1080/0142159X.2016.1254743 Agarwal, A., Mittal, K., Doyle, A., Sridhar, P., Wan, Z., Doughty, J. A., Savelka, J., & Sakr, M. (2024). Understanding the Role of Temperature in Diverse Question Generation by GPT-4. Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 2, Agarwal, M., Goswami, A., & Sharma, P. (2023). Evaluating ChatGPT-3.5 and Claude-2 in answering and explaining conceptual medical physiology multiple-choice questions. Cureus , 15 (9). https://doi.org/10.7759/cureus.46222 Ali, S. H., & Ruit, K. G. (2015). The Impact of item flaws, testing at low cognitive level, and low distractor functioning on multiple-choice question quality. Perspectives on medical education , 4 , 244-251. https://doi.org/10.1007/s40037-015-0212-x Artsi, Y., Sorin, V., Konen, E., Glicksberg, B. S., Nadkarni, G., & Klang, E. (2024). Large language models for generating medical examinations: systematic review. BMC Medical Education , 24 (1), 354. https://doi.org/10.1186/s12909-024-05239-y Attali, Y., & Bar‐Hillel, M. (2003). Guess where: The position of correct answers in multiple‐choice test items as a psychometric variable. Journal of Educational Measurement , 40 (2), 109-128. https://doi.org/10.1111/j.1745-3984.2003.tb01099.x Ayub, I., Hamann, D., Hamann, C. R., & Davis, M. J. (2023). Exploring the potential and limitations of chat generative pre-trained transformer (ChatGPT) in generating board-style dermatology questions: a qualitative analysis. Cureus , 15 (8). Benítez, T. M., Xu, Y., Boudreau, J. D., Kow, A. W. C., Bello, F., Van Phuoc, L., Wang, X., Sun, X., Leung, G. K.-K., & Lan, Y. (2024). Harnessing the potential of large language models in medical education: promise and pitfalls. Journal of the American Medical Informatics Association , 31 (3), 776-783. https://academic.oup.com/jamia/article-abstract/31/3/776/7588721?redirectedFrom=fulltext Caldwell, D. J., & Pate, A. N. (2013). Effects of question formats on student and item performance. American journal of pharmaceutical education , 77 (4), 71. https://doi.org/10.5688/ajpe77471 Cheung, B. H. H., Lau, G. K. K., Wong, G. T. C., Lee, E. Y. P., Kulkarni, D., Seow, C. S., Wong, R., & Co, M. T.-H. (2023). ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong SAR, Singapore, Ireland, and the United Kingdom). PLoS ONE , 18 (8), e0290691. https://doi.org/10.1371/journal.pone.0290691 Choi, W. (2023). Assessment of the capacity of ChatGPT as a self-learning tool in medical pharmacology: a study using MCQs. BMC Medical Education , 23 (1), 864. https://doi.org/10.1186/s12909-023-04832-x Coşkun, Ö., Kıyak, Y. S., & Budakoğlu, I. İ. (2024). ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Medical teacher , 1-7. https://doi.org/10.1080/0142159X.2024.2327477 Currie, G., Robbie, S., & Tually, P. (2023). ChatGPT and patient information in nuclear medicine: GPT-3.5 versus GPT-4. Journal of Nuclear Medicine Technology , 51 (4), 307-313. https://doi.org/10.2967/jnmt.123.266151 Giray, L. (2023). Prompt engineering with ChatGPT: a guide for academic writers. Annals of biomedical engineering , 51 (12), 2629-2633. https://doi.org/10.1007/s10439-023-03272-4 Giuffrè, M., You, K., & Shung, D. L. (2024). Evaluating ChatGPT in medical contexts: the imperative to guard against hallucinations and partial accuracies. Clinical Gastroenterology and Hepatology , 22 (5), 1145-1146. https://www.cghjournal.org/article/S1542-3565(23)00835-2/pdf Gupta, P., Meena, P., Khan, A. M., Malhotra, R. K., & Singh, T. (2020). Effect of faculty training on quality of multiple-choice questions. International Journal of Applied and Basic Medical Research , 10 (3), 210-214. https://doi.org/10.4103/ijabmr.IJABMR_30_20 Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied measurement in education , 2 (1), 51-78. https://doi.org/10.1207/s15324818ame0201_4 Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied measurement in education , 15 (3), 309-333. https://doi.org/10.1207/S15324818AME1503_5 Herrmann-Werner, A., Festl-Wietek, T., Holderried, F., Herschbach, L., Griewatz, J., Masters, K., Zipfel, S., & Mahling, M. (2024). Assessing ChatGPT’s mastery of Bloom’s taxonomy using psychosomatic medicine exam questions: mixed-methods study. Journal of medical Internet research , 26 , e52113. https://doi.org/10.2196/52113 Heston, T. F. (2023). Prompt engineering for students of medicine and their teachers. arXiv preprint arXiv:2308.11628 . Heston, T. F., & Khun, C. (2023). Prompt engineering in medical education. International Medical Education , 2 (3), 198-205. https://doi.org/10.3390/ime2030019 Hift, R. J. (2014). Should essays and other “open-ended”-type questions retain a place in written summative assessment in clinical medicine? BMC Medical Education , 14 , 1-18. https://doi.org/10.1186/s12909-014-0249-2 Jacobsen, L. J., & Weber, K. E. (2023). The promises and pitfalls of ChatGPT as a feedback provider in higher education: An exploratory study of prompt engineering and the quality of AI-driven feedback. https://doi.org/10.31219/osf.io/cr257 Jobs, A., Twesten, C., Göbel, A., Bonnemeier, H., Lehnert, H., & Weitz, G. (2013). Question-writing as a learning tool for students–outcomes from curricular exams. BMC Medical Education , 13 , 1-7. https://doi.org/10.1186/1472-6920-13-89 Kıyak, Y. S., Coşkun, Ö., Budakoğlu, I. İ., & Uluoğlu, C. (2024). ChatGPT for generating multiple-choice questions: evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. European journal of clinical pharmacology , 80 (5), 729-735. https://doi.org/10.1007/s00228-024-03649-x Klang, E., Portugez, S., Gross, R., Brenner, A., Gilboa, M., Ortal, T., Ron, S., Robinzon, V., Meiri, H., & Segal, G. (2023). Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Medical Education , 23 . https://doi.org/10.1186/s12909-023-04752-w Knoedler, L., Alfertshofer, M., Knoedler, S., Hoch, C. C., Funk, P. F., Cotofana, S., Maheta, B., Frank, K., Brébant, V., & Prantl, L. (2024). Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis. JMIR Medical Education , 10 (1), e51148. https://doi.org/10.2196/51148 Liévin, V., Hother, C. E., Motzfeldt, A. G., & Winther, O. (2024). Can large language models reason about medical questions? Patterns , 5 (3). Lockie, E., & Choi, J. (2024). Evaluation of a chat GPT generated patient information leaflet about laparoscopic cholecystectomy. ANZ Journal of Surgery , 94 (3), 353-355. https://doi.org/10.1111/ans.18834 Maitland, A., Fowkes, R., & Maitland, S. (2024). Can ChatGPT pass the MRCP (UK) written examinations? Analysis of performance and errors using a clinical decision-reasoning framework. BMJ open , 14 (3), e080558. https://doi.org/10.1136/bmjopen-2023-080558 Meskó, B. (2023). Prompt engineering as an important emerging skill for medical professionals: tutorial. Journal of medical Internet research , 25 , e50638. Meyer, A., Riese, J., & Streichert, T. (2024). Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Medical Education , 10 , e50965. https://doi.org/10.2196/50965 Naeem, N., van der Vleuten, C., & Alfaris, E. A. (2012). Faculty development on item writing substantially improves item quality. Advances in health sciences education , 17 , 369-376. https://doi.org/10.1007/s10459-011-9315-2 Ngo, A., Gupta, S., Perrine, O., Reddy, R., Ershadi, S., & Remick, D. (2024). ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Academic Pathology , 11 (1), 100099. https://www.academicpathologyjournal.org/article/S2374-2895(23)00031-3/pdf OpenAI. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774 . Pais, J., Silva, A., Guimarães, B., Povo, A., Coelho, E., Silva-Pereira, F., Lourinho, I., Ferreira, M. A., & Severo, M. (2016). Do item-writing flaws reduce examinations psychometric quality? BMC Research Notes , 9 , 1-7. https://doi.org/10.1186/s13104-016-2202-4 Palmer, E. J., Duggan, P., Devitt, P. G., & Russell, R. (2010). The modified essay question: its exit from the exit examination? Medical teacher , 32 (7), e300-e307. https://doi.org/10.3109/0142159X.2010.488705 Papinczak, T., Peterson, R., Babri, A. S., Ward, K., Kippers, V., & Wilkinson, D. (2012). Using student-generated questions for student-centred assessment. Assessment & Evaluation in Higher Education , 37 (4), 439-452. https://doi.org/10.1080/02602938.2010.538666 Pham, H., Besanko, J., & Devitt, P. (2018). Examining the impact of specific types of item-writing flaws on student performance and psychometric properties of the multiple choice question. MedEdPublish , 7 . https://doi.org/10.15694/mep.2018.0000225.1 Pham, H., Court-Kowalski, S., Chan, H., & Devitt, P. (2023). Writing Multiple Choice Questions—Has the Student Become the Master? Teaching and Learning in Medicine , 35 (3), 356-367. https://doi.org/10.1080/10401334.2022.2050240 Rojas, M., Rojas, M., Burgess, V., Toro-Pérez, J., & Salehi, S. (2024). Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study. JMIR Medical Education , 10 , e55048. https://doi.org/10.2196/55048 Rush, B. R., Rankin, D. C., & White, B. J. (2016). The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. BMC Medical Education , 16 , 1-10. https://doi.org/10.1186/s12909-016-0773-3 Schuwirth, L. W., & Van Der Vleuten, C. P. (2003). Written assessment.(ABC of learning and teaching in medicine). British Medical Journal , 326 (7390), 643-646. https://doi.org/10.1136/bmj.326.7390.643 Shah, M. P., Lin, B. R., Lee, M., Kahn, D., & Hernandez, E. (2019). Student-written multiple-choice questions—a practical and educational approach. Medical Science Educator , 29 , 41-43. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8368101/pdf/40670_2018_Article_646.pdf Smeby, S. S., Lillebo, B., Gynnild, V., Samstad, E., Standal, R., Knobel, H., Vik, A., & Slørdahl, T. S. (2019). Improving assessment quality in professional higher education: Could external peer review of items be the answer? Cogent Medicine , 6 (1), 1659746. https://doi.org/10.1080/2331205X.2019.1659746 Tanaka, Y., Nakata, T., Aiga, K., Etani, T., Muramatsu, R., Katagiri, S., Kawai, H., Higashino, F., Enomoto, M., & Noda, M. (2024). Performance of generative pretrained transformer on the national medical licensing examination in Japan. PLOS Digital Health , 3 (1), e0000433. https://doi.org/10.1371/journal.pdig.0000433 Tarrant, M., & Ware, J. (2008). Impact of item‐writing flaws in multiple‐choice questions on student achievement in high‐stakes nursing assessments. Medical Education , 42 (2), 198-206. https://doi.org/10.1111/j.1365-2923.2007.02957.x Touissi, Y., Hjiej, G., Hajjioui, A., Ibrahimi, A., & Fourtassi, M. (2022). Does developing multiple-choice questions improve medical students’ learning? A systematic review. Medical Education Online , 27 (1), 2005505. https://doi.org/10.1080/10872981.2021.2005505 Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J.-B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., & Seehofnerová, A. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature medicine , 30 (4), 1134-1142. https://doi.org/10.1038/s41591-024-02855-5 Verran, C. (2024). Artificial intelligence-generated patient information leaflets: a comparison of contents according to British Association of Dermatologists standards. Clinical and Experimental Dermatology , llad461. https://doi.org/10.1093/ced/llad461 Zack, T., Lehman, E., Suzgun, M., Rodriguez, J. A., Celi, L. A., Gichoya, J., Jurafsky, D., Szolovits, P., Bates, D. W., & Abdulnour, R.-E. E. (2024). Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. The Lancet Digital Health , 6 (1), e12-e22. https://doi.org/10.1016/S2589-7500(23)00225-X (Erratum in: Lancet Digit Health. 2024 Jul;6(7):e445. doi: 10.1016/S2589-7500(24)00120-1) Additional Declarations Competing interest reported. Authors HW, TZ, SCK, and PD are co-directors of eMedici, a commercial medical education platform. DL and EP have no conflicts of interest to declare. The authors did not receive support from any organization for the submitted work Supplementary Files Appendix.docx Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4831476","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":337631779,"identity":"6cbe067b-609d-45aa-8d4b-94c392a6690b","order_by":0,"name":"Hannah Wu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABD0lEQVRIiWNgGAWjYBACAxDB2AAk2BsYJBgKGHgYDgA5PERp4TkA1GJAkhaJBLAWBoJazNnPHnvwcwdD4oabjx/e+GBgJ8N3vIHxwds2BnmDA9i1WPbkpRv2ngFquZ1mbDnDIJlH8swBZsO5bQyGG3BoMTiQYybB28aQu+F2gpk0jwEzj8GNBDZpoAgjTi3n35hJ/gVpuXn8m/Qfg3oeg/sP2H8Dtdjj1HIjx0wabMsNHjNpBoPDQFsY2JiBIom4tbwxk5Ztk6ifeSan2LLH4DjQL4nNknPOSSTPxOmwHDPJt202xnzHj2+88aOi2p7v+OGDH96U2dj24dACBRLIHEg04VU/CkbBKBgFowA/AAAgfV9NvgK/3wAAAABJRU5ErkJggg==","orcid":"","institution":"eMedici","correspondingAuthor":true,"prefix":"","firstName":"Hannah","middleName":"","lastName":"Wu","suffix":""},{"id":337631780,"identity":"b4642df4-33b3-4750-8f4e-21f4d2790290","order_by":1,"name":"Toby Zerner","email":"","orcid":"","institution":"eMedici","correspondingAuthor":false,"prefix":"","firstName":"Toby","middleName":"","lastName":"Zerner","suffix":""},{"id":337631781,"identity":"080c551d-634e-446b-a226-fde6a1c942ee","order_by":2,"name":"Daniel Lee","email":"","orcid":"","institution":"University of Adelaide","correspondingAuthor":false,"prefix":"","firstName":"Daniel","middleName":"","lastName":"Lee","suffix":""},{"id":337631782,"identity":"a4cb775d-2223-44ef-8905-1166fc5c0b45","order_by":3,"name":"Stefan Court-Kowalski","email":"","orcid":"","institution":"eMedici","correspondingAuthor":false,"prefix":"","firstName":"Stefan","middleName":"","lastName":"Court-Kowalski","suffix":""},{"id":337631783,"identity":"cf277a8f-b67c-4c89-9fd0-9b3bf7d719b4","order_by":4,"name":"Peter Devitt","email":"","orcid":"","institution":"eMedici","correspondingAuthor":false,"prefix":"","firstName":"Peter","middleName":"","lastName":"Devitt","suffix":""},{"id":337631784,"identity":"e5ef3c04-30d1-47f9-8bf0-6d512d306aea","order_by":5,"name":"Edward Palmer","email":"","orcid":"","institution":"University of Adelaide","correspondingAuthor":false,"prefix":"","firstName":"Edward","middleName":"","lastName":"Palmer","suffix":""}],"badges":[],"createdAt":"2024-07-31 00:23:32","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4831476/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4831476/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":63973129,"identity":"40011449-a974-419d-aa85-c62a9ed833b9","added_by":"auto","created_at":"2024-09-04 11:36:11","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1027478,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4831476/v1/da090a1c-26bb-4ec4-ae38-3e29e828ba8f.pdf"},{"id":62922321,"identity":"4d071b12-64c3-4ba9-91ec-3912a70e2eaf","added_by":"auto","created_at":"2024-08-21 05:55:31","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":26224,"visible":true,"origin":"","legend":"","description":"","filename":"Appendix.docx","url":"https://assets-eu.researchsquare.com/files/rs-4831476/v1/7637e92efd533b545ce14ffe.docx"}],"financialInterests":"Competing interest reported. Authors HW, TZ, SCK, and PD are co-directors of eMedici, a commercial medical education platform. DL and EP have no conflicts of interest to declare. The authors did not receive support from any organization for the submitted work","formattedTitle":"GPT-4 versus human authors in clinically complex MCQ creation: a blinded analysis of item quality","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eThe multiple choice question (MCQ) is ubiquitous in medical education, able to efficiently test a broad range of content and, if well-constructed, can assess higher-order cognitive skills at least as effectively as open-ended formats (Schuwirth \u0026amp; Van Der Vleuten, 2003; Palmer et al., 2010; Hift, 2014). Constructing high-quality, context-rich MCQs typically requires a significant investment of time and expertise from subject matter experts, posing a perpetual challenge for institutions to develop and maintain a comprehensive bank of up-to-date exam questions.\u0026nbsp; Content shortages often force faculties to reuse items verbatim in subsequent assessments, which can influence students\u0026rsquo; learning strategy toward rote memorisation over conceptual understanding (Papinczak et al., 2012; Jobs et al., 2013). The use of lower-quality items with item-writing flaws compromise examination validity (Haladyna \u0026amp; Downing, 1989; Pham et al., 2018) and may disproportionately disadvantage higher-performing students (Tarrant \u0026amp; Ware, 2008).\u003c/p\u003e\n\u003cp\u003eEven with the input of subject matter experts, maintaining consistently high quality, fidelity, novelty, and volume of assessment items is problematic for faculties. Standard workflows for generating assessment material are inherently resource-intensive and widely variable in output quality, and evidence-based interventions to enhance item construction - such as targeted faculty training (Naeem et al., 2012; Gupta et al., 2020) or peer review (Abozaid et al., 2017; Smeby et al., 2019) - only compound the resource requirement. These limitations have prompted exploration of alternative avenues for item generation at scale, including leveraging student authors (Shah et al., 2019; Pham et al., 2023) and, more recently, artificial intelligence.\u003c/p\u003e\n\u003cp\u003eA Large Language Model (LLM) is a form of artificial intelligence (AI) comprising a neural network trained via machine learning against a vast corpus of text to perform Natural Language Processing tasks including content generation, summarisation, translation, classification, dynamic dialogue, and others. This versatile technology has been met with cautious optimism by educators eager to explore its potential to revolutionise learning, teaching, and assessment but cognisant of its limitations and pitfalls (Abd-Alrazaq et al., 2023; Ben\u0026iacute;tez et al., 2024). For example, the tendency of LLMs to \u0026ldquo;hallucinate\u0026rdquo; - that is, to invent inaccurate or unfounded information and silently integrate it into its output - is unacceptable in the safety-critical field of healthcare (Giuffr\u0026egrave; et al., 2024). The availability of publicly accessible interfaces, most notably OpenAI\u0026rsquo;s ChatGPT (Chat Generative Pre-trained Transformer), has precipitated an explosion of interest in the capabilities and operationalisation of these systems; among such uses, LLM-aided generation of MCQ items has emerged as an appealing avenue to address the challenges of assessment content development for medical faculties. However, the formal evidence base remains nascent and the ideal process for reliably and efficiently producing high-quality medical content with generative AI remains unknown.\u003c/p\u003e\n\u003cp\u003eAlthough early research is accumulating regarding the basic feasibility of using ChatGPT to generate medical MCQs, methodologies and study quality are highly heterogeneous and reports of resulting MCQ item quality are mixed (Artsi et al., 2024). Structural characteristics of AI-generated items, encompassing cognitive skill level, flaws, and item anatomy, have been only sparsely evaluated. Klang et al. (2023) used ChatGPT-4 to generate a 210-item MCQ examination across internal medicine, surgery, obstetrics/gynaecology, psychiatry, and paediatrics, which was reviewed by specialist clinicians in each field blinded to the AI-generated nature of the items. Of the generated items, 15% were judged to require revision for structural flaws, ranging from 30% of the surgical items to 0% of the psychiatry items, and one item (0.5%) was deemed completely invalid. This appears to be an improvement from the previous version of ChatGPT. Using ChatGPT3.5, 60% of 40 dermatology items generated were judged unsuitable for examination use by two non-blinded board-certified dermatologists against criteria of accuracy, complexity, and clarity (Ayub et al., 2023), and 68% of 60 immunology items were either incorrect, significantly misleading, or only appropriate for use with substantial modification (Ngo et al., 2024).\u003c/p\u003e\n\u003cp\u003eEvaluations of the psychometric performance of AI-generated items are even more scant: work at Gazi University in Turkey has included a study in which two ChatGPT-generated hypertension MCQs demonstrated acceptable discrimination (point-biserial correlations of 0.41 and 0.39) in a medical student examination (Kıyak et al., 2024), and another in which 15 general medical MCQs were generated, of which six (40%) demonstrated acceptable discrimination (point-biserial correlation \u0026gt;0.30) and ten (66%) were judged unsuitable for use (Coşkun et al., 2024). An examination of the appended items in Coskun et al. (2024) and Ngo et al. (2024) show items generated at low clinical complexity levels, testing knowledge and recall. There is a paucity of data on the ability of AI to generate high complexity clinical material emulating the advanced clinical reasoning expected of medical graduates.\u003c/p\u003e\n\u003cp\u003eComparative data evaluating AI-generated materials versus those authored by humans is also minimal. Cheung et al. (2023) performed a blinded expert analysis with fifty each of ChatGPT-generated versus human-generated MCQs, finding comparable average ratings of question appropriateness, clarity, relevance, discriminative power, and suitability for graduate-level examination. However, when compared head-to-head, human-generated items outperformed AI items on each scale in the majority of instances (60% of the time for cumulative rating score), with AI items showing wider variability on each metric.\u003c/p\u003e\n\u003cp\u003eAmong the factors that may explain this wide variability in output quality, besides the rapidly advancing technology that underpins the LLM functionality, is a lack of standardisation in input. The output quality of an LLM is critically dependent upon the \u0026lsquo;user prompt\u0026rsquo;, the specific instruction provided to direct its behaviour and generate desired outputs (Giray, 2023; Heston \u0026amp; Khun, 2023). \u0026lsquo;Prompt engineering\u0026rsquo; refers to the iterative process by which users of LLMs critically evaluate and craft their prompts to optimise responses. This process may involve serially refining prompts based on outputs; adding contextualising information such as target audience, key learning points, desired level of complexity, etc; and providing example content for the model to emulate. Specific reference to action verbs from Bloom\u0026rsquo;s taxonomy has also been recommended (Jacobsen \u0026amp; Weber, 2023). It is perhaps partly explanatory to note that the higher-quality outputs reported in the literature were generated using prompts that included a full example examination (Klang et al., 2023) or relevant chapters from two authoritative textbooks (Cheung et al., 2023), although lower-quality outputs were also derived from credible source material such as Continuing Medical Education articles from a top-tier specialty journal (Ayub et al., 2023). A thorough understanding of prompt engineering may help educators and curriculum designers produce quality relevant material, and as such it is likely a necessary skill for effectively leveraging this technology (Heston, 2023; Mesk\u0026oacute;, 2023).\u003c/p\u003e\n\u003cp\u003eFrom this survey of the available evidence, it is clear there is much to learn about the inherent capabilities, influencing factors, and optimisation processes of generative AI in producing medical MCQ materials. Nonetheless, the potential for AI to reduce the resource burden of item generation for medical faculties holds significant appeal, if its operationalisation can be optimised and its output validated.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003e\u003cem\u003eAims\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eEvaluate and compare the structural qualities of clinically complex MCQs authored by humans (both novice and expert constructors) versus Generative Pre-trained Transformer 4 (GPT-4)-generated items; and\u003c/li\u003e\n\u003cli\u003eEvaluate the comprehensiveness and articulated clinical reasoning of explanatory feedback on items generated by humans (both novice and expert) and GPT-4.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eResearch questions:\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eWhat is the structural quality of clinically complex MCQs generated by GPT-4?\u003c/li\u003e\n\u003cli\u003eCan GPT-4 articulate accurate explanatory feedback for clinically complex MCQs?\u003c/li\u003e\n\u003cli\u003eWhat points in the workflow of AI-assisted MCQ generation still necessitate human input?\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"METHODS","content":"\u003cp\u003eThis mixed-methods study was performed to critically evaluate the structural quality of AI-generated MCQs in comparison to human-generated items. AI items were prospectively generated, while human-authored items were retrospectively sourced from an existing content bank, as further detailed below.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eItem structure and test blueprint construction\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA single-best-answer MCQ format was employed. A complete item included a detailed contextual stem, a question, five options (with correct answer indicated), and explanatory text articulating the logic of correct versus incorrect options. As explained below, the explanatory text was evaluated separately from the other elements due to it being not universally included in standard MCQs.\u003c/p\u003e\n\u003cp\u003eA test blueprint was constructed to emulate a standard medical school examination at the level of the graduating student. Content areas included Medicine, Surgery, Paediatrics, Obstetrics, Gynaecology, Psychiatry, Population Health, and General Practice. 125 items were included, comprised of 40 from three sources, Novices, Experts, and AI. A surplus of 5 Expert human-generated items were included in the scoring process, as they were intended for future use in a mock examination where only satisfactory items would be included, and an element of redundancy was required for that purpose. An excerpt of the test blueprint is given in \u003cstrong\u003eAppendix 1\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDevelopment of a standardised scoring system\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA standardised scoring rubric was developed to facilitate consistent evaluation of human and AI-generated MCQs. This incorporated elements such as content validity (encompassing factual accuracy, fidelity, and realism), scope, correct item anatomy, specific item-writing flaws, and cognitive skill level. This rubric drew on established frameworks, including modified Bloom\u0026rsquo;s taxonomy and item-writing guidelines (Haladyna et al., 2002), and is presented as\u003cstrong\u003e Tables 1-3\u003c/strong\u003e. A global impression criterion was included as a proxy for whether the item was considered fit for use in a summative examination for graduating medical students. A separate secondary evaluation was undertaken to assess the quality of each item\u0026rsquo;s explanatory feedback text for comprehensiveness, veracity, and articulation of clinical reasoning.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1. Standardised scoring rubric of all MCQs\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" width=\"600\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eCORE ITEM ELEMENT\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003cp\u003e\u003cstrong\u003eScore key\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eContent validity: \u003c/strong\u003eThe item has content validity, being factually accurate and realistic to clinical practice\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003col\u003e\n\u003cli\u003eEntirely does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly meets criteria\u003c/li\u003e\n\u003cli\u003eEntirely meets criteria\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eWithin scope: \u003c/strong\u003eThe item tests concepts that are within scope for the target audience of a graduating medical student\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003col\u003e\n\u003cli\u003eEntirely does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly meets criteria\u003c/li\u003e\n\u003cli\u003eEntirely meets criteria\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eItem anatomy: \u003c/strong\u003eThe anatomy of the item is correct and complete\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003col\u003e\n\u003cli\u003eEntirely does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly meets criteria\u003c/li\u003e\n\u003cli\u003eEntirely meets criteria\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eItem-writing flaws (IWF): \u003c/strong\u003eHow many item-writing flaws are present? What type of flaws are present?*\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Content\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Style\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Formatting\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Stem\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Options\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Numeric count\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; Type of IWF also documented\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eCognitive skill level: \u003c/strong\u003eWhat is the cognitive skill level of the item?\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003cp\u003eUsing a modified Bloom\u0026rsquo;s taxonomy:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eLevel I: Remembering\u003c/li\u003e\n\u003cli\u003eLevel II: Understanding\u003c/li\u003e\n\u003cli\u003eLevel III: Applying, analyzing, evaluating, and creating\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eGlobal impression (structural):\u003c/strong\u003e Global impression of the stem, question, and options: This item is fit for use in a summative examination for graduating med student\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003col\u003e\n\u003cli\u003eNo (unsalvageable)\u003c/li\u003e\n\u003cli\u003eNo (major further editing)\u003c/li\u003e\n\u003cli\u003eYes (minor further editing)\u003c/li\u003e\n\u003cli\u003eYes (no further editing)\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eSCORING OF EXPLANATORY TEXT\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003cp\u003e\u003cstrong\u003eScore key\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eFeedback comprehensiveness: \u003c/strong\u003eThe feedback was appropriately comprehensive, addressing the correct option and distractors\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003col\u003e\n\u003cli\u003eEntirely does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly does not meet criteria\u003c/li\u003e\n\u003cli\u003eMostly meets criteria\u003c/li\u003e\n\u003cli\u003eEntirely meets criteria\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eFeedback veracity and clinical reasoning: \u003c/strong\u003eThe science and clinical reasoning in the feedback was satisfactory\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003cp\u003e\u003cstrong\u003e1.\u003c/strong\u003e\u0026nbsp; \u0026nbsp;\u0026nbsp;\u0026nbsp; Entirely does not meet criteria\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e2.\u003c/strong\u003e\u0026nbsp; \u0026nbsp;\u0026nbsp;\u0026nbsp; Mostly does not meet criteria\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.\u003c/strong\u003e\u0026nbsp; \u0026nbsp;\u0026nbsp;\u0026nbsp; Mostly meets criteria\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.\u003c/strong\u003e\u0026nbsp; \u0026nbsp;\u0026nbsp;\u0026nbsp; Entirely meets criteria\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"324\"\u003e\n\u003cp\u003e\u003cstrong\u003eGlobal impression (overall): \u003c/strong\u003eThis item, including its written feedback, is fit for use in a summative examination for graduating med student\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"276\"\u003e\n\u003col\u003e\n\u003cli\u003eNo (unsalvageable)\u003c/li\u003e\n\u003cli\u003eNo (major further editing)\u003c/li\u003e\n\u003cli\u003eYes (minor further editing)\u003c/li\u003e\n\u003cli\u003eYes (no further editing)\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e*\u003c/strong\u003e Referencing item-writing guidelines as laid out by (Haladyna et al., 2002).\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2. Modified Bloom\u0026rsquo;s taxonomy\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" width=\"599\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd width=\"132\"\u003e\n\u003cp\u003e\u003cstrong\u003eLevel\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"467\"\u003e\n\u003cp\u003e\u003cstrong\u003eCognitive domains\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"132\"\u003e\n\u003cp\u003e\u003cstrong\u003eLevel I\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"467\"\u003e\n\u003cp\u003eRemember (identifying and retrieving information)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"132\"\u003e\n\u003cp\u003e\u003cstrong\u003eLevel II\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"467\"\u003e\n\u003cp\u003eUnderstand (interpreting and summarizing information)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"132\"\u003e\n\u003cp\u003e\u003cstrong\u003eLevel III\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"467\"\u003e\n\u003cp\u003eApply, analyze, evaluate, and create (implementing, organizing, and critiquing information)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Table 3. Examples of item-writing guidelines (adapted from Haladyna et al., 2002).\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" width=\"600\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd width=\"600\"\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; \u003cstrong\u003eContent concerns\u003c/strong\u003e: Use novel material to test higher level learning. Paraphrase textbook language or language used during instruction when used in a test item to avoid testing for recall.\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; \u003cstrong\u003eFormatting\u003c/strong\u003e \u003cstrong\u003econcerns\u003c/strong\u003e: Format the item vertically instead of horizontally\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; \u003cstrong\u003eStyle\u003c/strong\u003e \u003cstrong\u003econcerns\u003c/strong\u003e: Use correct grammar, punctuation, capitalisation, and spelling\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; \u003cstrong\u003eWriting the stem\u003c/strong\u003e: Include the central idea in the stem instead of the choices; word the stem positively, avoid negatives such as NOT or EXCEPT.\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; \u003cstrong\u003eWriting the choices\u003c/strong\u003e: Place choices in a logical or numerical order; keep the length of choices about equal; avoid All-of-the-above; avoid giving clues to the right answer, such as pairs or triplets of options that clue the test-taker to the correct choice; make all distractors plausible.\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHuman-generated MCQs \u0026ndash; Novice and Expert\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA total of 85 human-generated MCQs were sourced from an existing Australian commercial medical education provider (eMedici2 Pty Ltd, Adelaide, Australia; \u003ca href=\"http://emedici.com/\"\u003ehttps://emedici.com\u003c/a\u003e). This content bank is derived from submissions by medical students and junior doctors, which pass through a pipeline of peer review, expert clinician review, and editorial approval prior to acceptance. Human authors are provided detailed written item-writing guidelines referencing style and item anatomy, at the time of item submission. Only items tagged as testing higher-order cognitive skills were included in this study. Each item was otherwise randomly selected from the content bank based on its recorded topic by two authors who were otherwise blinded to the content of the item. Of the 85 human items, 40 were written by a non-expert and had not passed through a peer review or other editorial process, and as such were deemed at \u0026lsquo;Novice\u0026rsquo; level of authorship, while 45 had been edited and/or approved by subject matter experts and were thus deemed \u0026lsquo;Expert\u0026rsquo; level. Subject areas were matched between groups.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI-generated MCQs\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGPT-4 was used in this study (model number: gpt-4-0125-preview) based on favourable reported performance against the Massive Multi-task Language Understanding benchmarks (OpenAI, 2023). A programmed script was used in GPT-4 with the prompt and key learning points - created as below - to generate outputs in an unsupervised fashion such that all authors were blinded to the GPT-4 outputs.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003ePrompt engineering\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eConstruction of a tailored prompt for GPT-4 took place across three reference group meetings by the six-panel team (the authors) who have broad educational, item-writing, clinical, and technical expertise. The aim was to develop a generic prompt template that maximised the potential of GPT-4 to produce structurally sound items testing higher order cognitive skills and could be easily adapted for a wide range of learning points or item topics with minimum subsequent human effort. The prompt was engineered incrementally, with each output assessed subjectively until the quality was deemed to be at the ceiling point. The prompt template included:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eInformation on the setting and the target audience of the MCQ;\u003c/li\u003e\n\u003cli\u003eThe inclusions and exclusions in the clinical stem to meet basic item anatomy requirements;\u003c/li\u003e\n\u003cli\u003eAdvice on avoidance of specific item-writing flaws, with instructions sourced from a full taxonomy of item-writing guidelines by Haladyna et al. (2002), as outlined in \u003cstrong\u003eTable 3;\u003c/strong\u003e\u003c/li\u003e\n\u003cli\u003eInstruction on the number of question options and distractors;\u003c/li\u003e\n\u003cli\u003eInstruction to produce explanatory feedback including clinical reasoning for the answer and distractors of the MCQ;\u003c/li\u003e\n\u003cli\u003eInstruction to include references to recent peer-reviewed articles;\u003c/li\u003e\n\u003cli\u003eFive examples of peer-reviewed, high quality MCQs covering a range of medical topics; and\u003c/li\u003e\n\u003cli\u003eA key learning point of the intended MCQ in the form of a factual statement, which included the question topic (in accordance with the test blueprint \u0026ndash; \u003cstrong\u003eAppendix 1\u003c/strong\u003e).\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cem\u003eVariability of GPT-4 outputs\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eAmong the input variables of the GPT-4 interface is \u0026lsquo;temperature\u0026rsquo;, which broadly determines the level of variety in subsequently generated text. This parameter ranges from 0 to 2, with a lower value resulting in more consistent outputs. Preliminary investigations have yet to identify the ideal temperature for medical MCQ generation (Agarwal et al., 2024), and it is likely to vary in different settings. To maximise reproducibility, we used a temperature of 0.0.\u003c/p\u003e\n\u003cp\u003eTo confirm the predictability of outputs at temperature 0.0, six learning points were used to generate three consecutive outputs without interval prompt modification. These 18 items were evaluated by a consensus panel of five authors against the scoring rubric, then independently reviewed by another author. The results of this variability testing are given in \u003cstrong\u003eAppendix 2\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eReferences\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eThe veracity of the references generated by GPT-4 to support the generated feedback, were also evaluated from the 18 \u0026lsquo;variability testing\u0026rsquo; items. These were evaluated against the criteria: \u003cem\u003e\u0026lsquo;The references included were real, relevant to the MCQ, formatted, and peer-reviewed.\u0026rsquo;\u003c/em\u003e The items were scored on a scale of 1 (entirely does not meet criteria), 2 (mostly does not meet criteria), 3 (mostly meets criteria), to 4 (entirely meets criteria). Specific inaccuracies in the references were documented.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eItem appraisal - consensus panel scoring\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll items, AI and human-generated, were pooled and then evaluated in random order by a consensus panel of five authors blinded to the origin of the item using the prespecified scoring rubric. One duplicate item was identified and excluded.\u003c/p\u003e\n\u003cp\u003eExamples of a novice, expert, and AI-generated item used in this study are presented in \u003cstrong\u003eAppendix 3\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eThe panel, by majority vote, also recorded their prediction for whether the item was authored by a novice, expert, or GPT-4.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis project received approval from the University of Adelaide Human Research Ethics Committee (HREC-2023-285).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThere was no identifiable data involved in this study. Mean scores for measures of item quality between author types were compared using ANOVA with post-hoc Bonferroni or Tamhane tests as appropriate (the latter was used where the largest variance was at double or higher compared to the lowest). The distribution of the global impression scores (including and excluding feedback) were tallied and presented as a percentage of items. The distribution of item-writing flaws and cognitive skill level were tallied by author type. Summary descriptions are provided for the frequency of correct answer identification and placement, and assessments of quality of referencing. A p value \u0026lt;0.05 was considered significant.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003e\u003cstrong\u003eItem quality evaluation - individual structural characteristics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe scoring of items based on their source (Novice, Expert, AI) is summarised in \u003cstrong\u003eTable 4\u003c/strong\u003e. One duplicate Novice item was identified and removed prior to data analysis.\u003c/p\u003e\n\u003cp\u003eFor human-authored MCQs, Expert items outperformed Novice items in all categories except for appropriateness of scope and the comprehensiveness of explanatory text, for which no difference was observed between any groups. Expert items modestly outperformed AI items in mean content validity scores (3.98 vs 3.73, p\u0026lt;0.001), cognitive skill level (2.58 vs 2.25, p\u0026lt;0.05), and clinical reasoning in feedback text (3.96 vs 3.65, p\u0026lt;0.001), while no significant difference was observed in item anatomy, scope, number of item-writing flaws, or feedback comprehensiveness. AI items modestly outperformed Novice items in item-writing flaws (0.80 vs 1.33, p\u0026lt;0.05); no significant difference was observed for all other parameters.\u003c/p\u003e\n\u003cp\u003eTable 4: Item quality scores by author group.\u003c/p\u003e\n\u003ctable border=\"1\" width=\"626\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e\u003cstrong\u003eExpert\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(n=45)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e\u003cstrong\u003eNovice\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(n=39)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e\u003cstrong\u003eAI\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(n=40)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003e\u003cstrong\u003eGroup differences\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"5\" width=\"626\"\u003e\n\u003cp\u003e\u003cstrong\u003eCORE ITEM ELEMENTS\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eContent validity\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(1-4)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.98 (0.15)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.33 (0.90)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e3.73 (0.60)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003eEN**, EA**\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eWithin scope\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(1-4)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.89 (0.38)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.92 (0.35)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e4 (0)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003en/s\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eItem anatomy\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(1-4)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e4 (0)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.79 (0.41)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e3.90 (0.30)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003eEN*\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eItem writing flaws\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(count)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e0.82 (0.91)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e1.33 (1.24)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e0.80 (1.04)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003eEN*, NA*\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eCognitive skill level\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(1-3)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e2.58 (0.54)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e2.26 (0.64)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e2.25 (0.54)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003eEN*, EA*\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"5\" width=\"626\"\u003e\n\u003cp\u003e\u003cstrong\u003eEXPLANATORY TEXT\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eComprehensiveness\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(1-4)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.96 (0.20)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.79 (0.52)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e3.85 (0.36)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003en/s\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"182\"\u003e\n\u003cp\u003e\u003cstrong\u003eVeracity and\u003cbr /\u003e \u0026nbsp;clinical reasoning\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(1-4)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.96 (0.21)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e3.41 (0.85)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"109\"\u003e\n\u003cp\u003e3.65 (0.58)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"116\"\u003e\n\u003cp\u003eEN**, EA**\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThe nature of any significant difference is given in the final column. \u0026lsquo;EN\u0026rsquo; denotes a statistically significant difference between Expert and Novice items, \u0026lsquo;EA\u0026rsquo; denotes a difference between Expert and AI items, and \u0026lsquo;NA\u0026rsquo; denotes a difference between Novice and AI items. All data are presented as mean (SD). Asterisks denote P\u0026lt;0.05 (*) or P\u0026lt;0.001(*). \u0026lsquo;n/s\u0026rsquo; denotes no significant difference.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eItem quality evaluation - global impressions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGlobal impression scores of structural quality for each item are summarised in \u003cstrong\u003eTable 5\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eOverall, excluding the explanatory text section, Expert items were most frequently deemed fit for purpose with or without minor edits (95%), followed by AI items (86%), then Novice items (61%). Only Novice items included some deemed unsalvageable (13%). Including the explanatory text section resulted in Expert items deemed most fit for purpose with or without minor edits (95%), followed by AI items (85%), then Novice items (59%).\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 5: Summary of global impression scores for item structure, by group. \u003c/strong\u003eScore counts are given as n (%), mean scores are given as mean (SD). Asterisks (**) denote a difference versus the other groups with P\u0026lt;0.001.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"1\" width=\"533\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"2\" width=\"107\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd colspan=\"5\" width=\"427\"\u003e\n\u003cp\u003e\u003cstrong\u003eGLOBAL IMPRESSION SCORE\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e\u003cstrong\u003e1\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e\u003cstrong\u003e2\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e\u003cstrong\u003e3\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e\u003cstrong\u003e4\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e\u003cstrong\u003eMean score\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eFit for purpose\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003eNo\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003eNo\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003eYes\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003eYes\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e\u0026ndash;\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eEdits required\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003eUnsalvageable\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003eMajor\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003eMinor\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003eNone\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e\u0026ndash;\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"6\" width=\"533\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"6\" width=\"533\"\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cu\u003eEXCLUDING\u003c/u\u003e\u003c/strong\u003e\u003cstrong\u003e EXPLANATORY TEXT SECTION\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eExpert\u003c/strong\u003e (n=45)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e0 (0%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e2 (4%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e19 (42%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e24 (53%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e3.49 (0.59)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eNovice \u003c/strong\u003e(n=39)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e5 (13%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e10 (26%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e13 (33%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e11 (28%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e2.77 (1.01)**\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eAI\u003c/strong\u003e (n=40)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e0 (0%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e6 (15%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e13 (33%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e21 (53%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e3.38 (0.74)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"6\" width=\"533\"\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cu\u003eINCLUDING\u003c/u\u003e\u003c/strong\u003e\u003cstrong\u003e EXPLANATORY TEXT SECTION\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eExpert\u003c/strong\u003e (n=45)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e0 (0%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e2 (4%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e19 (42%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e24 (53%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e3.49 (0.59)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eNovice \u003c/strong\u003e(n=39)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e5 (13%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e11 (28%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e14 (36%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e9 (23%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e2.69 (0.98)**\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"107\"\u003e\n\u003cp\u003e\u003cstrong\u003eAI\u003c/strong\u003e (n=40)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"110\"\u003e\n\u003cp\u003e0 (0%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e6 (15%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e14 (35%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"75\"\u003e\n\u003cp\u003e20 (50%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"93\"\u003e\n\u003cp\u003e3.35 (0.74)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhen averaged within groups, mean scores for global impression whether including or excluding explanatory text were similar between Expert and AI items. However, AI items outperformed Novice items in global scores both including and excluding explanatory text (3.38 vs 2.77, p\u0026lt;0.001; and 3.35 vs 2.69, p\u0026lt;0.001; respectively).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eItem-writing flaws (IWFs)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe number and type of IWFs per item across author groups is presented in \u003cstrong\u003eTable 6\u003c/strong\u003e. Though the overall rate of IWFs differed only slightly between Expert, Novice, and AI items (0.8 vs 1.3 vs 0.8, respectively, p\u0026lt;0.05), Novice items were least likely to have zero IWFs and most likely to have 3 or more. Most IWFs belonged to the category \u0026lsquo;Writing the choices\u0026rsquo; across all author groups, using the item-writing guidelines outlined in\u003cstrong\u003e Table 3\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 6: Summary of item-writing flaws between groups. \u003c/strong\u003eAll data are presented as counts unless otherwise specified.\u003c/p\u003e\n\u003ctable border=\"1\" width=\"453\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" width=\"187\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e\u003cstrong\u003eExpert \u003c/strong\u003e(n=45)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e\u003cstrong\u003eNovice \u003c/strong\u003e(n=39)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e\u003cstrong\u003eAI\u003cbr /\u003e \u0026nbsp;\u003c/strong\u003e(n=40)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"5\" width=\"453\"\u003e\n\u003cp\u003e\u003cstrong\u003eIWFs per item\u003c/strong\u003e, count distribution\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e21\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e10\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e21\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003e1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e13\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e16\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e10\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e9\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e7\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e6\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003e3\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e3\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003e4\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e1\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003e5\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"2\" width=\"187\"\u003e\n\u003cp\u003e\u003cstrong\u003eIWFs per item\u003c/strong\u003e, mean (SD)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e0.8 (0.91)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e1.3 (1.24)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e0.8 (1.04)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"5\" width=\"453\"\u003e\n\u003cp\u003e\u003cstrong\u003eIWFs by type\u003c/strong\u003e, count distribution\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003eContent concerns\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e8\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e14\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e10\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003eStyle concerns\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e5\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003eFormatting concerns\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003eWriting the stem\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e11\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e14\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"29\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"158\"\u003e\n\u003cp\u003eWriting the choices\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e18\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"89\"\u003e\n\u003cp\u003e19\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"88\"\u003e\n\u003cp\u003e30\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCognitive skill level\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 7\u003c/strong\u003e summarises the distribution of cognitive skill level across author groups. The majority of Expert items were at cognitive skill level III, compared to level II when for Novice or AI items. Novice items were most likely to be assigned a cognitive skill level of I.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eTable 7: Distribution of assigned cognitive skill levels vis a vis Bloom\u0026rsquo;s modified taxonomy.\u003c/strong\u003e Data presented as n (%).\u003c/p\u003e\n\u003ctable border=\"1\" width=\"431\"\u003e\n\u003ctbody\u003e\n\u003ctr style=\"height: 35.8173px;\"\u003e\n\u003ctd style=\"height: 70.8173px;\" rowspan=\"2\" width=\"117\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35.8173px;\" colspan=\"3\" width=\"314\"\u003e\n\u003cp\u003e\u003cstrong\u003eCOGNITIVE SKILL LEVEL\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr style=\"height: 35px;\"\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e\u003cstrong\u003eI\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e\u003cstrong\u003eII\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"105\"\u003e\n\u003cp\u003e\u003cstrong\u003eIII\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr style=\"height: 35px;\"\u003e\n\u003ctd style=\"height: 35px;\" width=\"117\"\u003e\n\u003cp\u003e\u003cstrong\u003eExpert \u003c/strong\u003e(n=45)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e1 (2%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e17 (38%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"105\"\u003e\n\u003cp\u003e27 (60%)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr style=\"height: 35px;\"\u003e\n\u003ctd style=\"height: 35px;\" width=\"117\"\u003e\n\u003cp\u003e\u003cstrong\u003eNovice \u003c/strong\u003e(n=39)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e4 (10%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e21 (54%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"105\"\u003e\n\u003cp\u003e14 (36%)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr style=\"height: 35px;\"\u003e\n\u003ctd style=\"height: 35px;\" width=\"117\"\u003e\n\u003cp\u003e\u003cstrong\u003eAI \u003c/strong\u003e(n=40)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e2 (5%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"104\"\u003e\n\u003cp\u003e26 (65%)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd style=\"height: 35px;\" width=\"105\"\u003e\n\u003cp\u003e12 (30%)\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eCorrect option veracity and placement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe correct answer was appropriately indicated (i.e. the option indicated to be correct by the author was corroborated by panel consensus) in 100% of Expert, 90% of Novice, and 85% of AI-generated items. Significant differences were observed in the distribution of the correct option placement, as shown in \u003cstrong\u003eTable 8\u003c/strong\u003e. Both AI-generated (45%, p=0.028) and Expert-authored (33%, p=0.002) items disproportionately positioned the correct answer as option C versus the Novice-authored items.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eTable 8: Distribution of correct option position. \u003c/strong\u003eAll data are presented as percentages.\u003c/p\u003e\n\u003ctable border=\"1\" width=\"431\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd rowspan=\"2\" width=\"114\"\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd colspan=\"5\" width=\"317\"\u003e\n\u003cp\u003e\u003cstrong\u003eOption position of correct answer\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e\u003cstrong\u003eA\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e\u003cstrong\u003eB\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e\u003cstrong\u003eC\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e\u003cstrong\u003eD\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"64\"\u003e\n\u003cp\u003e\u003cstrong\u003eE\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"114\"\u003e\n\u003cp\u003e\u003cstrong\u003eExpert \u003c/strong\u003e(n=45)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e9\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e20\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e33\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e29\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"64\"\u003e\n\u003cp\u003e7\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"114\"\u003e\n\u003cp\u003e\u003cstrong\u003eNovice \u003c/strong\u003e(n=39)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e15\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e33\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e13\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e26\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"64\"\u003e\n\u003cp\u003e13\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"114\"\u003e\n\u003cp\u003e\u003cstrong\u003eAI \u003c/strong\u003e(n=40)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e8\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e15\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e45\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"63\"\u003e\n\u003cp\u003e25\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"64\"\u003e\n\u003cp\u003e8\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eReferences\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe references generated by GPT-4 varied in quality. Of the 18 variability testing items\u0026rsquo; 52 references that were scored (\u003cstrong\u003eAppendix 2\u003c/strong\u003e), the average score against the stated criteria was 3.06 out of 4 (range 2-4), with six items scoring 2, five items scoring 3, and seven deemed perfect with a score of 4. The most common flaws were an incorrect DOI, the reference being an old edition of a guideline, or incorrect details, as summarised in \u003cstrong\u003eTable 9\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 9. Summary of types of errors in the references generated by GPT-4\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" width=\"216\"\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003e\u003cstrong\u003eFlaw in reference\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e\u003cstrong\u003eCount\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003eIncorrect DOI\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e7\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003eIncorrect URL\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003eOld edition\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e5\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003eIncorrect details\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e5\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003eNot peer reviewed\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e3\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"142\"\u003e\n\u003cp\u003eNon-existent\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"74\"\u003e\n\u003cp\u003e2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eThis study has added to the growing evidence base demonstrating that pre-trained generative AI systems can produce medical MCQs of broadly comparable structural quality to expert item-writers, with several important caveats relevant to educators interested in leveraging this technology. MCQs remain ubiquitous in medical education, and the resource-intensiveness of generating high quality items has prompted the exploration of new avenues of content generation. We present the most comprehensive evaluation to date of clinical medical MCQs generated by GPT-4. Taken together, these results suggest significant efficiency gains are feasible in MCQ creation via the conscientious integration of generative AI but underscore the ongoing necessity of human expert review as part of such a workflow to ensure quality, veracity, and fitness-for-purpose of AI-generated materials.\u003c/p\u003e\n\u003cp\u003eGranular appraisal of the intrinsic properties of MCQ items in this study yielded a rough hierarchy of quality from Novice-authored items (written by medical students or junior doctors and otherwise unedited), through AI-generated items (produced by GPT-4), to Expert items (written, edited, and/or approved by subject matter experts and experienced item-writers), with each successive group matching or outperforming the former on particular metrics. Regarding human authors, 39% of Novice items fell into a global impression category signifying either outright unsuitability or a need for major editing to achieve fitness for purpose, while 95% of Expert items were deemed fit for purpose requiring minor editing at most. This aligns with previously demonstrated distinctions between student-authored and expert-authored MCQs (Pham et al., 2023) and highlights that engaging medical trainees to author MCQs, while known to be beneficial for learning (Touissi et al., 2022), nonetheless necessitates a careful review process to guide item development to a usable standard.\u003c/p\u003e\n\u003cp\u003eThe interposition of AI capability between Novice and Expert human authors in our analysis is a novel finding, though perhaps intuitive to educators in light of the constellation of medical knowledge, clinical experience, and pedagogical training required to generate high-quality MCQs with practical relevance and verisimilitude. Existing comparative data is sparse but corroborative: the only other such study as yet found that expert authors tended to outperform AI on assessments of quality when MCQs on the same subject matter were compared head-to-head, having achieved comparable quality scores in aggregate (Cheung et al., 2023). As less direct context, GPT-4 has also been shown to summarise medical information into clinical synopses (Van Veen et al., 2024) or patient information sheets (Currie et al., 2023; Lockie \u0026amp; Choi, 2024; Verran, 2024) to standards that variably exceed, meet, or fall short of those by expert clinicians. Conversely, while AI has not been compared to student or trainee authors on the specific task of MCQ generation prior to our study, GPT-4\u0026rsquo;s ability to interpret clinical MCQ material has been repeatedly shown to rival or exceed that of medical students and trainees - cohorts corresponding to the Novice group in our dataset - in the profusion of reports in which it achieved passing grades in qualifying examinations around the world (Abbas et al., 2024; Knoedler et al., 2024; Maitland et al., 2024; Meyer et al., 2024; Rojas et al., 2024; Tanaka et al., 2024). Taken together, the available evidence loosely suggests that GPT-4\u0026rsquo;s proficiency in handling clinical MCQ materials falls somewhere toward the upper end of a range between students/trainees and experts.\u003c/p\u003e\n\u003cp\u003eFocussing on the AI-generated items in this study, multiple results are important to highlight in evaluating the applicability of this technology to medical MCQ creation. AI items demonstrated high quality in aggregate, with 85% rated as fit for use upfront with minor edits at most, 95% deemed to test higher-order cognitive skills (modified Bloom\u0026rsquo;s level II or III), and excellent average scores across content validity (3.7/4), scope (4/4), and item anatomy (3.9/4). Though prior non-comparative studies have reported widely variable AI-generated MCQ quality (Artsi et al., 2024), our results support the view that expert-standard items at a clinically complex content level are within the capability of GPT-4. This high-quality output is likely closely dependent upon the input of a detailed prompt invoking best-practice item-writing principles, the inclusion of high quality example MCQs, and a well-articulated learning point, which was a feature of this study and strongly endorsed in literature (Mesk\u0026oacute;, 2023). Development of such a prompt - especially one that encompasses local requirements for structure, content, and style - necessitates human input at the outset of an AI-based workflow. A generic prompt that relies on the encoded clinical knowledge in the LLM minimises the human input required at subsequent stages of MCQ design. In this study, the provision of a factually verified key learning point by the human superseded the need to provide GPT-4 with any lengthy reference texts.\u003c/p\u003e\n\u003cp\u003eHowever, it is also crucial to consider not just the average quality across an entire exam, but outputs on a per-item basis in a safety-critical field like healthcare where high minimum standards of quality and veracity are necessary for all individual assessment items (Giuffr\u0026egrave; et al., 2024). In this regard, it must be emphasised that\u003cem\u003e\u0026nbsp;\u003c/em\u003e1 in 7 AI-generated items were deemed unfit for use without major edits (versus 1 in 25 by human experts), 1 in 7 indicated an erroneous correct answer, and almost half had correct answers default positioned as option C. Any of these issues would severely compromise assessment validity if left unchecked, and their presence indicates the vital necessity of incorporating human expert review prior to deployment of AI-generated items. Some of these issues may arise from technical limitations of GPT-4 capability, but it is also very likely that the AI engine is recapitulating the shortcomings of its training data: for example, a pervasive \u0026lsquo;middle bias\u0026rsquo; is known to exist in MCQ assessments (Attali \u0026amp; Bar‐Hillel, 2003), in which correct options are disproportionately clustered in middle positions (e.g. option C in an A-E structure), and it is likely that this tendency has been encoded into the engine via its training data \u0026ndash; along with other more pernicious biases of which educators must be cognisant (Zack et al., 2024).\u003c/p\u003e\n\u003cp\u003eThough detailed technical inquiry into LLM clinical reasoning was beyond our scope (readers are directed to Li\u0026eacute;vin et al. (2024), this study demonstrated GPT-4\u0026rsquo;s capability to generate cogent and cohesive explanatory text that, while marginally outperformed by human experts on ratings of veracity and articulated reasoning, ultimately equalled experts and exceeded novices on global impression. The complete fabrication of references by GPT-4 in this study also echoes known issues with LLM hallucination. This contrasts to prior literature in which 32-76% of generated MCQ explanations were deemed valid (Agarwal et al., 2023; Choi, 2023; Ngo et al., 2024), and may reflect interval improvement from previous versions of the GPT engine. Explanatory feedback to MCQs may enhance the acquisition or consolidation of contextualised knowledge and clinical reasoning for learners, functioning as a \u0026lsquo;virtual teaching assistant\u0026rsquo;. GPT-4 represents a highly efficient potential avenue to generate such material.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003eLIMITATIONS\u003c/h3\u003e\n\u003cp\u003e●\u0026nbsp; \u0026nbsp;\u0026nbsp; \u0026nbsp;This mixed-methods study used a combination of pre-existing (Novice and Expert) and prospectively generated (AI) items, which introduces a lack of standardisation between instructions given to human authors and the GPT-4 prompt that may have influenced the respective item quality.\u003cbr\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp; \u0026nbsp;\u0026nbsp; \u0026nbsp;GPT-4 \u0026lsquo;temperature\u0026rsquo;: Preliminary investigations have yet to identify the ideal temperature for medical MCQ generation (Agarwal et al., 2024), and it is likely to vary in different settings. A temperature of 0.0 was selected in this study to maximise external validity and reproducibility but may potentially have constrained output quality.\u003cbr\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp; \u0026nbsp;\u0026nbsp; \u0026nbsp;Cognitive level and IWFs were considered contributory to item quality, based on standard item-writing guidance, however impact on discriminatory power is empirically inconsistent (Tarrant \u0026amp; Ware, 2008; Caldwell \u0026amp; Pate, 2013; Ali \u0026amp; Ruit, 2015; Pais et al., 2016; Rush et al., 2016; Pham et al., 2018).\u003cbr\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e●\u0026nbsp; \u0026nbsp;\u0026nbsp; \u0026nbsp;This study explicitly focussed on items requiring higher-order cognitive processing. GPT-4 was recently shown to make the most errors with lower-order reasoning in a set of psychiatric MCQs (Herrmann-Werner et al., 2024), and the output quality in this study is therefore not generalisable to the creation of medical MCQs targeting these lower-order processes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFUTURE DIRECTIONS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe results of an evaluation of the psychometric properties of AI-generated MCQs will be reported by the authors in a future article. Further areas of study include exploring the quality of outputs using other LLMs, exploring the role of clinical images in AI-assisted design of MCQs, and evaluating the educational value of interacting with AI-generated explanatory feedback on student learning. More sophisticated prompt engineering with automatic provision of reference texts could also be explored, as well as investigating fine-tuning processes of LLMs with large volumes of high quality MCQs.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003eIn summary, this study suggests that while human experts most reliably produce superior-quality complex clinical MCQs, GPT-4 is capable of producing comparable items in most instances and outperforms human novices in this task. An AI-integrated workflow for creating such items would still necessitate direct human expert input at multiple points, including prompt engineering in accordance with item-writing guidelines and local pedagogy, formulation of curriculum-specific key learning points, and validation of subsequent outputs with editing as required.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eProject contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHW and SCK conceived the project idea.\u003c/p\u003e\n\u003cp\u003eHW and EP designed the study.\u003c/p\u003e\n\u003cp\u003eDL constructed the ethics application, which HW contributed to.\u003c/p\u003e\n\u003cp\u003eHW, DL, and SCK conducted the literature review.\u003c/p\u003e\n\u003cp\u003eAll authors contributed to the design of the standardised scoring rubric, prompt engineering, and to data collection.\u003c/p\u003e\n\u003cp\u003eHW, PD, and TZ prepared the items for scoring.\u003c/p\u003e\n\u003cp\u003eEP conducted the statistical analysis.\u003c/p\u003e\n\u003cp\u003eHW and SCK drafted the manuscript, which all authors then contributed to and provided critical review. The tables were prepared by both HW and SCK.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of interest statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAuthors HW, TZ, SCK, and PD are co-directors of eMedici, a commercial medical education platform. DL and EP have no conflicts of interest to declare. The authors did not receive support from any organization for the submitted work.\u0026nbsp;\u003cbr\u003e\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAbbas, A., Rehman, M. S., \u0026amp; Rehman, S. S. (2024). Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. \u003cem\u003eCureus\u003c/em\u003e,\u003cem\u003e 16\u003c/em\u003e(3). https://doi.org/10.7759/cureus.55991\u003c/li\u003e\n\u003cli\u003eAbd-Alrazaq, A., AlSaad, R., Alhuwail, D., Ahmed, A., Healy, P. M., Latifi, S., Aziz, S., Damseh, R., Alrazak, S. A., \u0026amp; Sheikh, J. (2023). Large language models in medical education: opportunities, challenges, and future directions. \u003cem\u003eJMIR Medical Education\u003c/em\u003e,\u003cem\u003e 9\u003c/em\u003e(1), e48291. https://doi.org/10.2196/48291\u003c/li\u003e\n\u003cli\u003eAbozaid, H., Park, Y. S., \u0026amp; Tekian, A. (2017). Peer review improves psychometric characteristics of multiple choice questions. \u003cem\u003eMedical teacher\u003c/em\u003e,\u003cem\u003e 39\u003c/em\u003e(sup1), S50-S54. https://doi.org/10.1080/0142159X.2016.1254743\u003c/li\u003e\n\u003cli\u003eAgarwal, A., Mittal, K., Doyle, A., Sridhar, P., Wan, Z., Doughty, J. A., Savelka, J., \u0026amp; Sakr, M. (2024). Understanding the Role of Temperature in Diverse Question Generation by GPT-4. Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 2,\u003c/li\u003e\n\u003cli\u003eAgarwal, M., Goswami, A., \u0026amp; Sharma, P. (2023). Evaluating ChatGPT-3.5 and Claude-2 in answering and explaining conceptual medical physiology multiple-choice questions. \u003cem\u003eCureus\u003c/em\u003e,\u003cem\u003e 15\u003c/em\u003e(9). https://doi.org/10.7759/cureus.46222\u003c/li\u003e\n\u003cli\u003eAli, S. H., \u0026amp; Ruit, K. G. (2015). The Impact of item flaws, testing at low cognitive level, and low distractor functioning on multiple-choice question quality. \u003cem\u003ePerspectives on medical education\u003c/em\u003e,\u003cem\u003e 4\u003c/em\u003e, 244-251. https://doi.org/10.1007/s40037-015-0212-x\u003c/li\u003e\n\u003cli\u003eArtsi, Y., Sorin, V., Konen, E., Glicksberg, B. S., Nadkarni, G., \u0026amp; Klang, E. (2024). Large language models for generating medical examinations: systematic review. \u003cem\u003eBMC Medical Education\u003c/em\u003e,\u003cem\u003e 24\u003c/em\u003e(1), 354. https://doi.org/10.1186/s12909-024-05239-y\u003c/li\u003e\n\u003cli\u003eAttali, Y., \u0026amp; Bar‐Hillel, M. (2003). Guess where: The position of correct answers in multiple‐choice test items as a psychometric variable. \u003cem\u003eJournal of Educational Measurement\u003c/em\u003e,\u003cem\u003e 40\u003c/em\u003e(2), 109-128. https://doi.org/10.1111/j.1745-3984.2003.tb01099.x\u003c/li\u003e\n\u003cli\u003eAyub, I., Hamann, D., Hamann, C. R., \u0026amp; Davis, M. J. (2023). Exploring the potential and limitations of chat generative pre-trained transformer (ChatGPT) in generating board-style dermatology questions: a qualitative analysis. \u003cem\u003eCureus\u003c/em\u003e,\u003cem\u003e 15\u003c/em\u003e(8).\u003c/li\u003e\n\u003cli\u003eBen\u0026iacute;tez, T. M., Xu, Y., Boudreau, J. D., Kow, A. W. C., Bello, F., Van Phuoc, L., Wang, X., Sun, X., Leung, G. K.-K., \u0026amp; Lan, Y. (2024). Harnessing the potential of large language models in medical education: promise and pitfalls. \u003cem\u003eJournal of the American Medical Informatics Association\u003c/em\u003e,\u003cem\u003e 31\u003c/em\u003e(3), 776-783. https://academic.oup.com/jamia/article-abstract/31/3/776/7588721?redirectedFrom=fulltext\u003c/li\u003e\n\u003cli\u003eCaldwell, D. J., \u0026amp; Pate, A. N. (2013). Effects of question formats on student and item performance. \u003cem\u003eAmerican journal of pharmaceutical education\u003c/em\u003e,\u003cem\u003e 77\u003c/em\u003e(4), 71. https://doi.org/10.5688/ajpe77471\u003c/li\u003e\n\u003cli\u003eCheung, B. H. H., Lau, G. K. K., Wong, G. T. C., Lee, E. Y. P., Kulkarni, D., Seow, C. S., Wong, R., \u0026amp; Co, M. T.-H. (2023). ChatGPT versus human in generating medical graduate exam multiple choice questions\u0026mdash;A multinational prospective study (Hong Kong SAR, Singapore, Ireland, and the United Kingdom). \u003cem\u003ePLoS ONE\u003c/em\u003e,\u003cem\u003e 18\u003c/em\u003e(8), e0290691. https://doi.org/10.1371/journal.pone.0290691\u003c/li\u003e\n\u003cli\u003eChoi, W. (2023). Assessment of the capacity of ChatGPT as a self-learning tool in medical pharmacology: a study using MCQs. \u003cem\u003eBMC Medical Education\u003c/em\u003e,\u003cem\u003e 23\u003c/em\u003e(1), 864. https://doi.org/10.1186/s12909-023-04832-x\u003c/li\u003e\n\u003cli\u003eCoşkun, \u0026Ouml;., Kıyak, Y. S., \u0026amp; Budakoğlu, I. İ. (2024). ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. \u003cem\u003eMedical teacher\u003c/em\u003e, 1-7. https://doi.org/10.1080/0142159X.2024.2327477\u003c/li\u003e\n\u003cli\u003eCurrie, G., Robbie, S., \u0026amp; Tually, P. (2023). ChatGPT and patient information in nuclear medicine: GPT-3.5 versus GPT-4. \u003cem\u003eJournal of Nuclear Medicine Technology\u003c/em\u003e,\u003cem\u003e 51\u003c/em\u003e(4), 307-313. https://doi.org/10.2967/jnmt.123.266151\u003c/li\u003e\n\u003cli\u003eGiray, L. (2023). Prompt engineering with ChatGPT: a guide for academic writers. \u003cem\u003eAnnals of biomedical engineering\u003c/em\u003e,\u003cem\u003e 51\u003c/em\u003e(12), 2629-2633. https://doi.org/10.1007/s10439-023-03272-4\u003c/li\u003e\n\u003cli\u003eGiuffr\u0026egrave;, M., You, K., \u0026amp; Shung, D. L. (2024). Evaluating ChatGPT in medical contexts: the imperative to guard against hallucinations and partial accuracies. \u003cem\u003eClinical Gastroenterology and Hepatology\u003c/em\u003e,\u003cem\u003e 22\u003c/em\u003e(5), 1145-1146. https://www.cghjournal.org/article/S1542-3565(23)00835-2/pdf\u003c/li\u003e\n\u003cli\u003eGupta, P., Meena, P., Khan, A. M., Malhotra, R. K., \u0026amp; Singh, T. (2020). Effect of faculty training on quality of multiple-choice questions. \u003cem\u003eInternational Journal of Applied and Basic Medical Research\u003c/em\u003e,\u003cem\u003e 10\u003c/em\u003e(3), 210-214. https://doi.org/10.4103/ijabmr.IJABMR_30_20\u003c/li\u003e\n\u003cli\u003eHaladyna, T. M., \u0026amp; Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. \u003cem\u003eApplied measurement in education\u003c/em\u003e,\u003cem\u003e 2\u003c/em\u003e(1), 51-78. https://doi.org/10.1207/s15324818ame0201_4\u003c/li\u003e\n\u003cli\u003eHaladyna, T. M., Downing, S. M., \u0026amp; Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. \u003cem\u003eApplied measurement in education\u003c/em\u003e,\u003cem\u003e 15\u003c/em\u003e(3), 309-333. https://doi.org/10.1207/S15324818AME1503_5\u003c/li\u003e\n\u003cli\u003eHerrmann-Werner, A., Festl-Wietek, T., Holderried, F., Herschbach, L., Griewatz, J., Masters, K., Zipfel, S., \u0026amp; Mahling, M. (2024). Assessing ChatGPT\u0026rsquo;s mastery of Bloom\u0026rsquo;s taxonomy using psychosomatic medicine exam questions: mixed-methods study. \u003cem\u003eJournal of medical Internet research\u003c/em\u003e,\u003cem\u003e 26\u003c/em\u003e, e52113. https://doi.org/10.2196/52113\u003c/li\u003e\n\u003cli\u003eHeston, T. F. (2023). Prompt engineering for students of medicine and their teachers. \u003cem\u003earXiv preprint arXiv:2308.11628\u003c/em\u003e.\u003c/li\u003e\n\u003cli\u003eHeston, T. F., \u0026amp; Khun, C. (2023). Prompt engineering in medical education. \u003cem\u003eInternational Medical Education\u003c/em\u003e,\u003cem\u003e 2\u003c/em\u003e(3), 198-205. https://doi.org/10.3390/ime2030019\u003c/li\u003e\n\u003cli\u003eHift, R. J. (2014). Should essays and other \u0026ldquo;open-ended\u0026rdquo;-type questions retain a place in written summative assessment in clinical medicine? \u003cem\u003eBMC Medical Education\u003c/em\u003e,\u003cem\u003e 14\u003c/em\u003e, 1-18. https://doi.org/10.1186/s12909-014-0249-2\u003c/li\u003e\n\u003cli\u003eJacobsen, L. J., \u0026amp; Weber, K. E. (2023). The promises and pitfalls of ChatGPT as a feedback provider in higher education: An exploratory study of prompt engineering and the quality of AI-driven feedback. https://doi.org/10.31219/osf.io/cr257\u003c/li\u003e\n\u003cli\u003eJobs, A., Twesten, C., G\u0026ouml;bel, A., Bonnemeier, H., Lehnert, H., \u0026amp; Weitz, G. (2013). Question-writing as a learning tool for students\u0026ndash;outcomes from curricular exams. \u003cem\u003eBMC Medical Education\u003c/em\u003e,\u003cem\u003e 13\u003c/em\u003e, 1-7. https://doi.org/10.1186/1472-6920-13-89\u003c/li\u003e\n\u003cli\u003eKıyak, Y. S., Coşkun, \u0026Ouml;., Budakoğlu, I. İ., \u0026amp; Uluoğlu, C. (2024). ChatGPT for generating multiple-choice questions: evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. \u003cem\u003eEuropean journal of clinical pharmacology\u003c/em\u003e,\u003cem\u003e 80\u003c/em\u003e(5), 729-735. https://doi.org/10.1007/s00228-024-03649-x\u003c/li\u003e\n\u003cli\u003eKlang, E., Portugez, S., Gross, R., Brenner, A., Gilboa, M., Ortal, T., Ron, S., Robinzon, V., Meiri, H., \u0026amp; Segal, G. (2023). Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. \u003cem\u003eBMC Medical Education\u003c/em\u003e,\u003cem\u003e 23\u003c/em\u003e. https://doi.org/10.1186/s12909-023-04752-w\u003c/li\u003e\n\u003cli\u003eKnoedler, L., Alfertshofer, M., Knoedler, S., Hoch, C. C., Funk, P. F., Cotofana, S., Maheta, B., Frank, K., Br\u0026eacute;bant, V., \u0026amp; Prantl, L. (2024). Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis. \u003cem\u003eJMIR Medical Education\u003c/em\u003e,\u003cem\u003e 10\u003c/em\u003e(1), e51148. https://doi.org/10.2196/51148\u003c/li\u003e\n\u003cli\u003eLi\u0026eacute;vin, V., Hother, C. E., Motzfeldt, A. G., \u0026amp; Winther, O. (2024). Can large language models reason about medical questions? \u003cem\u003ePatterns\u003c/em\u003e,\u003cem\u003e 5\u003c/em\u003e(3).\u003c/li\u003e\n\u003cli\u003eLockie, E., \u0026amp; Choi, J. (2024). Evaluation of a chat GPT generated patient information leaflet about laparoscopic cholecystectomy. \u003cem\u003eANZ Journal of Surgery\u003c/em\u003e,\u003cem\u003e 94\u003c/em\u003e(3), 353-355. https://doi.org/10.1111/ans.18834\u003c/li\u003e\n\u003cli\u003eMaitland, A., Fowkes, R., \u0026amp; Maitland, S. (2024). Can ChatGPT pass the MRCP (UK) written examinations? Analysis of performance and errors using a clinical decision-reasoning framework. \u003cem\u003eBMJ open\u003c/em\u003e,\u003cem\u003e 14\u003c/em\u003e(3), e080558. https://doi.org/10.1136/bmjopen-2023-080558\u003c/li\u003e\n\u003cli\u003eMesk\u0026oacute;, B. (2023). Prompt engineering as an important emerging skill for medical professionals: tutorial. \u003cem\u003eJournal of medical Internet research\u003c/em\u003e,\u003cem\u003e 25\u003c/em\u003e, e50638.\u003c/li\u003e\n\u003cli\u003eMeyer, A., Riese, J., \u0026amp; Streichert, T. (2024). Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. \u003cem\u003eJMIR Medical Education\u003c/em\u003e,\u003cem\u003e 10\u003c/em\u003e, e50965. https://doi.org/10.2196/50965\u003c/li\u003e\n\u003cli\u003eNaeem, N., van der Vleuten, C., \u0026amp; Alfaris, E. A. (2012). Faculty development on item writing substantially improves item quality. \u003cem\u003eAdvances in health sciences education\u003c/em\u003e,\u003cem\u003e 17\u003c/em\u003e, 369-376. https://doi.org/10.1007/s10459-011-9315-2\u003c/li\u003e\n\u003cli\u003eNgo, A., Gupta, S., Perrine, O., Reddy, R., Ershadi, S., \u0026amp; Remick, D. (2024). ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. \u003cem\u003eAcademic Pathology\u003c/em\u003e,\u003cem\u003e 11\u003c/em\u003e(1), 100099. https://www.academicpathologyjournal.org/article/S2374-2895(23)00031-3/pdf\u003c/li\u003e\n\u003cli\u003eOpenAI. (2023). Gpt-4 technical report. \u003cem\u003earXiv preprint arXiv:2303.08774\u003c/em\u003e.\u003c/li\u003e\n\u003cli\u003ePais, J., Silva, A., Guimar\u0026atilde;es, B., Povo, A., Coelho, E., Silva-Pereira, F., Lourinho, I., Ferreira, M. A., \u0026amp; Severo, M. (2016). Do item-writing flaws reduce examinations psychometric quality? \u003cem\u003eBMC Research Notes\u003c/em\u003e,\u003cem\u003e 9\u003c/em\u003e, 1-7. https://doi.org/10.1186/s13104-016-2202-4\u003c/li\u003e\n\u003cli\u003ePalmer, E. J., Duggan, P., Devitt, P. G., \u0026amp; Russell, R. (2010). The modified essay question: its exit from the exit examination? \u003cem\u003eMedical teacher\u003c/em\u003e,\u003cem\u003e 32\u003c/em\u003e(7), e300-e307. https://doi.org/10.3109/0142159X.2010.488705\u003c/li\u003e\n\u003cli\u003ePapinczak, T., Peterson, R., Babri, A. S., Ward, K., Kippers, V., \u0026amp; Wilkinson, D. (2012). Using student-generated questions for student-centred assessment. \u003cem\u003eAssessment \u0026amp; Evaluation in Higher Education\u003c/em\u003e,\u003cem\u003e 37\u003c/em\u003e(4), 439-452. https://doi.org/10.1080/02602938.2010.538666\u003c/li\u003e\n\u003cli\u003ePham, H., Besanko, J., \u0026amp; Devitt, P. (2018). Examining the impact of specific types of item-writing flaws on student performance and psychometric properties of the multiple choice question. \u003cem\u003eMedEdPublish\u003c/em\u003e,\u003cem\u003e 7\u003c/em\u003e. https://doi.org/10.15694/mep.2018.0000225.1\u003c/li\u003e\n\u003cli\u003ePham, H., Court-Kowalski, S., Chan, H., \u0026amp; Devitt, P. (2023). Writing Multiple Choice Questions\u0026mdash;Has the Student Become the Master? \u003cem\u003eTeaching and Learning in Medicine\u003c/em\u003e,\u003cem\u003e 35\u003c/em\u003e(3), 356-367. https://doi.org/10.1080/10401334.2022.2050240\u003c/li\u003e\n\u003cli\u003eRojas, M., Rojas, M., Burgess, V., Toro-P\u0026eacute;rez, J., \u0026amp; Salehi, S. (2024). Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study. \u003cem\u003eJMIR Medical Education\u003c/em\u003e,\u003cem\u003e 10\u003c/em\u003e, e55048. https://doi.org/10.2196/55048\u003c/li\u003e\n\u003cli\u003eRush, B. R., Rankin, D. C., \u0026amp; White, B. J. (2016). The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. \u003cem\u003eBMC Medical Education\u003c/em\u003e,\u003cem\u003e 16\u003c/em\u003e, 1-10. https://doi.org/10.1186/s12909-016-0773-3\u003c/li\u003e\n\u003cli\u003eSchuwirth, L. W., \u0026amp; Van Der Vleuten, C. P. (2003). Written assessment.(ABC of learning and teaching in medicine). \u003cem\u003eBritish Medical Journal\u003c/em\u003e,\u003cem\u003e 326\u003c/em\u003e(7390), 643-646. https://doi.org/10.1136/bmj.326.7390.643\u003c/li\u003e\n\u003cli\u003eShah, M. P., Lin, B. R., Lee, M., Kahn, D., \u0026amp; Hernandez, E. (2019). Student-written multiple-choice questions\u0026mdash;a practical and educational approach. \u003cem\u003eMedical Science Educator\u003c/em\u003e,\u003cem\u003e 29\u003c/em\u003e, 41-43. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8368101/pdf/40670_2018_Article_646.pdf\u003c/li\u003e\n\u003cli\u003eSmeby, S. S., Lillebo, B., Gynnild, V., Samstad, E., Standal, R., Knobel, H., Vik, A., \u0026amp; Sl\u0026oslash;rdahl, T. S. (2019). Improving assessment quality in professional higher education: Could external peer review of items be the answer? \u003cem\u003eCogent Medicine\u003c/em\u003e,\u003cem\u003e 6\u003c/em\u003e(1), 1659746. https://doi.org/10.1080/2331205X.2019.1659746\u003c/li\u003e\n\u003cli\u003eTanaka, Y., Nakata, T., Aiga, K., Etani, T., Muramatsu, R., Katagiri, S., Kawai, H., Higashino, F., Enomoto, M., \u0026amp; Noda, M. (2024). Performance of generative pretrained transformer on the national medical licensing examination in Japan. \u003cem\u003ePLOS Digital Health\u003c/em\u003e,\u003cem\u003e 3\u003c/em\u003e(1), e0000433. https://doi.org/10.1371/journal.pdig.0000433\u003c/li\u003e\n\u003cli\u003eTarrant, M., \u0026amp; Ware, J. (2008). Impact of item‐writing flaws in multiple‐choice questions on student achievement in high‐stakes nursing assessments. \u003cem\u003eMedical Education\u003c/em\u003e,\u003cem\u003e 42\u003c/em\u003e(2), 198-206. https://doi.org/10.1111/j.1365-2923.2007.02957.x\u003c/li\u003e\n\u003cli\u003eTouissi, Y., Hjiej, G., Hajjioui, A., Ibrahimi, A., \u0026amp; Fourtassi, M. (2022). Does developing multiple-choice questions improve medical students\u0026rsquo; learning? A systematic review. \u003cem\u003eMedical Education Online\u003c/em\u003e,\u003cem\u003e 27\u003c/em\u003e(1), 2005505. https://doi.org/10.1080/10872981.2021.2005505\u003c/li\u003e\n\u003cli\u003eVan Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J.-B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., \u0026amp; Seehofnerov\u0026aacute;, A. (2024). Adapted large language models can outperform medical experts in clinical text summarization. \u003cem\u003eNature medicine\u003c/em\u003e,\u003cem\u003e 30\u003c/em\u003e(4), 1134-1142. https://doi.org/10.1038/s41591-024-02855-5\u003c/li\u003e\n\u003cli\u003eVerran, C. (2024). Artificial intelligence-generated patient information leaflets: a comparison of contents according to British Association of Dermatologists standards. \u003cem\u003eClinical and Experimental Dermatology\u003c/em\u003e, llad461. https://doi.org/10.1093/ced/llad461\u003c/li\u003e\n\u003cli\u003eZack, T., Lehman, E., Suzgun, M., Rodriguez, J. A., Celi, L. A., Gichoya, J., Jurafsky, D., Szolovits, P., Bates, D. W., \u0026amp; Abdulnour, R.-E. E. (2024). Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. \u003cem\u003eThe Lancet Digital Health\u003c/em\u003e,\u003cem\u003e 6\u003c/em\u003e(1), e12-e22. https://doi.org/10.1016/S2589-7500(23)00225-X (Erratum in: Lancet Digit Health. 2024 Jul;6(7):e445. doi: 10.1016/S2589-7500(24)00120-1)\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Multiple choice question, higher order cognitive skills, Bloom’s taxonomy, artificial intelligence, prompt engineering, ChatGPT","lastPublishedDoi":"10.21203/rs.3.rs-4831476/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4831476/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eMCQs are a popular assessment format in medical education. Creating clinically complex MCQs can be a time-consuming task for subject matter experts. Large language models such as GPT-4, a type of generative artificial intelligence (AI), are a potential tool for MCQ design.\u003c/p\u003e \u003cp\u003eClinically complex human-generated MCQs, at both novice and expert level, were compared with AI MCQs. A generic prompt for GPT-4 was engineered, which included item-writing guidance, example MCQs, and key learning points. A standardised scoring system was developed for a consensus panel to objectively evaluate each item, blinded to the author, on categories including content validity, scope, item anatomy, cognitive skill level, item-writing flaws (IWFs), feedback comprehensiveness, veracity, adequacy of clinical reasoning, and global impression of fitness for use.\u003c/p\u003e \u003cp\u003eAnalysis showed that all groups (novice, expert, and AI) were able generate items within scope. Expert items performed better than Novice items in all categories. Expert items performed better than AI in content validity, feedback veracity and clinical reasoning. They also tended to test higher order cognitive skills. There was no difference in the global impressions of Expert and AI items, which suggests they may be comparable overall.\u003c/p\u003e \u003cp\u003eWith adequate prompt engineering, GPT-4 can produce MCQs testing clinically complex concepts for medical assessment. The quality of AI outputs is comparable to experts, however human validation is necessary to ensure content validity. The AI-generated explanatory feedback was adequate in veracity and clinical reasoning, which may serve as an educational tool for learners.\u003c/p\u003e","manuscriptTitle":"GPT-4 versus human authors in clinically complex MCQ creation: a blinded analysis of item quality","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-21 05:55:27","doi":"10.21203/rs.3.rs-4831476/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"77d99607-87f4-4e08-80e5-732d9817ce54","owner":[],"postedDate":"August 21st, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-09-04T11:28:04+00:00","versionOfRecord":[],"versionCreatedAt":"2024-08-21 05:55:27","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4831476","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4831476","identity":"rs-4831476","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.