Performance of OpenAI’s GPT-4 in a mock MRCS Part A Examination

doi:10.21203/rs.3.rs-4003159/v1

2024 · doi:10.21203/rs.3.rs-4003159/v1

preprint OA: closed CC-BY-4.0

Research Article · Research Square

Ibrahim Inzarul Haq, Siddarth Raj, Ali Ridha, Arun O'Sullivan, Imran Ahmed, Farhan Syed, Chetan Khatri

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4003159/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Introduction: GPT-4 (Generative Pre-trained Transformer 4), OpenAI’s latest Large Language Model (LLM), has demonstrated proficiency against professional examination standards such as the USMLE (United States Medical Licensing Examination), the FRCS (Fellowship of the Royal Colleges of Surgeons) and the United States Bar. However, GPT-4’s capability with the MRCS (Membership of the Royal College of Surgeons) Part A has not yet been investigated.
Methodology: A representative MRCS Part A examination, prepared and provided by “TeachMeSurgery” and based on the MRCS Intercollegiate Curriculum, was used to assess GPT-4's performance. Each question was processed via the web-based interface of ChatGPT (Chat Generative Pre-trained Transformer) Plus.

Results: GPT-4 scored 87.2% on Applied Basic Sciences (157/180 questions) and 86.7% on Principles of Surgery in General (104/120 questions), achieving an overall score of 261/300 (87%), which is above the typical passing threshold. GPT-4 scored 100% in four of the eleven predefined curriculum areas: Pharmacology, Microbiology, Data Interpretation and Audit, and The Surgical Care of Children. GPT-4’s weakest performance was in the Medico-Legal Aspects of Surgical Practice, in which it scored 33.3%.

Conclusion: GPT-4 passed the mock MRCS Part A without any specialised preparatory training. Further research could examine how GPT-4 might be integrated into a trainee surgeon’s learning and used as a tool to support high-quality, efficient patient care.

Introduction

Generative Pre-trained Transformer 4 (GPT-4) is a Large Language Model (LLM) released in March 2023 by OpenAI. LLMs are artificial intelligence (AI) models capable of higher-order reasoning and of generating human-like outputs [1]. Compared to traditional deep learning AI models, LLMs produce more coherent and relevant content due to their use of natural language processing [2]. The web-based interface of ChatGPT (a free-to-use AI system) has allowed public access to GPT-4 without the need for technical knowledge and has prompted rapid exploration of potential applications of AI in clinical medicine and medical education. For instance, AI has exhibited a rudimentary capability to analyse chest radiographs [3].

In medical education, AI can assist medical students with history-taking, with an AI bot posing as an imaginary patient [4].
This has been shown to be a good alternative to bedside teaching, especially within the early clinical years. However, AI has not yet been adopted into mainstream medical education, despite the increasing number of research papers highlighting its potential [5]. This is partly due to a lack of high-level research demonstrating the accuracy and efficacy of AI in medical education [5].

Perhaps understandably, the adoption of AI into medical practice and education has been met with apprehension by many members of the medical community. Concerns have been raised regarding patient privacy and bias within source data [6]. In addition, a nationwide survey of medical students raised concerns that AI could damage doctor-patient relationships, devalue the medical profession and potentially lead to unemployment amongst doctors [7]. However, the majority of those surveyed agreed that AI had the potential to improve clinicians’ access to information and patients’ access to healthcare services, and to reduce errors. Of those surveyed, 93.8% thought they should be given structured training on AI applications in healthcare settings [7].

GPT-4 has been tested against established professional examination standards such as the United States Medical Licensing Examination (USMLE), the Fellow of the Royal College of Surgeons Examination (FRCS) and the United States Bar [8–10]. These evaluations have yielded mixed success rates, which shows that AI performance varies between examinations. One avenue remains unexplored: the performance of GPT-4 on the Membership of the Royal College of Surgeons (MRCS) Part A examination. This is a mandatory written exam for postgraduate doctors within the UK healthcare system who wish to enter specialist surgical training. It has a pass rate of 30 to 40%, and the pass mark varies from 69 to 75% [11].
In contrast to the FRCS, the MRCS Part A is a broader, less specialised examination designed for doctors at an earlier stage of their surgical training. Given that GPT-4’s inability to pass the FRCS was in part due to its limitations in critical thinking, surgical principles and decision-making skills [6], it is possible that GPT-4 would be more likely to pass the MRCS Part A.

Exploring GPT-4's ability to pass the MRCS Part A can highlight the potential of LLMs in examination scenarios and help evaluate their limitations.

Aims

To determine whether GPT-4 can pass the MRCS Part A and to evaluate whether it has potential for further use in medical education and examination settings.

Materials and Methods

The MRCS Part A examination is structured into two sections: Applied Basic Sciences (ABS) with 180 questions and Principles of Surgery in General (POSG) with 120 questions, giving a total of 300 questions. The examination adheres to a predefined curriculum breakdown published by the Intercollegiate Committee for Basic Surgical Examinations (ICBSE), which focuses heavily on Applied Surgical Anatomy (75 questions) but also includes a range of other topics, such as Applied Surgical Physiology (45 questions) and Common Surgical Conditions (45 questions) [11]. Candidates need to attain a pass mark in both papers to pass the examination.

A representative MRCS Part A examination was provided by TeachMeSurgery in Excel spreadsheet format [12]. It was chosen for its comprehensive multiple-choice question bank and its focus on surgical topics. The representative examination was based on the MRCS curriculum published by the ICBSE and was reviewed by senior authors at TeachMeSurgery. Each question was written as a Single Best Answer (SBA) with a clinical vignette and four potential answers.
Each question was processed via the web-based interface of ChatGPT Plus, which offered access to GPT-4, the most advanced LLM available at the time of this study, and which has been used to benchmark the capabilities of AI in other papers [1]. ChatGPT Plus costs USD $20 per month and utilizes GPT-4 ‘Advanced Data Analysis’, which allows documents and spreadsheets to be uploaded and analysed by ChatGPT. We uploaded the mock exam, with the four potential answers for each question, as an Excel spreadsheet. We then prompted ChatGPT to answer each question iteratively.

At the time of this study, GPT-4 did not have direct access to the internet and had been trained on a large dataset with a knowledge cutoff of September 2021.

Results

GPT-4 achieved a score of 87% (261/300) in the representative paper provided. The score was consistent across most subsections of the exam. In the Applied Basic Sciences paper the score was 87.2% (157/180), and in the Principles of Surgery in General paper the score was 86.7% (104/120). GPT-4 scored 100% in four of the eleven predefined curriculum areas: Pharmacology, Microbiology, Data Interpretation and Audit, and The Surgical Care of Children. GPT-4’s weakest performance was in the Medico-Legal Aspects of Surgical Practice, in which it scored 33.3%. A detailed breakdown is presented in Table 1.

Development of the prompt:

The prompt was developed carefully before full testing. The strategy was to craft the simplest possible prompt without providing any additional information to GPT-4. We told GPT-4 that we had uploaded a dataset of MRCS Part A examination questions in an Excel spreadsheet and asked it to provide the correct answer out of four potential choices. This initial prompt revealed several challenges, which the authors addressed in later versions.

Subsequent prompts made it explicit that the questions required selection of the single best answer, so that GPT-4 had to pick the most appropriate response.
Challenges related to batching questions were identified: GPT-4 tended to oversimplify, modify data, or guess the answer at random, often selecting 'A' without providing reasoning. Although GPT-4 always provided an answer in our preliminary testing, we made it explicit that GPT-4 was to answer every question, because the MRCS Part A exam does not employ negative marking and there are therefore no penalties for incorrect answers. To counteract the batching issues, we instructed GPT-4 not to modify any data and assessed each question one by one for accuracy.

No updates were made to GPT-4 during the testing phase, and separate chat sessions were used so that GPT-4 was evaluated independently of prior information. Each section (ABS and POSG) was evaluated separately, with correct and incorrect answers recorded, yielding percentage scores for each section and for the examination overall.

The specific prompt, which was agreed by all authors, is given in Appendix 1, together with GPT-4's subsequent responses. This format enabled the authors to iterate through each question. Four sample outputs by GPT-4 can be found in Appendix 2.
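The scoring arithmetic described above (correct answers recorded per section, then converted to percentages for each section and overall) can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the counts are those reported in the Results, the names `SectionResult`, `combine` and `meets_pass_mark` are hypothetical, and applying the upper end of the reported 69 to 75% pass-mark range to each paper separately is an assumption made for the sake of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionResult:
    """Tally of correct answers for one paper (hypothetical helper type)."""
    name: str
    correct: int
    total: int

    @property
    def percent(self) -> float:
        # Percentage score out of 100 for this section.
        return 100.0 * self.correct / self.total


def combine(sections: list[SectionResult]) -> SectionResult:
    """Sum the per-section tallies into an overall result."""
    return SectionResult(
        name="Overall",
        correct=sum(s.correct for s in sections),
        total=sum(s.total for s in sections),
    )


def meets_pass_mark(result: SectionResult, pass_mark: float) -> bool:
    """True if the percentage score reaches the given pass mark."""
    return result.percent >= pass_mark


# Counts as reported in the Results section.
abs_paper = SectionResult("Applied Basic Sciences", 157, 180)
posg_paper = SectionResult("Principles of Surgery in General", 104, 120)
overall = combine([abs_paper, posg_paper])

# Assumption: use the upper end of the reported 69-75% pass-mark range,
# and require it in both papers as well as overall.
STRICTEST_PASS_MARK = 75.0
passed = all(
    meets_pass_mark(r, STRICTEST_PASS_MARK)
    for r in (abs_paper, posg_paper, overall)
)

for r in (abs_paper, posg_paper, overall):
    print(f"{r.name}: {r.correct}/{r.total} = {r.percent:.1f}%")
print("Clears the strictest reported pass mark:", passed)
```

Under these assumptions both papers and the overall score sit above the highest pass mark in the reported range, which matches the conclusion drawn in the text.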
Table 1: Breakdown of GPT-4’s results on the Membership of the Royal College of Surgeons (MRCS) Part A examination by paper and category

Applied Basic Sciences Total                           87.2% (157/180)
  Applied Surgical Anatomy                             82.5% (99/120)
  Applied Surgical Pathology                           91.7% (33/36)
  Pharmacology                                         100.0% (9/9)
  Microbiology                                         100.0% (7/7)
  Imaging                                              80.0% (4/5)
  Data Interpretation & Audit                          100.0% (5/5)

Principles of Surgery in General Total                 86.7% (104/120)
  Common Congenital and Acquired Surgical Conditions   86.7% (39/45)
  Pre-op Management                                    97.1% (34/35)
  Assessment & Management of Trauma                    76.7% (23/30)
  Surgical Care of Children                            100.0% (7/7)
  Medico-Legal Aspects of Surgical Practice            33.3% (1/3)

Discussion

This study showed that GPT-4 is able to pass the MRCS Part A, a UK postgraduate surgical exam. GPT-4 was less successful at answering questions on UK guidelines and on the management of conditions, as shown in Appendix 2. It was, however, very proficient with factual questions, especially those on anatomy, pharmacology and data analysis. This is likely because such questions have only one correct answer among the options and no further clinical reasoning is needed to choose the most appropriate one.

This contrasts with the work by Saad et al. [9], who demonstrated that GPT-4 lacked the clinical expertise to pass the FRCS Orthopaedic Part A examination. However, the FRCS requires a higher standard of knowledge than the MRCS examinations, which are aimed at professionals in the early years of their career. The FRCS (Ortho) is sat by surgeons with at least 10 years of clinical experience prior to becoming a consultant in the United Kingdom. Saad et al. [9] noted that GPT-4 struggled with the SBA format. This could be due to the nature of SBA questions: multiple answers can technically be correct, but identifying the single best answer may require clinical experience, knowledge and interpretation of a scenario [13].
In contrast to the MRCS, the questions in the FRCS examination are more complex and feature a greater number of distractors. Consequently, tackling FRCS questions requires a deeper reliance on both clinical expertise and theoretical knowledge.

GPT-4 has been trained on a large dataset but, as far as we are aware, has not been designed specifically for medical examinations or for the MRCS Part A in particular. Future investigations could explore the potential benefits of training the model on a sample of mock exam questions or of using different prompts for the different subsections of the examination. As with human students, refining theoretical knowledge through exam practice is crucial, particularly for formats like SBA questions. With further appropriate training, GPT-4's performance on future exams could potentially be enhanced. GPT-4's performance could also improve if it were able to access up-to-date UK clinical guidelines. However, at the time of writing, GPT-4 only has access to knowledge up to September 2021.

GPT-4 can pass the MRCS Part A and is also able to give the reasons for its answers. This could have implications for medical education, such as helping students review questions with an explanation of why they answered a question incorrectly. In this study, GPT-4's knowledge base was accurate overall and most reliable in pharmacology and microbiology, where it scored 100%. Scores in most sections were close to 100%, and an updated GPT-4 could have further uses in the future.

Strengths and Limitations:

One of the key strengths of this paper is that we carefully selected our prompts prior to full testing of the paper. Careful selection of a prompt is imperative for achieving meaningful output from GPT-4. A comprehensive evaluation of the prompt was conducted to ensure that GPT-4 was tested to its maximum potential.
A significant limitation is that the official MRCS Part A exam includes five options for each question, whereas the questions in this study included only four, which could have made the exam easier for GPT-4. The representative examination used in this study was text-based: unlike the official exam, none of the questions required GPT-4 to interpret images, such as prosections or radiographs, or lab results, such as blood tests. This study only investigates GPT-4’s performance on the MRCS Part A examination; candidates must also pass the Part B examination to be fully certified. Part B extensively tests communication and clinical skills, which cannot currently be assessed in the context of GPT-4 or other LLMs.

Future studies

Future studies could investigate the performance of other LLMs, such as PaLM 2 [14], on the MRCS Part A examination. The ability of GPT-4 to generate questions for revision could also be investigated, although the validity of these questions would need robust testing.

Conclusion

GPT-4 can pass a representative MRCS Part A paper and can be used as a tool for medical education to help students understand the rationale behind questions. This paper highlights the strengths and weaknesses of LLMs in sitting clinical examinations and how they can be improved in future iterations.

Declarations

Data is provided within the manuscript and supplementary information files.

Acknowledgements: The authors would like to express thanks to TeachMeSurgery for providing mock MRCS Part A questions free of charge. TeachMeSurgery was not involved in the design or conduct of the study.

Competing Interests: None of the authors have any competing interests to declare.

Author Contribution: All authors reviewed the manuscript.

References

1. Introducing ChatGPT Plus. Accessed: October 5, 2023. https://openai.com/blog/chatgpt-plus.
2. Bubeck S, Chandrasekaran V, Eldan R, et al.: Sparks of Artificial General Intelligence: Early experiments with GPT-4. 2023. 10.48550/arXiv.2303.12712
3. Akhter Y, Singh R, Vatsa M: AI-based radiodiagnosis using chest X-rays: A review. Front Big Data. 2023, 6:1120989. 10.3389/fdata.2023.1120989
4. Co M, John Yuen TH, Cheung HH: Using clinical history taking chatbot mobile app for clinical bedside teachings – A prospective case control study. Heliyon. 2022, 8:e09751. 10.1016/j.heliyon.2022.e09751
5. Sun L, Yin C, Xu Q, Zhao W: Artificial intelligence for healthcare and medical education: a systematic review. Am J Transl Res. 2023, 15:4820–8.
6. Cooper A, Rodman A: AI and Medical Education - A 21st-Century Pandora’s Box. N Engl J Med. 2023, 389:385–7. 10.1056/NEJMp2304993
7. Civaner MM, Uncu Y, Bulut F, Chalil EG, Tatli A: Artificial intelligence in medical education: a cross-sectional needs assessment. BMC Med Educ. 2022, 22:772. 10.1186/s12909-022-03852-3
8. Kung TH, Cheatham M, Medenilla A, et al.: Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023, 2:e0000198. 10.1371/journal.pdig.0000198
9. Saad A, Iyengar KP, Kurisunkal V, Botchu R: Assessing ChatGPT’s ability to pass the FRCS orthopaedic part A exam: A critical analysis. Surg J R Coll Surg Edinb Irel. 2023, 21:263–6. 10.1016/j.surge.2023.07.001
10. Katz DM, Bommarito MJ, Gao S, Arredondo P: GPT-4 Passes the Bar Exam. 2023. 10.2139/ssrn.4389233
11. Intercollegiate Committee for Basic Surgical Examinations 2018/19 Annual Report [Internet]. Intercollegiate Committee for Basic Surgical Examinations.
12. TeachMeSurgery - Making Surgery Simple. TeachMeSurgery. Accessed: October 5, 2023. https://teachmesurgery.com/.
13. Mirbahai L, W Adie J: Applying the utility index to review single best answer questions in medical education assessment. Arch Epidemiol Public Health. 2020, 2:. 10.15761/AEPH.1000113
14. Google AI PaLM 2. Google AI. Accessed: October 5, 2023.
https://ai.google/discover/palm2/.

Additional Declarations

No competing interests reported.

Supplementary Files

SupplementaryMaterial.docx

Author affiliations

Ibrahim Inzarul Haq (corresponding author), Siddarth Raj, Arun O'Sullivan, Farhan Syed: University Hospitals Coventry and Warwickshire NHS Trust
Ali Ridha, Imran Ahmed, Chetan Khatri: University of Warwick
\u003ctr\u003e\n \u003ctd width=\"50.080256821829856%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;Pre-op Management\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"49.919743178170144%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;97.1% (34/35)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"50.080256821829856%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;Assessment \u0026amp; Management of Trauma\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"49.919743178170144%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;76.7% (23/30)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"50.080256821829856%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;Surgical Care of Children\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"49.919743178170144%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;100.0% (7/7)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"50.080256821829856%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;Medico-Legal Aspects of Surgical Practice\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"49.919743178170144%\" valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;33.3% (1/3)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study showed that GPT-4 is able to pass the MRCS Part A, a UK postgraduate surgical exam. GPT-4 was less successful at answering questions on UK guidelines and on the management of conditions, as shown in Appendix 2. However, GPT-4 was very proficient with factual questions, especially those on anatomy, pharmacology and data analysis. This is likely because such questions have only one correct answer among the options and need no further clinical reasoning to choose the most appropriate answer.\u003c/p\u003e\n\u003cp\u003eThis contrasts with the work by Saad et al. 
[\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e], who demonstrated that GPT-4 lacked the clinical expertise to pass the FRCS Orthopaedic Part A examination. However, the FRCS requires a higher standard of knowledge than the MRCS examinations, which are aimed at professionals in the early years of their careers. The FRCS (Ortho) is sat by surgeons with at least 10 years of clinical experience, prior to becoming a consultant in the United Kingdom. Saad et al. [\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e] noted that GPT-4 struggled with the SBA format. This could be due to the nature of SBA questions: multiple answers can technically be correct, but choosing the single best answer may require clinical experience, knowledge and interpretation of a scenario [\u003cspan class=\"CitationRef\"\u003e13\u003c/span\u003e]. Compared with the MRCS, the questions in the FRCS examination are more complex and feature a greater number of distractors. Consequently, tackling FRCS questions requires a deeper reliance on both clinical expertise and theoretical knowledge.\u003c/p\u003e\n\u003cp\u003eGPT-4 has been trained on a large dataset but, as far as we are aware, has not been designed specifically for medical examinations or for the MRCS Part A in particular. Future investigations could explore the potential benefits of training the model on a sample of mock exam questions or of using different prompts for the different subsections of the examination. As with human students, refining theoretical knowledge through exam practice is crucial, particularly for formats like SBA questions. With further appropriate training, GPT-4\u0026apos;s performance on forthcoming exams could potentially be enhanced. GPT-4\u0026apos;s performance could also improve if it were able to access up-to-date UK clinical guidelines. 
However, at the time of writing, GPT-4 only has access to knowledge up to September 2021.\u003c/p\u003e\n\u003cp\u003eGPT-4 can pass the MRCS Part A and is also able to give the reasons for its answers. This could have implications for medical education, such as helping students review questions with an explanation of why they answered incorrectly. In this study, the knowledge base of GPT-4 was accurate overall and was most reliable in pharmacology and microbiology, in which GPT-4 achieved 100%. Nine of the eleven curriculum areas scored 80% or higher, and an updated GPT-4 could have further uses in the future.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStrengths and Limitations:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOne of the key strengths of this paper is that we carefully selected our prompts prior to full testing of the paper. Careful selection of a prompt is imperative for achieving meaningful output from GPT-4. A comprehensive evaluation of the prompt was conducted to ensure that GPT-4 was tested to its maximum potential.\u003c/p\u003e\n\u003cp\u003eA significant limitation is that the official MRCS Part A exam includes five options for each question, whereas the questions in this study included only four, which could have made the exam easier for GPT-4. The representative examination used in this study was primarily text-based: unlike the official exam, none of the questions required GPT-4 to interpret images, such as prosections or radiographs, or laboratory results, such as blood tests. This study investigates only GPT-4\u0026rsquo;s performance on the MRCS Part A examination; candidates must also pass the Part B examination to be fully certified. 
The latter extensively tests communication and clinical skills, which cannot currently be assessed in the context of GPT-4 or other LLMs.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFuture studies\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFuture studies could also investigate the performance of other LLMs, such as PaLM 2 [\u003cspan class=\"CitationRef\"\u003e14\u003c/span\u003e], on the MRCS Part A examination. The ability of GPT-4 to generate questions for revision could also be investigated; however, the validity of these questions would need robust testing.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eGPT-4 can pass a representative MRCS Part A paper and can be used as a tool for medical education to help students understand the rationale behind questions. This paper highlights the strengths and weaknesses of LLMs in sitting clinical examinations and how they can be improved in future iterations.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eData is provided within the manuscript and supplementary information files.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors would like to thank TeachMeSurgery for providing mock MRCS Part A questions free of charge. TeachMeSurgery was not involved in the design or conduct of the study.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone of the authors have any competing interests to declare.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contribution\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll authors reviewed the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eIntroducing ChatGPT Plus. Accessed: October 5, 2023. https://openai.com/blog/chatgpt-plus. 
\u003c/li\u003e\n\u003cli\u003eBubeck S, Chandrasekaran V, Eldan R, et al.: Sparks of Artificial General Intelligence: Early experiments with GPT-4. 2023. 10.48550/arXiv.2303.12712\u003c/li\u003e\n\u003cli\u003eAkhter Y, Singh R, Vatsa M: AI-based radiodiagnosis using chest X-rays: A review. Front Big Data. 2023, 6:1120989. 10.3389/fdata.2023.1120989\u003c/li\u003e\n\u003cli\u003eCo M, John Yuen TH, Cheung HH: Using clinical history taking chatbot mobile app for clinical bedside teachings \u0026ndash; A prospective case control study. Heliyon. 2022, 8:e09751. 10.1016/j.heliyon.2022.e09751\u003c/li\u003e\n\u003cli\u003eSun L, Yin C, Xu Q, Zhao W: Artificial intelligence for healthcare and medical education: a systematic review. Am J Transl Res. 2023, 15:4820\u0026ndash;8. \u003c/li\u003e\n\u003cli\u003eCooper A, Rodman A: AI and Medical Education - A 21st-Century Pandora\u0026rsquo;s Box. N Engl J Med. 2023, 389:385\u0026ndash;7. 10.1056/NEJMp2304993\u003c/li\u003e\n\u003cli\u003eCivaner MM, Uncu Y, Bulut F, Chalil EG, Tatli A: Artificial intelligence in medical education: a cross-sectional needs assessment. BMC Med Educ. 2022, 22:772. 10.1186/s12909-022-03852-3\u003c/li\u003e\n\u003cli\u003eKung TH, Cheatham M, Medenilla A, et al.: Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023, 2:e0000198. 10.1371/journal.pdig.0000198\u003c/li\u003e\n\u003cli\u003eSaad A, Iyengar KP, Kurisunkal V, Botchu R: Assessing ChatGPT\u0026rsquo;s ability to pass the FRCS orthopaedic part A exam: A critical analysis. Surg J R Coll Surg Edinb Irel. 2023, 21:263\u0026ndash;6. 10.1016/j.surge.2023.07.001\u003c/li\u003e\n\u003cli\u003eKatz DM, Bommarito MJ, Gao S, Arredondo P: GPT-4 Passes the Bar Exam. 2023. 10.2139/ssrn.4389233\u003c/li\u003e\n\u003cli\u003eIntercollegiate Committee for Basic Surgical Examinations 2018/19 Annual Report [Internet]. Intercollegiate Committee for Basic Surgical Examinations. 
\u003c/li\u003e\n\u003cli\u003eTeachMeSurgery - Making Surgery Simple. TeachMeSurgery. Accessed: October 5, 2023. https://teachmesurgery.com/. \u003c/li\u003e\n\u003cli\u003eMirbahai L, W Adie J: Applying the utility index to review single best answer questions in medical education assessment. Arch Epidemiol Public Health. 2020, 2:. 10.15761/AEPH.1000113\u003c/li\u003e\n\u003cli\u003eGoogle AI PaLM 2. Google AI. Accessed: October 5, 2023. https://ai.google/discover/palm2/. \u003cstrong\u003e\u003c/strong\u003e\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4003159/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4003159/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eIntroduction\u003c/strong\u003e:\u003c/p\u003e\n\u003cp\u003eOpenAI’s latest iteration of a Large Language Model (LLM); GPT-4 (Generative Pre-trained Transformer 4) has demonstrated its proficiency against various professional examination standards like the USMLE (United States Medical Licensing Examination), FRCS 
(Fellowship of the Royal Colleges of Surgeons) and the United States Bar. However, GPT-4’s capability with the MRCS (Membership of the Royal College of Surgeons) Part A has not yet been investigated.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethodology\u003c/strong\u003e:\u003c/p\u003e\n\u003cp\u003eA representative MRCS Part A examination, prepared and provided by “TeachMeSurgery” and based on the MRCS Intercollegiate Curriculum, was used to assess GPT-4's performance. Each question was processed via the web-based interface of ChatGPT (Chat Generative Pre-trained Transformer) Plus.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults\u003c/strong\u003e:\u003c/p\u003e\n\u003cp\u003eGPT-4 scored 87.2% on Applied Basic Sciences (157/180 questions) and 86.7% on Principles of Surgery in General (104/120 questions), achieving an overall score of 261/300 (87%), which is above the typical passing threshold. GPT-4 scored 100% in four of the eleven predefined curriculum areas: Pharmacology, Microbiology, Data Interpretation and Audit, and The Surgical Care of Children. 
GPT-4’s weakest performance was in the Medico-Legal Aspects of Surgical Practice, in which it scored 33.3%.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusion\u003c/strong\u003e:\u003c/p\u003e\n\u003cp\u003eGPT-4 successfully passed the mock MRCS Part A without any specialised preparatory training, however further research could look at integrating the GPT-4 model to enhance a trainee surgeon’s learning and use it as an effective tool to deliver high quality \u0026amp; efficient patient care.\u003c/p\u003e","manuscriptTitle":"Performance of OpenAI’s GPT-4 in a mock MRCS Part A Examination","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-06 14:55:44","doi":"10.21203/rs.3.rs-4003159/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a4508ce5-bd55-452a-a880-0360e59e655b","owner":[],"postedDate":"March 6th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-07-26T13:28:47+00:00","versionOfRecord":[],"versionCreatedAt":"2024-03-06 14:55:44","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4003159","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4003159","identity":"rs-4003159","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
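As a cross-check on the Results, the percentages in Table 1 can be recomputed from the raw counts. The Python sketch below is not part of the authors' analysis: the dictionary and function names are illustrative assumptions, and the counts are transcribed from Table 1 as published. One observation from doing so: the Applied Basic Sciences category denominators, as listed, sum to 182 rather than the reported 180. The sketch surfaces this discrepancy without attempting to resolve it.

```python
# (correct, total) per category, transcribed from Table 1 of the paper.
# Names such as ABS, PSOG, pct and summarise are illustrative, not the authors'.
ABS = {
    "Applied Surgical Anatomy": (99, 120),
    "Applied Surgical Pathology": (33, 36),
    "Pharmacology": (9, 9),
    "Microbiology": (7, 7),
    "Imaging": (4, 5),
    "Data Interpretation & Audit": (5, 5),
}
PSOG = {
    "Common Congenital and Acquired Surgical Conditions": (39, 45),
    "Pre-op Management": (34, 35),
    "Assessment & Management of Trauma": (23, 30),
    "Surgical Care of Children": (7, 7),
    "Medico-Legal Aspects of Surgical Practice": (1, 3),
}


def pct(correct, total):
    """Return the percentage correct, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def summarise(section):
    """Sum the correct answers and the listed question counts of a section."""
    correct = sum(c for c, _ in section.values())
    listed_total = sum(t for _, t in section.values())
    return correct, listed_total


if __name__ == "__main__":
    for name, section in (("ABS", ABS), ("PSOG", PSOG)):
        correct, listed_total = summarise(section)
        print(f"{name}: {correct} correct, {listed_total} questions listed")
        for category, (c, t) in section.items():
            print(f"  {category}: {pct(c, t)}% ({c}/{t})")
    # Overall score as reported in the paper: 261/300.
    print(f"Overall: {pct(261, 300)}%")
```

The per-category percentages and the number of correct answers per paper (157 and 104) agree with the reported figures; only the listed Applied Basic Sciences denominators differ from the stated paper total.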

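The one-question-at-a-time procedure described under "Development of the prompt" can be sketched in Python as follows. This is a hypothetical illustration only: the authors worked manually in the ChatGPT Plus web interface, their actual prompt is in Appendix 1 and is not reproduced here, and the class name, function names and instruction wording below are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """One Single Best Answer item (hypothetical structure)."""
    number: int
    stem: str
    options: dict  # option letter -> option text


def build_prompt(q):
    """Build a prompt for a single question.

    The instruction wording is illustrative, not the authors' prompt. It
    mirrors the constraints the paper describes: single best answer, no
    modification of the data, and an answer is always required.
    """
    lines = [
        "This is a Single Best Answer question from a mock MRCS Part A paper.",
        "Do not modify the question or the options.",
        "Choose exactly one option, even if unsure, and explain your reasoning.",
        "",
        f"Question {q.number}: {q.stem}",
    ]
    lines += [f"{letter}. {text}" for letter, text in sorted(q.options.items())]
    return "\n".join(lines)


def score(answers, key):
    """Count answers matching the key; unanswered questions count as wrong."""
    correct = sum(1 for number, letter in key.items() if answers.get(number) == letter)
    return correct, len(key)


if __name__ == "__main__":
    example = Question(
        number=1,
        stem="<clinical vignette text>",
        options={"A": "<option>", "B": "<option>", "C": "<option>", "D": "<option>"},
    )
    print(build_prompt(example))
    print(score({1: "A", 2: "C"}, {1: "A", 2: "B"}))
```

Building one prompt per question reflects the paper's finding that batching led GPT-4 to oversimplify, alter data, or guess; scoring without a penalty for wrong answers matches the exam's lack of negative marking.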

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-23T02:00:01.238055+00:00

License: CC-BY-4.0